Abstract
Deep reinforcement learning has proven effective in a variety of video games, such as Atari games, StarCraft II, Google Research Football (GRF), and Dota 2. We participated in the 2022 IEEE Conference on Games Football AI Competition and ranked in the top eight. Despite recent progress, building agents for GRF still suffers from multi-agent coordination, sparse rewards, and a stochastic environment. To address these issues and perform well in the competition, we devised a reinforcement learning algorithm that combines deep reinforcement learning from demonstrations with policy distillation. Specifically, we propose a two-stage algorithm named mimic-to-counteract reinforcement learning (MCRL). In the first stage, we use the historical game logs of opponents encountered during the warm-up session to build partner agents that function like human sparring partners: they simulate opponents with diverse styles of play, allowing the primary player to practice against the range of policies it may face in real competitions. In the second stage, we train numerous mentor agents capable of restraining these sparring partners, and then distill and amalgamate their policies to train a strong primary agent. Empirical results show that MCRL efficiently searches for valuable strategies with stable updates and balances policy iteration against policy-style deviation. Moreover, the primary agent learns diverse yet coordinated counteracting strategies and ranked in the top eight in the competition.
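To make the distillation step concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how several pretrained mentor (teacher) policies can be amalgamated into a single primary (student) policy by minimizing the KL divergence between teacher and student action distributions. The use of PyTorch, the network sizes, and the GRF "simple115"-style observation and action dimensions are assumptions made purely for illustration.

# Sketch of multi-teacher policy distillation; all shapes and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 115, 19  # assumed GRF observation/action sizes

def make_policy():
    # Small MLP that maps an observation to action logits.
    return nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(),
                         nn.Linear(256, N_ACTIONS))

teachers = [make_policy() for _ in range(3)]  # stand-ins for pretrained mentor policies
student = make_policy()                       # primary agent to be distilled
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distillation_loss(obs_batch):
    """KL(teacher || student), averaged over teachers and the batch."""
    student_logp = F.log_softmax(student(obs_batch), dim=-1)
    loss = 0.0
    for teacher in teachers:
        with torch.no_grad():
            teacher_p = F.softmax(teacher(obs_batch), dim=-1)
        loss = loss + F.kl_div(student_logp, teacher_p, reduction="batchmean")
    return loss / len(teachers)

# One illustrative update on a random batch of observations.
obs = torch.randn(64, OBS_DIM)
optimizer.zero_grad()
distillation_loss(obs).backward()
optimizer.step()

In practice, the observations would come from the environment or a replay buffer rather than random noise, and the distillation loss would typically be combined with the reinforcement learning objective.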
Code availability
The source code and data have been made available at https://github.com/zjj1224073665/football-distillation-SEED.
Acknowledgements
This study was partially supported by the National Natural Science Foundation of China Youth Fund (62306135), the Ministry of Education Youth Fund (23YJC630156) and the Jiangsu Provincial Youth Fund (BK20230783).
Author information
Contributions
The research problem, method design, implementation, and experiments were carried out by JZ. Data analysis was performed by JZ. JZ, JL, XZ, and YL analyzed the results. JZ, XZ, and YS read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
This article does not involve human subjects for data collection; therefore, ethical approval was not required.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, J., Lin, J., Zhang, X. et al. From mimic to counteract: a two-stage reinforcement learning algorithm for Google research football. Neural Comput & Applic 36, 7203–7219 (2024). https://doi.org/10.1007/s00521-024-09455-x