From mimic to counteract: a two-stage reinforcement learning algorithm for Google research football

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Deep reinforcement learning has proven effective in a variety of video games, including Atari games, StarCraft II, Google Research Football (GRF), and Dota 2. We participated in the 2022 IEEE Conference on Games Football AI Competition and ranked in the top eight. Despite recent progress, building agents for GRF remains difficult because of multi-agent coordination, sparse rewards, and a stochastic environment. To address these issues and perform well in the competition, we devised a reinforcement learning algorithm that combines deep reinforcement learning from demonstrations with policy distillation. Specifically, we propose a two-stage algorithm named mimic-to-counteract reinforcement learning (MCRL). Using the historical game logs of opponents encountered during the warm-up session, we first construct partner agents that function like human sparring partners: they imitate opponents with diverse styles of play, allowing the primary agent to practice against the range of policies it may face in real matches. We then train numerous mentor agents capable of restraining these sparring partners, distill their policies, and amalgamate them to train a strong primary agent. Empirical results show that MCRL efficiently searches for valuable strategies with stable updates and balances policy iteration against policy-style deviation. The primary agent learns diverse yet coordinated counteracting strategies and ranked in the top eight in the competition.
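At a high level, the abstract describes a two-stage pipeline: first imitate opponents from their game logs to obtain sparring partners, then train mentor policies against those partners and distill them into a single primary agent. The sketch below is a minimal, hypothetical illustration of that structure only; it is not the authors' implementation (see the linked repository for that). The function names, the behavioural-cloning loss in stage one, and the KL-based distillation loss in stage two are assumptions chosen for clarity.

```python
# Hypothetical sketch of the two-stage structure described in the abstract.
# All names and loss choices are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def train_partner_from_demos(partner, demo_batches, optimizer):
    """Stage 1 ("mimic"): fit a sparring-partner policy to one opponent's
    historical game logs with a simple behavioural-cloning loss (one possible
    way to learn from demonstrations)."""
    for obs, actions in demo_batches:  # obs: float tensor, actions: class indices
        logits = partner(obs)
        loss = F.cross_entropy(logits, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return partner


def distill_mentors_into_primary(primary, mentors, obs_batches, optimizer, weights=None):
    """Stage 2 ("counteract"): amalgamate several mentor policies (each trained
    to restrain a sparring partner) into one primary agent by minimising the KL
    divergence between each mentor's action distribution and the primary's."""
    weights = weights or [1.0 / len(mentors)] * len(mentors)
    for obs in obs_batches:
        primary_logp = F.log_softmax(primary(obs), dim=-1)
        loss = torch.zeros(())
        for w, mentor in zip(weights, mentors):
            with torch.no_grad():
                mentor_p = F.softmax(mentor(obs), dim=-1)
            # KL(mentor || primary), averaged over the batch
            loss = loss + w * F.kl_div(primary_logp, mentor_p, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return primary
```

In the full system, the mimic stage would use the paper's deep-reinforcement-learning-from-demonstrations machinery rather than plain behavioural cloning, and the counteract stage would interleave distillation with on-policy training; the sketch only conveys how the two stages fit together.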




Code availability

The source code and data have been made available at https://github.com/zjj1224073665/football-distillation-SEED.


Acknowledgements

This study was partially supported by the National Natural Science Foundation of China Youth Fund (62306135), the Ministry of Education Youth Fund (23YJC630156) and the Jiangsu Provincial Youth Fund (BK20230783).

Author information


Contributions

JZ formulated the research problem, designed the method, implemented it, and performed the experiments. JZ carried out the data analysis. JZ, JL, XZ, and YL analysed the results. JZ, XZ, and YS read and approved the final manuscript.

Corresponding authors

Correspondence to Xianzhong Zhou or Yuxiang Sun.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval

This article does not involve human subjects for data collection; therefore, ethical approval was not required.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhao, J., Lin, J., Zhang, X. et al. From mimic to counteract: a two-stage reinforcement learning algorithm for Google research football. Neural Comput & Applic 36, 7203–7219 (2024). https://doi.org/10.1007/s00521-024-09455-x



  • DOI: https://doi.org/10.1007/s00521-024-09455-x

