Title: Robust multi-agent team behaviors in uncertain environment via reinforcement learning
Authors: Yan, Kok Hong
Keywords: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
Issue Date: 2022
Publisher: Nanyang Technological University
Source: Yan, K. H. (2022). Robust multi-agent team behaviors in uncertain environment via reinforcement learning. Master's thesis, Nanyang Technological University, Singapore.
Abstract: Many state-of-the-art cooperative multi-agent reinforcement learning (MARL) approaches, such as MADDPG, COMA, and QMIX, focus mainly on performing well in idealized scenarios. In these scenarios, agents face environmental conditions and opponents similar to those encountered during training. The resulting policies are often fragile and brittle because they overfit to the training environment, so they cannot easily be deployed outside the laboratory. Adversarial learning is one way to train robust policies, but most such works have focused on single-agent RL and on adversarial updates to a static environment. The robust MARL works built on adversarial training target specialized settings. M3DDPG assumes the extreme case in which all other agents are adversarial, and Phan et al. consider agents that malfunction and turn adversarial. Many of these works trade away team coordination to achieve robustness, and little emphasis has been placed on maintaining good team coordination while ensuring robustness. This is a clear gap: robustness should be an explicit design objective of MARL algorithms alongside performance, rather than an afterthought. This work focuses on learning robust team policies that perform well even when the environment and opponent behaviour differ significantly from those seen during training. We propose the Signal-mediated Team Maxmin (STeaM) framework. STeaM is an end-to-end MARL framework that approximates the game-theoretic solution concept of team-maxmin equilibrium with a correlation device (TMECor), addressing both agent coordination and policy robustness. STeaM uses a pre-agreed signal to coordinate team actions. It approximates TMECor policies by combining consistency and diversity regularizations with a best-response gradient descent self-play procedure for equilibrium learning. Our experiments show that STeaM learns team agent policies that approximate TMECor well.
These policies consistently achieve higher rewards in adversarial and uncertain situations than policies produced by other state-of-the-art models. STeaM-produced policies also exhibit bounded performance degradation when tested against previously unseen policies.
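To make the role of the pre-agreed signal concrete, the sketch below works through a toy one-shot game. It is a hypothetical illustration, not STeaM itself. Two team agents each pick 0 or 1, and one adversary picks `b`. The team scores 1 only if both agents choose the same action and that action differs from `b`. A shared signal lets the team randomize over *joint* actions, which is the correlation device in TMECor. The team learns with multiplicative weights (Hedge) against a best-responding adversary, a standard textbook technique swapped in here in place of STeaM's neural self-play. The sketch then compares the result with the best the agents can do by mixing independently. The game, the function names, and the parameters `rounds` and `eta` are all assumptions made for illustration.

```python
import math

# Hypothetical toy game (illustration only, not from the thesis):
# two team agents each choose 0 or 1; one adversary chooses b in {0, 1}.
# The team scores 1 if both agents match each other AND differ from b.
JOINT_ACTIONS = [(0, 0), (0, 1), (1, 0), (1, 1)]
ADVERSARY_ACTIONS = [0, 1]


def team_payoff(joint, b):
    a1, a2 = joint
    return 1.0 if a1 == a2 and a1 != b else 0.0


def expected_payoff(dist, b):
    """Team's expected payoff when joint actions are drawn from dist."""
    return sum(p * team_payoff(j, b) for p, j in zip(dist, JOINT_ACTIONS))


def worst_case_value(dist):
    """Expected team payoff against the adversary's best response."""
    return min(expected_payoff(dist, b) for b in ADVERSARY_ACTIONS)


def correlated_maxmin(rounds=2000, eta=0.1):
    """Approximate the maxmin value over correlated team strategies.

    The team runs Hedge over joint actions (which a shared signal lets it
    sample jointly); the adversary best-responds every round. The
    time-averaged team strategy approaches the maxmin value.
    """
    weights = [1.0] * len(JOINT_ACTIONS)
    avg = [0.0] * len(JOINT_ACTIONS)
    for _ in range(rounds):
        total = sum(weights)
        dist = [w / total for w in weights]
        avg = [s + p / rounds for s, p in zip(avg, dist)]
        # Adversary picks the action that minimises the team's expected payoff.
        b = min(ADVERSARY_ACTIONS, key=lambda c: expected_payoff(dist, c))
        # Hedge update: boost joint actions that scored well against b.
        weights = [w * math.exp(eta * team_payoff(j, b))
                   for w, j in zip(weights, JOINT_ACTIONS)]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalise to avoid overflow
    return avg, worst_case_value(avg)


def independent_maxmin(steps=100):
    """Best worst-case value when each agent mixes independently (no signal)."""
    best = 0.0
    for i in range(steps + 1):
        for k in range(steps + 1):
            p, q = i / steps, k / steps  # P(agent 1 plays 1), P(agent 2 plays 1)
            dist = [(1 - p) * (1 - q), (1 - p) * q, p * (1 - q), p * q]
            best = max(best, worst_case_value(dist))
    return best


if __name__ == "__main__":
    avg, value = correlated_maxmin()
    print("correlated team strategy:", [round(p, 3) for p in avg])
    print("correlated maxmin value ~", round(value, 3))
    print("independent maxmin value ~", round(independent_maxmin(), 3))
```

With correlation, the averaged team strategy puts nearly all of its mass on (0, 0) and (1, 1) in roughly equal proportion, so its worst-case value comes out close to 0.5. Without a shared signal, the best independent mixing (each agent playing 1 with probability 0.5) guarantees only 0.25. This gap is what a coordination signal is meant to recover. STeaM itself targets sequential multi-agent environments with learned policies, which this one-shot toy does not capture.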
Schools: School of Computer Science and Engineering 
Rights: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: SCSE Theses

Files in This Item:
YanKokHong-MENG-SCSE-NTU-Thesis.pdf (2.71 MB, Adobe PDF)

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.