Abstract:
Multi-agent reinforcement learning is becoming increasingly relevant as a solution method for coordination problems between multiple decision-making entities. Current state-of-the-art methods rely on model-free approaches in which all agent interactions take place in a real environment. This can be a limiting factor due to the large number of trial-and-error interactions that are typically required to learn successful coordinated behaviours. One solution is to use environment models that can generate experience outside of the real environment. This can improve the sample efficiency of learning, as experience is no longer bound by potentially slow interactions with real environments. Furthermore, an environment model introduces an opportunity to improve upon random search methods by using planning to facilitate faster learning of optimal behaviours. A significant barrier to using model-based methods in real-world environments is the difficulty of defining explicit rules for their dynamics. Recent work has shown that deep neural networks can be used to model environments and improve both sample efficiency and solution quality on single-agent problems in Atari 2600 games. This research explores the application of these methods to the more complex environments of the StarCraft II Multi-agent Challenge (SMAC), a recently developed benchmark for multi-agent decision making. Two model-based approaches are presented: 1) reinforcement learning with supplementary experience derived from a learned StarCraft II model and 2) environment exploration guided by model-based planning. Each approach is evaluated on two SMAC scenarios representing simple and complex environment dynamics. The planning approach was found to improve sample efficiency over the state-of-the-art model-free method on the simple environment while matching its sample efficiency on the complex environment. Reinforcement learning using model experience underperformed the model-free method on both scenarios.
The results show that model-based planning methods are feasible in multi-agent settings and could have wider applicability with more accurate environment models.