Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/150563
|Title:||Improving sample efficiency using attention in deep reinforcement learning||Authors:||Ong, Dorvin Poh Jie||Keywords:||Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence||Issue Date:||2021||Publisher:||Nanyang Technological University||Source:||Ong, D. P. J. (2021). Improving sample efficiency using attention in deep reinforcement learning. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/150563||Project:||SCSE20-0527||Abstract:||Reinforcement learning is becoming increasingly popular due to its string of feats in mainstream games such as DOTA2 and Go, as well as its applicability to many fields. It has shown the potential to exceed human-level performance in complicated environments and sequential decision-making problems. However, one limitation that has plagued reinforcement learning is its poor sample efficiency. Among the three paradigms of machine learning, reinforcement learning requires the most samples to produce a useful result. More samples mean more energy and time spent training a useful model, which is expensive. In this report, we conducted a rigorous study of the reinforcement learning field, implemented the Proximal Policy Optimization (PPO) algorithm, and attempted to improve the sample efficiency of reinforcement learning algorithms using self-attention models. Borrowing ideas from previous implementations of self-attention models, we experimented with variants of the Self-Attending Network (SAN) such as Channel-wise Self Attending (C-SAN) and Cross Attending Network (CAN), which is a combination of channel-column-wise and channel-row-wise attention. Our results showed that CAN was distinctly more sample efficient than the original SAN and the vanilla PPO (No Attention) model in the game of Pong. However, moving our implementation to Stable Baselines3 produced results that differ from our findings in the earlier experiments. 
We attribute the discrepancy in the results to differences between the PPO implementations. In the next experiment, we tested SAN, C-SAN and CAN on 49 Atari 2600 games. C-SAN was found to be better than the No Attention model by 15.36% on average, while CAN and SAN were found to be worse by 14.44% and 1.47% respectively. Based on the results, we hypothesize that self-attention models could potentially perform better in complex environments because a better state representation could facilitate learning a better policy. Further re-evaluation on more complex environments over a longer training duration showed potential in CAN, which managed to outperform the other models. However, a preliminary investigation into why self-attention works was inconclusive. Nevertheless, we provide some hypotheses to explain the effect of self-attention models.||URI:||https://hdl.handle.net/10356/150563||Fulltext Permission:||restricted||Fulltext Availability:||With Fulltext|
|Appears in Collections:||SCSE Student Reports (FYP/IA/PA/PI)|
Updated on May 17, 2022
Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.