Please use this identifier to cite or link to this item: https://hdl.handle.net/10356/159633
Title: Intelligent trainer for Dyna-style model-based deep reinforcement learning
Authors: Dong, Linsen
Li, Yuanlong
Zhou, Xin
Wen, Yonggang
Guan, Kyle
Keywords: Engineering::Computer science and engineering
Issue Date: 2020
Source: Dong, L., Li, Y., Zhou, X., Wen, Y. & Guan, K. (2020). Intelligent trainer for Dyna-style model-based deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 32(6), 2758-2771. https://dx.doi.org/10.1109/TNNLS.2020.3008249
Project: NRF2017EWT-EP003-023 
NRF2015ENC-GDCR01001-003
BSEWWT2017_2_06
Journal: IEEE Transactions on Neural Networks and Learning Systems 
Abstract: Model-based reinforcement learning (MBRL) has been proposed as a promising solution to the high sampling cost of canonical RL: a learned system dynamics model generates synthetic data for policy training. The MBRL framework, however, is inherently limited by the convoluted process of jointly optimizing the control policy, learning the system dynamics, and sampling data from two sources, all governed by complicated hyperparameters. As such, the training process requires overwhelming manual tuning and is prohibitively costly. In this research, we propose a "reinforcement on reinforcement" (RoR) architecture that decomposes this convoluted task into two decoupled layers of RL. The inner layer is the canonical MBRL training process, formulated as a Markov decision process and called the training process environment (TPE). The outer layer is an RL agent, called the intelligent trainer, that learns an optimal hyperparameter configuration for the inner TPE. This decomposition provides much-needed flexibility to implement different trainer designs, an approach we refer to as "train the trainer." We propose and optimize two alternative trainer designs: 1) a unihead trainer and 2) a multihead trainer. The proposed RoR framework is evaluated on five tasks in the OpenAI Gym. Compared with three baseline methods, the proposed intelligent trainers deliver competitive autotuning performance, with up to 56% expected sampling-cost savings achieved without knowing the best hyperparameter configuration in advance. The trainer framework can easily be extended to other tasks that require costly hyperparameter tuning.
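Note: The two-layer RoR structure described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration only: the toy policy/model internals, the reward signal, the epsilon-greedy bandit used as the outer trainer, and the single tuned hyperparameter (the synthetic-data ratio) do not reflect the paper's actual implementation or action space.

import random

random.seed(0)

# --- toy stand-ins for the real MBRL components (assumed, not the paper's) --
class ToyPolicy:
    def __init__(self):
        self.quality = 0.0  # scalar proxy for policy performance

    def update(self, real_n, synth_n, model_acc):
        # Real samples always help; synthetic samples help in proportion
        # to how accurate the learned dynamics model currently is.
        self.quality += 0.01 * real_n + 0.01 * synth_n * model_acc

class ToyModel:
    def __init__(self):
        self.accuracy = 0.2

    def fit(self, real_n):
        # More real data -> better dynamics model (saturating).
        self.accuracy = min(1.0, self.accuracy + 0.002 * real_n)

# --- inner layer: the training process environment (TPE) --------------------
class TrainingProcessEnv:
    """One step = one Dyna-style MBRL training cycle under a chosen
    hyperparameter (here: the fraction of synthetic data in the batch)."""
    BATCH = 100
    REAL_COST = 0.005  # assumed per-real-sample penalty, reflecting the
                       # sampling-cost motivation in the abstract

    def __init__(self):
        self.policy, self.model = ToyPolicy(), ToyModel()
        self.last_quality = 0.0

    def step(self, synth_ratio):
        real_n = int(self.BATCH * (1 - synth_ratio))  # costly real samples
        synth_n = self.BATCH - real_n                 # cheap model samples
        self.model.fit(real_n)
        self.policy.update(real_n, synth_n, self.model.accuracy)
        improvement = self.policy.quality - self.last_quality
        self.last_quality = self.policy.quality
        # Reward trades off training progress against real-sampling cost.
        return improvement - self.REAL_COST * real_n

# --- outer layer: the intelligent trainer -----------------------------------
class IntelligentTrainer:
    """Toy epsilon-greedy bandit over candidate synthetic-data ratios."""
    def __init__(self, actions, eps=0.2):
        self.actions, self.eps = actions, eps
        self.value = {a: 0.0 for a in actions}
        self.count = {a: 0 for a in actions}

    def act(self):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value[a])

    def learn(self, action, reward):
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]

tpe, trainer = TrainingProcessEnv(), IntelligentTrainer([0.0, 0.4, 0.8])
for cycle in range(200):
    ratio = trainer.act()         # outer agent sets the hyperparameter
    reward = tpe.step(ratio)      # inner MBRL cycle runs under it
    trainer.learn(ratio, reward)  # outer agent updates from TPE feedback
print("learned action values:", trainer.value)

In this toy setting, the outer bandit gradually favors higher synthetic-data ratios once the dynamics model becomes accurate, mirroring the paper's point that an outer RL agent can autotune the inner loop's data-sourcing hyperparameter without knowing the best configuration in advance.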
URI: https://hdl.handle.net/10356/159633
ISSN: 2162-237X
DOI: 10.1109/TNNLS.2020.3008249
Schools: School of Computer Science and Engineering 
Rights: © 2020 IEEE. All rights reserved.
Fulltext Permission: none
Fulltext Availability: No Fulltext
Appears in Collections:SCSE Journal Articles
