Continuous control for robot based on deep reinforcement learning
Date of Issue: 2019
School of Electrical and Electronic Engineering
One of the main goals of artificial intelligence is to solve complex control problems with high-dimensional observation spaces. Recently, the combination of deep learning and reinforcement learning has made remarkable progress, achieving high-level performance in video and board games, 3D navigation, and robotic control. In this thesis, deep reinforcement learning algorithms are studied for robotic tasks with continuous action spaces. Firstly, we use deep deterministic policy gradient (DDPG) and hindsight experience replay (HER) with a simple binary reward to solve a multi-goal reinforcement learning task, in which a redundant manipulator learns a policy for reaching any given position. We then use DDPG with a shaped reward to train the redundant manipulator on the same task. Drawing on the idea of HER, we propose a $future$ and $random$ strategy that obtains additional goals and combines them with the shaped reward to generate new transitions, which helps to improve sample efficiency. After that, we use DDPG with prioritized experience replay for the trajectory tracking tasks of a SCARA robot and a mobile robot. Two training strategies, random referenced state initialization and early termination, are introduced to enable the robots to learn effectively from the referenced trajectories. Secondly, we focus on distributed deep reinforcement learning. We use the asynchronous advantage actor-critic (A3C) and synchronous advantage actor-critic (A2C) algorithms, both of which use multiple workers to collect transitions and compute gradients, to train the redundant manipulator on the multi-goal task. We propose a new reward function to optimize the reaching path of the end-effector, and we compare the performance of agents trained with different algorithms and reward functions.
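As an illustration of the goal-relabeling idea borrowed from HER, the following is a minimal sketch of a $future$-style strategy: for each transition, goals are resampled from positions achieved later in the same episode, and the reward is recomputed for the new goal. The transition layout, the function names, the tolerance `tol`, and the choice of negative distance as the shaped reward are all assumptions made for this sketch; they are not taken from the thesis.

```python
import numpy as np

def binary_reward(achieved, goal, tol=0.05):
    # Sparse reward: 0 when the achieved position is within tol of the goal, else -1.
    return 0.0 if np.linalg.norm(achieved - goal) <= tol else -1.0

def shaped_reward(achieved, goal):
    # Assumed shaped reward for illustration: negative Euclidean distance to the goal.
    return -float(np.linalg.norm(achieved - goal))

def relabel_future(episode, k=4, reward_fn=binary_reward, rng=None):
    """Create k extra transitions per step using goals achieved at the same or later steps.

    episode: list of dicts with keys 'obs', 'action', 'next_obs',
             'achieved_next' (position reached after the action) and 'goal'.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    T = len(episode)
    relabeled = []
    for t, tr in enumerate(episode):
        for _ in range(k):
            future_t = rng.integers(t, T)  # uniform over t, ..., T-1
            new_goal = episode[future_t]["achieved_next"]
            new_tr = dict(tr)
            new_tr["goal"] = new_goal
            new_tr["reward"] = reward_fn(tr["achieved_next"], new_goal)
            relabeled.append(new_tr)
    return relabeled

# Toy episode: an end-effector moving along a line, never reaching the original goal.
episode = [
    {"obs": np.array([0.1 * t, 0.0]), "action": np.array([0.1, 0.0]),
     "next_obs": np.array([0.1 * (t + 1), 0.0]),
     "achieved_next": np.array([0.1 * (t + 1), 0.0]),
     "goal": np.array([5.0, 5.0])}
    for t in range(5)
]
extra_sparse = relabel_future(episode, k=2)
extra_shaped = relabel_future(episode, k=2, reward_fn=shaped_reward)
```

Passing `shaped_reward` instead of `binary_reward` shows how the same relabeling can be combined with a dense reward; both sets of relabeled transitions would then be added to the replay memory alongside the original ones.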
Next, we propose a distributed framework for DDPG, in which synchronous workers generate transitions and compute gradients for the global network, while collecting workers only produce transitions for the shared replay memory using different policies and exploration noises. We use the proposed distributed DDPG with prioritized experience replay to train the SCARA robot and the mobile robot to track the same trajectories, which yields faster learning and smaller tracking errors than single-worker DDPG. Next, we study the proximal policy optimization (PPO) algorithm with generalized advantage estimation (GAE). We propose a distributed framework for PPO that runs multiple workers simultaneously to collect transitions for the global network. We then use this distributed PPO with GAE to train the redundant manipulator on the multi-goal task and compare it with the previous methods. After that, we use distributed PPO with GAE and the improved training strategies to train the mobile robot to track the trajectories. To improve training and sample efficiency, we propose a two-stage training strategy consisting of supervised pre-training followed by fine-tuning with distributed PPO. This two-stage strategy also achieves better tracking performance. We then introduce LSTM networks to represent the actor and critic. To solve the problem of inaccurate initial LSTM states, we use buffers to store the LSTM cell and hidden states, which are used to initialize each episode. With LSTM, the tracking performance of the mobile robot improves compared with distributed PPO using fully connected networks. Finally, we use deep reinforcement learning to train an autonomous vehicle to learn driving behaviors. Deep reinforcement learning provides an end-to-end method for autonomous driving by directly mapping high-dimensional raw sensory input to control command output.
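For reference, generalized advantage estimation computes each advantage as an exponentially weighted sum of one-step temporal-difference errors, $A_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The sketch below computes this with a backward recursion over a single rollout segment. The function name, the default values of `gamma` and `lam`, and the assumption that no episode terminates inside the segment are illustrative choices, not details from the thesis.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout segment.

    rewards:    r_0 ... r_{T-1}
    values:     V(s_0) ... V(s_{T-1}) from the critic
    last_value: V(s_T), used to bootstrap beyond the end of the segment
    Assumes no episode termination occurs inside the segment.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    returns = advantages + values[:-1]  # regression targets for the critic
    return advantages, returns

# With gamma = lam = 1 and a zero critic, advantages reduce to reward-to-go.
adv_mc, ret_mc = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
# With lam = 0, advantages reduce to the one-step TD errors.
adv_td, _ = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=0.0)
```

In a distributed setting such as the one described above, each worker could compute advantages for its own rollout in this way before the data are passed to the global network; how the thesis divides this work between workers is not specified here.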
We design a reward function that encourages the vehicle to drive smoothly along the road and to overtake other vehicles. We adopt a two-stage training strategy consisting of an imitation learning stage and a deep reinforcement learning stage. The imitation learning stage helps to address the exploration and sample efficiency problems of reinforcement learning. In the second training stage, we train the autonomous vehicle with DDPG and with its improved variant, TD3. We find that TD3 improves the driving performance of the autonomous vehicle.
Engineering::Electrical and electronic engineering::Control and instrumentation::Robotics