TY - GEN
T1 - Rethinking Population-Assisted Off-policy Reinforcement Learning
AU - Zheng, Bowen
AU - Cheng, Ran
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/7/15
Y1 - 2023/7/15
AB - While off-policy reinforcement learning (RL) algorithms are sample efficient due to gradient-based updates and data reuse in the replay buffer, their limited exploration leaves them prone to converging to local optima. On the other hand, population-based algorithms offer a natural exploration strategy, but their heuristic black-box operators are inefficient. Recent algorithms have integrated these two methods, connecting them through a shared replay buffer. However, the effect of using diverse data from population optimization iterations on off-policy RL algorithms has not been thoroughly investigated. In this paper, we first analyze the use of off-policy RL algorithms in combination with population-based algorithms, showing that the use of population data could introduce an overlooked error and harm performance. To test this, we propose a uniform and scalable training design and conduct experiments with our tailored framework on robot locomotion tasks from OpenAI Gym. Our results substantiate that using population data in off-policy RL can cause instability during training and even degrade performance. To remedy this issue, we further propose a double replay buffer design that provides more on-policy data and show its effectiveness through experiments. Our results offer practical insights for training these hybrid methods.
KW - evolutionary reinforcement learning
KW - neuroevolution
KW - off-policy learning
UR - https://www.scopus.com/pages/publications/85167728613
DO - 10.1145/3583131.3590512
M3 - Conference article published in proceedings or book
AN - SCOPUS:85167728613
T3 - GECCO 2023 - Proceedings of the 2023 Genetic and Evolutionary Computation Conference
SP - 624
EP - 632
BT - GECCO 2023 - Proceedings of the 2023 Genetic and Evolutionary Computation Conference
PB - Association for Computing Machinery, Inc
T2 - 2023 Genetic and Evolutionary Computation Conference, GECCO 2023
Y2 - 15 July 2023 through 19 July 2023
ER -