Paper Title
Refined Continuous Control of DDPG Actors via Parametrised Activation
Paper Authors
Paper Abstract
In this paper, we propose enhancing actor-critic reinforcement learning agents by parametrising the final actor layer, which produces the actions, in order to accommodate the behavioural discrepancy of different actuators under different load conditions during interaction with the environment. We propose branching the action-producing layer in the actor to learn the tuning parameters controlling the activation layer (e.g. Tanh and Sigmoid). The learned parameters are then used to create tailored activation functions for each actuator. We ran experiments on three OpenAI Gym environments, namely Pendulum-v0, LunarLanderContinuous-v2 and BipedalWalker-v2. Results show average increases of 23.15% and 33.80% in total episode reward for the LunarLanderContinuous-v2 and BipedalWalker-v2 environments, respectively. There was no significant improvement in the Pendulum-v0 environment, but the proposed method produces a more stable actuation signal than the state-of-the-art method. The proposed method allows the reinforcement learning actor to produce more robust actions that accommodate the discrepancy in the actuators' response functions. This is particularly useful for real-life scenarios where actuators exhibit different response functions depending on the load and the interaction with the environment. It also simplifies the transfer learning problem: instead of retraining the entire policy every time an actuator is replaced, only the parametrised activation layers need fine-tuning. Finally, the proposed method would allow better accommodation of biological actuators (e.g. muscles) in biomechanical systems.
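To make the branched output layer concrete, below is a minimal PyTorch sketch of one way a parametrised Tanh actor head could be structured: a shared backbone feeds two branches, one producing the raw per-actuator pre-activations and the other producing per-actuator tuning parameters that reshape the Tanh. The class name, network sizes, and the Softplus constraint on the learned parameter are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ParametrisedTanhActor(nn.Module):
    """Sketch of a DDPG-style actor whose final layer is branched:
    one branch outputs raw actions, the other learns per-actuator
    parameters used to tailor the Tanh activation (illustrative only)."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Branch 1: raw pre-activation, one value per actuator.
        self.action_head = nn.Linear(hidden_dim, action_dim)
        # Branch 2: per-actuator tuning parameters for the activation.
        self.param_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        h = self.backbone(state)
        raw = self.action_head(h)
        # Softplus keeps the learned slope positive; this constraint is an
        # assumption, not necessarily the paper's parametrisation.
        slope = nn.functional.softplus(self.param_head(h)) + 1e-3
        # Tailored activation: each actuator gets its own Tanh shape.
        return torch.tanh(slope * raw)

# Example with LunarLanderContinuous-v2 dimensions (8 states, 2 actuators).
actor = ParametrisedTanhActor(state_dim=8, action_dim=2)
action = actor(torch.randn(1, 8))
```

Because only the parameter branch (and its activation) is actuator-specific, transfer to a new actuator could in principle be handled by fine-tuning that branch while freezing the backbone, in line with the transfer-learning argument in the abstract.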