Paper Title

Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep QNetworks

Authors

Adrian Ly, Richard Dazeley, Peter Vamplew, Francisco Cruz, Sunil Aryal

Abstract

The Deep Q-Networks algorithm (DQN) was the first reinforcement learning algorithm to use deep neural networks to successfully surpass human-level performance in a number of Atari learning environments. However, divergent and unstable behaviour has been a long-standing issue in DQNs. This unstable behaviour is often characterised by overestimation of the $Q$-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving the agent's training performance. However, agents are often highly sensitive to the selection of the multi-step update horizon ($n$), and our empirical experiments show that a poorly chosen static value for $n$ can in many cases lead to worse performance than single-step DQN. Inspired by the success of $n$-step DQN and the effect that multi-step updates have on overestimation bias, this paper proposes a new algorithm that we call `Elastic Step DQN' (ES-DQN). It dynamically varies the step-size horizon in multi-step updates based on the similarity of the states visited. Our empirical evaluation shows that ES-DQN outperforms $n$-step DQN with fixed $n$ updates, Double DQN and Average DQN in several OpenAI Gym environments, while at the same time alleviating the overestimation bias.
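For context on the multi-step updates the abstract refers to: in $n$-step DQN, the single-step bootstrap target is replaced by a discounted sum of the next $n$ rewards plus a bootstrap from the $n$-th successor state. A minimal sketch of that target computation is below; the function name and argument layout are illustrative, not taken from the paper.

```python
def n_step_target(rewards, gamma, bootstrap_q, done):
    """n-step TD target: sum_{k=0}^{n-1} gamma^k * r_k + gamma^n * max_a Q(s_{t+n}, a).

    `rewards`     -- the n rewards observed along the trajectory segment
    `gamma`       -- discount factor
    `bootstrap_q` -- max_a Q(s_{t+n}, a) from the target network
    `done`        -- True if the episode terminated within the segment,
                     in which case no bootstrap term is added
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r          # discounted reward at offset k
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_q  # bootstrap after n steps
    return target
```

With a fixed $n$ this segment length never changes; ES-DQN's contribution, per the abstract, is to choose the segment length dynamically from the similarity of the states visited rather than holding it constant.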
