Paper Title

Elastic Step DQN: A novel multi-step algorithm to alleviate overestimation in Deep QNetworks

Authors

Adrian Ly, Richard Dazeley, Peter Vamplew, Francisco Cruz, Sunil Aryal

Abstract

The Deep Q-Networks algorithm (DQN) was the first reinforcement learning algorithm to use deep neural networks to successfully surpass human-level performance in a number of Atari learning environments. However, divergent and unstable behaviour has been a long-standing issue in DQNs. This unstable behaviour is often characterised by overestimation of the $Q$-values, commonly referred to as the overestimation bias. To address the overestimation bias and the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-step updates have been shown to drastically reduce unstable behaviour while improving the agent's training performance. However, agents are often highly sensitive to the selection of the multi-step update horizon ($n$), and our empirical experiments show that a poorly chosen static value for $n$ can in many cases lead to worse performance than single-step DQN. Inspired by the success of $n$-step DQN and the effect that multi-step updates have on overestimation bias, this paper proposes a new algorithm that we call `Elastic Step DQN' (ES-DQN). It dynamically varies the step-size horizon in multi-step updates based on the similarity of the states visited. Our empirical evaluation shows that ES-DQN outperforms $n$-step DQN with fixed $n$ updates, Double DQN and Average DQN in several OpenAI Gym environments, while at the same time alleviating the overestimation bias.
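For context on the multi-step updates the abstract refers to: in $n$-step DQN, the single-step bootstrap target is replaced by a discounted sum of the next $n$ rewards plus a bootstrap from the $n$-th successor state. A minimal sketch of that target computation is below; the function name and argument layout are illustrative, not taken from the paper.

```python
def n_step_target(rewards, gamma, bootstrap_q, done):
    """n-step TD target: sum_{k=0}^{n-1} gamma^k * r_k + gamma^n * max_a Q(s_{t+n}, a).

    `rewards`     -- the n rewards observed along the trajectory segment
    `gamma`       -- discount factor
    `bootstrap_q` -- max_a Q(s_{t+n}, a) from the target network
    `done`        -- True if the episode terminated within the segment,
                     in which case no bootstrap term is added
    """
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r          # discounted reward at offset k
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_q  # bootstrap after n steps
    return target
```

With a fixed $n$ this segment length never changes; ES-DQN's contribution, per the abstract, is to choose the segment length dynamically from the similarity of the states visited rather than holding it constant.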
