Paper title
Actor-Critic or Critic-Actor? A Tale of Two Time Scales
Authors
Abstract
We revisit the standard formulation of the tabular actor-critic algorithm as a two-time-scale stochastic approximation, with the value function computed on the faster time scale and the policy on the slower one. This emulates policy iteration. We observe that reversing the time scales instead emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two schemes empirically, with and without function approximation (using both linear and nonlinear function approximators), and observe that the proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
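To make the time-scale reversal concrete, the following is a minimal tabular sketch, not the paper's exact update rules. It assumes a small random MDP, a softmax policy, a TD(0) critic, and the step-size schedules 1/(n+1)^0.6 (fast) and 1/(n+1) (slow); all of these, along with the names `step_sizes` and `run`, are illustrative choices that do not come from the abstract. Setting `critic_fast=True` gives the usual actor-critic ordering, and `critic_fast=False` swaps the two schedules to give the critic-actor ordering.

```python
import numpy as np


def step_sizes(n, critic_fast):
    """Return (critic_step, actor_step) at iteration n.

    Both schedules are assumptions for illustration; slow/fast -> 0 as n grows.
    """
    fast = 1.0 / (n + 1) ** 0.6
    slow = 1.0 / (n + 1)
    return (fast, slow) if critic_fast else (slow, fast)


def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


def run(critic_fast, steps=5000, gamma=0.9, seed=0):
    """Run a tabular two-time-scale scheme on a hypothetical random MDP."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = 3, 2
    # P[s, a] is a distribution over next states; R[s, a] is a reward in [0, 1].
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    V = np.zeros(n_states)                    # critic: state values
    theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
    s = 0
    for n in range(steps):
        critic_step, actor_step = step_sizes(n, critic_fast)
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        s_next = rng.choice(n_states, p=P[s, a])

        # TD(0) error, shared by the critic and actor updates.
        delta = R[s, a] + gamma * V[s_next] - V[s]
        V[s] += critic_step * delta

        # Gradient of log pi(a | s) with respect to theta[s] for a softmax policy.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += actor_step * delta * grad_log_pi

        s = s_next
    return V, theta


if __name__ == "__main__":
    for critic_fast, name in [(True, "actor-critic"), (False, "critic-actor")]:
        V, theta = run(critic_fast)
        print(name, "V =", np.round(V, 3))
```

The only difference between the two runs is which update receives the larger step size, which is the reversal the abstract describes; the sketch makes no claim about which ordering converges faster.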