Paper Title

MM-KTD: Multiple Model Kalman Temporal Differences for Reinforcement Learning

Paper Authors

Parvin Malekzadeh, Mohammad Salimibeni, Arash Mohammadi, Akbar Assa, Konstantinos N. Plataniotis

Paper Abstract

There has been an increasing surge of interest in the development of advanced Reinforcement Learning (RL) systems as intelligent approaches to learn optimal control policies directly from smart agents' interactions with the environment. Objectives: In a model-free RL method with a continuous state space, the value function of the states typically needs to be approximated. In this regard, Deep Neural Networks (DNNs) provide an attractive modeling mechanism to approximate the value function using sample transitions. DNN-based solutions, however, suffer from high sensitivity to parameter selection, are prone to overfitting, and are not very sample efficient. A Kalman-based methodology, on the other hand, could be used as an efficient alternative. Such an approach, however, commonly requires a priori information about the system (such as noise statistics) to perform efficiently. The main objective of this paper is to address this issue. Methods: As a remedy to the aforementioned problems, this paper proposes an innovative Multiple Model Kalman Temporal Difference (MM-KTD) framework, which adapts the parameters of the filter using the observed states and rewards. Moreover, an active learning method is proposed to enhance the sampling efficiency of the system. More specifically, the estimated uncertainty of the value functions is exploited to form the behaviour policy, leading to more visits to less-certain values and, therefore, improving the overall sample efficiency of learning. As a result, the proposed MM-KTD framework can learn the optimal policy with a significantly reduced number of samples compared to its DNN-based counterparts. Results: To evaluate the performance of the proposed MM-KTD framework, we have performed a comprehensive set of experiments on three RL benchmarks. The experimental results show the superiority of the MM-KTD framework over its state-of-the-art counterparts.
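To make the Kalman Temporal Difference idea summarized in the abstract concrete, the Python sketch below tracks the weights of a linear value function with a Kalman filter, using the TD relation as the filter's observation, and uses the resulting weight covariance as the exploration signal for the behaviour policy. This is a minimal illustrative sketch, not the authors' MM-KTD implementation: the class name KTDValue, the helper uncertainty_greedy_action, the fixed process/observation noise values (which MM-KTD would instead adapt from the observed states and rewards), and the exploration weight c are all assumptions made here for illustration.

import numpy as np

class KTDValue:
    """Kalman filter over the weights of a linear value function V(s) ~ theta . phi(s)."""

    def __init__(self, n_features, gamma=0.99, process_noise=1e-4, obs_noise=1.0):
        self.theta = np.zeros(n_features)            # value-function weights (the filter's hidden state)
        self.P = np.eye(n_features)                  # weight covariance, i.e. estimation uncertainty
        self.Q = process_noise * np.eye(n_features)  # process-noise covariance (MM-KTD would adapt this online)
        self.R = obs_noise                           # observation-noise variance (likewise adapted in MM-KTD)
        self.gamma = gamma

    def update(self, phi_s, phi_next, reward, terminal=False):
        # Treat the TD relation r = (phi(s) - gamma * phi(s')) . theta + noise as the Kalman observation.
        h = phi_s - (0.0 if terminal else self.gamma) * phi_next
        self.P = self.P + self.Q                      # predict: weights drift with process noise
        innovation = reward - h @ self.theta          # TD error plays the role of the innovation
        s_var = h @ self.P @ h + self.R               # innovation variance
        gain = (self.P @ h) / s_var                   # Kalman gain
        self.theta = self.theta + gain * innovation   # correct the weight estimate
        self.P = self.P - np.outer(gain, h @ self.P)  # shrink the covariance after the measurement
        return innovation

    def value_and_std(self, phi_s):
        # Value estimate together with its standard deviation under the current weight covariance.
        return phi_s @ self.theta, float(np.sqrt(phi_s @ self.P @ phi_s))

def uncertainty_greedy_action(agent, features_per_action, c=1.0):
    # Behaviour-policy sketch: prefer actions whose predicted values are still uncertain,
    # so the agent visits less-certain states more often, as described in the abstract.
    scores = [v + c * std for v, std in (agent.value_and_std(phi) for phi in features_per_action)]
    return int(np.argmax(scores))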
