Paper Title
Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning
Paper Authors
Paper Abstract
The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model simple enough to learn and incurring only bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.
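The abstract describes computing a rate-distortion-optimal, approximately value-equivalent lossy compression of the agent's beliefs about the environment. As a rough illustration of that kind of computation (not the paper's actual algorithm), the sketch below runs a standard Blahut-Arimoto-style iteration over a finite set of candidate environments and surrogate models, assuming a posterior `p_m` over environments and a precomputed distortion matrix whose entries quantify the value-equivalence loss of planning with a given surrogate; all names and parameters here are illustrative assumptions.

```python
import numpy as np

def value_equivalent_compression(p_m, distortion, beta, n_iters=500, tol=1e-10):
    """Blahut-Arimoto-style iteration for a rate-distortion-optimal channel
    q(m_tilde | m) from candidate environments to surrogate models.

    p_m        : posterior over N candidate environments, shape (N,)
    distortion : distortion[i, j] = value-equivalence loss of planning with
                 surrogate j when the true environment is i, shape (N, K)
    beta       : Lagrange multiplier trading off rate against distortion
    """
    n_envs, n_surrogates = distortion.shape
    q_surrogate = np.full(n_surrogates, 1.0 / n_surrogates)  # marginal over surrogates
    cond = np.full((n_envs, n_surrogates), 1.0 / n_surrogates)
    for _ in range(n_iters):
        # Channel update: q(j | i) proportional to q(j) * exp(-beta * d(i, j)).
        log_cond = np.log(q_surrogate)[None, :] - beta * distortion
        log_cond -= log_cond.max(axis=1, keepdims=True)  # numerical stability
        cond = np.exp(log_cond)
        cond /= cond.sum(axis=1, keepdims=True)
        # Marginal update: q(j) = sum_i p(i) * q(j | i).
        new_q = p_m @ cond
        if np.max(np.abs(new_q - q_surrogate)) < tol:
            q_surrogate = new_q
            break
        q_surrogate = new_q
    # Rate I(M; M_tilde) in nats and expected value-equivalence distortion.
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(cond > 0.0,
                      cond * (np.log(cond) - np.log(q_surrogate)[None, :]),
                      0.0)
    rate = float(np.sum(p_m[:, None] * kl))
    expected_distortion = float(np.sum(p_m[:, None] * cond * distortion))
    return cond, q_surrogate, rate, expected_distortion


# Toy usage: three equally likely candidate environments, two surrogate models.
if __name__ == "__main__":
    p_m = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])
    d = np.array([[0.0, 0.9],
                  [0.1, 0.5],
                  [0.8, 0.0]])
    channel, marginal, rate, dist = value_equivalent_compression(p_m, d, beta=5.0)
    print("rate (nats):", round(rate, 3), "expected distortion:", round(dist, 3))
```

Larger values of `beta` penalize distortion more heavily, recovering a surrogate closer to exact value equivalence at the cost of a higher rate; smaller values yield simpler (lower-rate) targets with larger sub-optimality, mirroring the two forms of the regret bound stated in the abstract.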