Paper Title

More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization

Paper Authors

Jiangxing Wang, Deheng Ye, Zongqing Lu

Paper Abstract

In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines. Our code is available at https://github.com/PKU-RL/FOP-DMAC-MACPF.
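The central idea in the abstract, conditioning each agent's policy on other agents' actions during centralized training while keeping an independent, observation-only policy for decentralized execution, can be illustrated with a small sketch. The code below is a minimal, hypothetical PyTorch illustration rather than the authors' implementation (the linked repository contains that): names such as ConditionalAgentPolicy, the correction-head design, and the fixed agent ordering are assumptions made for clarity.

```python
# A minimal sketch of dependent vs. independent policies, assuming a simple
# autoregressive agent ordering and a learned "correction" head on top of the
# independent policy. All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalAgentPolicy(nn.Module):
    """One agent with an independent head (used for decentralized execution)
    and a dependent head that additionally sees the one-hot actions of
    preceding agents (used only in centralized training)."""

    def __init__(self, obs_dim, n_actions, prev_action_dim, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Independent head: logits from the local observation only.
        self.ind_head = nn.Linear(hidden, n_actions)
        # Dependent correction: extra logits from preceding agents' actions.
        self.dep_head = nn.Linear(hidden + prev_action_dim, n_actions)

    def independent_logits(self, obs):
        return self.ind_head(self.encoder(obs))

    def dependent_logits(self, obs, prev_actions_onehot):
        h = self.encoder(obs)
        correction = self.dep_head(torch.cat([h, prev_actions_onehot], dim=-1))
        # Dependent policy = independent policy shifted by a learned correction.
        return self.ind_head(h) + correction


def sample_joint_action(agents, observations, n_actions, training=True):
    """Sample actions agent by agent. In (centralized) training mode, agent i
    conditions on the sampled actions of agents 0..i-1; in execution mode,
    each agent uses only its own local observation."""
    batch = observations[0].shape[0]
    prev = torch.zeros(batch, 0)
    actions = []
    for i, agent in enumerate(agents):
        if training:
            # Zero-pad so every agent's dependent head sees a fixed-width input.
            pad = torch.zeros(batch, (len(agents) - 1) * n_actions - prev.shape[1])
            logits = agent.dependent_logits(observations[i], torch.cat([prev, pad], dim=-1))
        else:
            logits = agent.independent_logits(observations[i])
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        actions.append(a)
        prev = torch.cat([prev, F.one_hot(a, n_actions).float()], dim=-1)
    return torch.stack(actions, dim=-1)


if __name__ == "__main__":
    n_agents, obs_dim, n_actions = 3, 8, 5
    agents = [
        ConditionalAgentPolicy(obs_dim, n_actions, (n_agents - 1) * n_actions)
        for _ in range(n_agents)
    ]
    obs = [torch.randn(4, obs_dim) for _ in range(n_agents)]
    print(sample_joint_action(agents, obs, n_actions, training=True))   # dependent sampling
    print(sample_joint_action(agents, obs, n_actions, training=False))  # decentralized execution
```

In this sketch the dependent head is consulted only when training=True; at execution time each agent samples from its independent head using its local observation alone, which is what permits decentralized execution while the dependency among agents is exploited during centralized training.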
