Paper Title
Markovian Interference in Experiments
Paper Authors
Paper Abstract
We consider experiments in dynamical systems where interventions on some experimental units impact other units through a limiting constraint (such as a limited inventory). Despite its outsized practical importance, the best estimators for this 'Markovian' interference problem are largely heuristic in nature, and their bias is not well understood. We formalize the problem of inference in such experiments as one of policy evaluation. Off-policy estimators, while unbiased, apparently incur a large penalty in variance relative to state-of-the-art heuristics. We introduce an on-policy estimator: the Differences-In-Q's (DQ) estimator. We show that the DQ estimator can in general have exponentially smaller variance than off-policy evaluation. At the same time, its bias is second order in the impact of the intervention. This yields a striking bias-variance tradeoff, such that the DQ estimator effectively dominates state-of-the-art alternatives. From a theoretical perspective, we introduce three separate novel techniques that are of independent interest in the theory of Reinforcement Learning (RL). Our empirical evaluation includes a set of experiments on a city-scale ride-hailing simulator.
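For intuition only, the sketch below illustrates a DQ-style estimate on a toy tabular MDP. The toy chain, the 50/50 per-step randomization, the discounted TD(0) Q-estimation, and all names (n_states, gamma, T) are illustrative assumptions rather than the paper's actual setup (the paper works in an average-reward formulation); it is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states = 5
gamma = 0.99          # discount used only to estimate Q-values (assumption)
T = 200_000           # length of the single experimental trajectory

# Toy dynamics: action a=1 ("treatment") slightly perturbs the transition
# kernel and rewards relative to a=0 ("control").
P = rng.dirichlet(np.ones(n_states), size=(n_states, 2))  # P[s, a] is a dist.
R = rng.uniform(0.0, 1.0, size=(n_states, 2))

Q = np.zeros((n_states, 2))
visits = np.zeros((n_states, 2))
states = np.empty(T, dtype=int)
actions = np.empty(T, dtype=int)
rewards = np.empty(T)

s = 0
for t in range(T):
    a = rng.integers(2)               # randomize treatment 50/50 at each step
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    # Expected-SARSA / TD(0) update for the Q-function of the randomized
    # experimental policy (each arm is played with probability 1/2).
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]
    Q[s, a] += alpha * (r + gamma * Q[s_next].mean() - Q[s, a])
    states[t], actions[t], rewards[t] = s, a, r
    s = s_next

# Naive estimator: difference in average observed reward between arms;
# it ignores how each arm shifts the system's future state (interference).
naive = rewards[actions == 1].mean() - rewards[actions == 0].mean()

# DQ-style estimate: difference in average estimated Q-values between arms,
# which also credits each arm with its downstream effect on the Markov chain.
dq = Q[states[actions == 1], 1].mean() - Q[states[actions == 0], 0].mean()

print(f"naive estimate:    {naive:.4f}")
print(f"DQ-style estimate: {dq:.4f}")
```

Because treatment here is randomized independently of state, the two averages in the DQ-style line compare Q(·, 1) and Q(·, 0) under approximately the same state distribution; this is what lets the Q-value difference pick up downstream (interference) effects that the naive difference in rewards misses.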