Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Paper Title

Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Authors

Hongseok Namkoong, Samuel Daulton, Eytan Bakshy

Abstract

Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many online platforms where latency and ease of deployment are of concern. We operationalize TS by proposing a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation, allowing fast decision-making and easy deployment in mobile and server-based environments. Using batched data collected under the imitation policy, our algorithm iteratively performs offline updates to the TS policy, and learns a new explicit policy representation to imitate it. Empirically, our imitation policy achieves performance comparable to batch TS while allowing more than an order of magnitude reduction in decision-time latency. Buoyed by low latency and simplicity of implementation, our algorithm has been successfully deployed in multiple video upload systems for Meta. Using a randomized controlled trial, we show our algorithm resulted in significant improvements in video quality and watch time.
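
The loop described in the abstract (collect a batch under the imitation policy, update the TS posterior offline, then fit an explicit policy to imitate TS) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a linear-Gaussian reward model per action, a linear-softmax imitation policy fit by cross-entropy, and a simulated environment; all function names, hyperparameters, and the simulator are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def update_posterior(X, A, R, k, prior_var=1.0, noise_var=0.25):
    """Offline Bayesian linear regression update, one posterior per action (assumed model)."""
    d = X.shape[1]
    mus, covs = [], []
    for a in range(k):
        Xa, Ra = X[A == a], R[A == a]
        prec = np.eye(d) / prior_var + Xa.T @ Xa / noise_var
        cov = np.linalg.inv(prec)
        cov = (cov + cov.T) / 2  # symmetrize against round-off
        mus.append(cov @ Xa.T @ Ra / noise_var)
        covs.append(cov)
    return mus, covs

def ts_action_probs(x, mus, covs, rng, n_samples=200):
    """Monte Carlo estimate of the TS action distribution at context x."""
    k = len(mus)
    samples = np.stack([rng.multivariate_normal(mus[a], covs[a], size=n_samples)
                        for a in range(k)])           # shape (k, n_samples, d)
    best = (samples @ x).argmax(axis=0)               # best action per posterior draw
    return np.bincount(best, minlength=k) / n_samples

def fit_imitation_policy(X, P, lr=0.5, steps=500):
    """Fit a linear-softmax policy to the TS probabilities P by gradient descent on cross-entropy."""
    n, d = X.shape
    W = np.zeros((d, P.shape[1]))
    for _ in range(steps):
        W -= lr * X.T @ (softmax(X @ W) - P) / n
    return W

def run_distilled_ts(n_rounds=5, batch=300, d=3, k=3, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(k, d))    # hypothetical true reward parameters (simulator only)
    W = np.zeros((d, k))               # initial imitation policy: uniform over actions
    Xs, As, Rs = [], [], []
    for _ in range(n_rounds):
        # 1. Collect a batch by acting with the explicit imitation policy (cheap at decision time).
        X = rng.normal(size=(batch, d))
        probs = softmax(X @ W)
        A = np.array([rng.choice(k, p=p) for p in probs])
        R = np.einsum("nd,nd->n", X, theta[A]) + rng.normal(scale=0.5, size=batch)
        Xs.append(X); As.append(A); Rs.append(R)
        # 2. Offline update of the TS posterior on all data collected so far.
        mus, covs = update_posterior(np.vstack(Xs), np.concatenate(As), np.concatenate(Rs), k)
        # 3. Distill: fit the explicit policy to the TS action distribution on the batch contexts.
        P = np.array([ts_action_probs(x, mus, covs, rng) for x in X])
        W = fit_imitation_policy(X, P)
    return W, theta
```

At decision time only `softmax(X @ W)` is evaluated, so no posterior sampling or optimization is needed when choosing an action; in this sketch that is where the latency saving mentioned in the abstract would come from.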
