Non-Stationary Latent Bandits

Paper Title

Non-Stationary Latent Bandits

Authors

Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Mohammad Ghavamzadeh, Craig Boutilier

Abstract

Users of recommender systems often behave in a non-stationary fashion, due to their evolving preferences and tastes over time. In this work, we propose a practical approach for fast personalization to non-stationary users. The key idea is to frame this problem as a latent bandit, where the prototypical models of user behavior are learned offline and the latent state of the user is inferred online from its interactions with the models. We call this problem a non-stationary latent bandit. We propose Thompson sampling algorithms for regret minimization in non-stationary latent bandits, analyze them, and evaluate them on a real-world dataset. The main strength of our approach is that it can be combined with rich offline-learned models, which can be misspecified, and are subsequently fine-tuned online using posterior sampling. In this way, we naturally combine the strengths of offline and online learning.
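The key idea in the abstract (offline-learned models, a latent user state inferred online, actions chosen by posterior sampling) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' algorithm: it assumes a finite set of latent states, Bernoulli rewards with means strictly between 0 and 1, and a known latent-state transition matrix. The function name `latent_thompson_sampling` and all parameter names are hypothetical and chosen only for illustration.

```python
import numpy as np


def latent_thompson_sampling(mean_rewards, transition, true_states, rng):
    """Simulate posterior sampling over a changing latent state.

    mean_rewards: (S, A) array of Bernoulli reward means per latent state
        and arm. It stands in for the offline-learned model (assumed given).
    transition: (S, S) row-stochastic matrix of latent-state dynamics
        (assumed known here).
    true_states: sequence of the environment's latent state per round
        (used only to simulate rewards, never shown to the learner).
    rng: a numpy.random.Generator.
    """
    n_states, _ = mean_rewards.shape
    belief = np.full(n_states, 1.0 / n_states)  # uniform prior over states
    total_reward = 0.0
    beliefs = []

    for s_true in true_states:
        # Thompson step: sample a latent state from the current belief
        # and act greedily with respect to that state's reward model.
        s_hat = rng.choice(n_states, p=belief)
        arm = int(np.argmax(mean_rewards[s_hat]))

        # The environment draws a Bernoulli reward from the true state.
        reward = 1.0 if rng.random() < mean_rewards[s_true, arm] else 0.0
        total_reward += reward

        # Bayes update: reweight each state by the reward likelihood.
        p = mean_rewards[:, arm]
        likelihood = p if reward == 1.0 else 1.0 - p
        posterior = belief * likelihood
        posterior /= posterior.sum()

        # Non-stationarity: push the posterior through the transition model.
        belief = posterior @ transition
        belief /= belief.sum()  # guard against floating-point drift
        beliefs.append(belief.copy())

    return total_reward, np.array(beliefs)


# Example usage: two user "moods", two items, and a mood switch at round 100.
rng = np.random.default_rng(0)
mean_rewards = np.array([[0.9, 0.1], [0.1, 0.9]])
transition = np.array([[0.95, 0.05], [0.05, 0.95]])
true_states = [0] * 100 + [1] * 100
total_reward, beliefs = latent_thompson_sampling(
    mean_rewards, transition, true_states, rng
)
```

Because the transition step keeps some probability mass on every state, the belief can recover after the user's latent state changes. A stationary latent bandit would omit that step and could lock onto an outdated state.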
