Paper Title
Autoregressive Bandits
Paper Authors
Paper Abstract
Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for in order to guarantee convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.
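To make the setting concrete, below is a minimal simulation sketch of an action-dependent AR($k$) reward process of the kind the abstract describes. The specific parameterization (the `gamma` array, the noise level, and the round-robin placeholder policy) is purely illustrative and assumed for this example; it is not taken from the paper, and AR-UCB would replace the placeholder policy with optimistic action selection.

```python
import numpy as np

# Illustrative sketch (not the paper's code): each action a has AR(k) parameters
# gamma[a] = (gamma_0, gamma_1, ..., gamma_k), and the reward is assumed to evolve as
#   x_t = gamma_0(a_t) + sum_{i=1..k} gamma_i(a_t) * x_{t-i} + noise_t.
rng = np.random.default_rng(0)

n_actions, k = 3, 2                                      # n actions, AR order k (illustrative values)
gamma = rng.uniform(0.0, 0.4, size=(n_actions, k + 1))   # with k=2 and entries < 0.5, each action satisfies sum_i |gamma_i| < 1

def step(history, action, noise_std=0.1):
    """Return the next reward given the last k rewards and the chosen action."""
    g = gamma[action]
    # g[1:] weights the most recent rewards first, hence the reversal of the last k entries.
    return g[0] + g[1:] @ history[-k:][::-1] + rng.normal(0.0, noise_std)

# Roll out a few steps with a placeholder round-robin policy to illustrate the dynamics.
history = np.zeros(k)                                    # initial rewards
for t in range(10):
    a = t % n_actions                                    # placeholder; AR-UCB would pick the optimistic action here
    history = np.append(history, step(history, a))
print(history[k:])
```

The point of the sketch is only that consecutive rewards are coupled through the last $k$ observations, so a policy that ignores this temporal dependence (as a standard bandit baseline would) can converge to a suboptimal action.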