Paper Title
Autoregressive Bandits
Paper Authors
Paper Abstract
Autoregressive processes naturally arise in a large variety of real-world scenarios, including stock markets, sales forecasting, weather prediction, advertising, and pricing. When facing a sequential decision-making problem in such a context, the temporal dependence between consecutive observations should be properly accounted for in order to guarantee convergence to the optimal policy. In this work, we propose a novel online learning setting, namely, Autoregressive Bandits (ARBs), in which the observed reward is governed by an autoregressive process of order $k$, whose parameters depend on the chosen action. We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed. Then, we devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}} \left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$, where $T$ is the optimization horizon, $n$ is the number of actions, and $\Gamma < 1$ is a stability index of the process. Finally, we empirically validate our algorithm, illustrating its advantages w.r.t. bandit baselines and its robustness to misspecification of key parameters.
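To make the setting concrete, below is a minimal simulation sketch of an action-dependent AR($k$) reward process of the kind the abstract describes. The specific parameterization (the `gamma` array, the noise level, and the round-robin placeholder policy) is purely illustrative and assumed for this example; it is not taken from the paper, and AR-UCB would replace the placeholder policy with optimistic action selection.

```python
import numpy as np

# Illustrative sketch (not the paper's code): each action a has AR(k) parameters
# gamma[a] = (gamma_0, gamma_1, ..., gamma_k), and the reward is assumed to evolve as
#   x_t = gamma_0(a_t) + sum_{i=1..k} gamma_i(a_t) * x_{t-i} + noise_t.
rng = np.random.default_rng(0)

n_actions, k = 3, 2                                      # n actions, AR order k (illustrative values)
gamma = rng.uniform(0.0, 0.4, size=(n_actions, k + 1))   # with k=2 and entries < 0.5, each action satisfies sum_i |gamma_i| < 1

def step(history, action, noise_std=0.1):
    """Return the next reward given the last k rewards and the chosen action."""
    g = gamma[action]
    # g[1:] weights the most recent rewards first, hence the reversal of the last k entries.
    return g[0] + g[1:] @ history[-k:][::-1] + rng.normal(0.0, noise_std)

# Roll out a few steps with a placeholder round-robin policy to illustrate the dynamics.
history = np.zeros(k)                                    # initial rewards
for t in range(10):
    a = t % n_actions                                    # placeholder; AR-UCB would pick the optimistic action here
    history = np.append(history, step(history, a))
print(history[k:])
```

The point of the sketch is only that consecutive rewards are coupled through the last $k$ observations, so a policy that ignores this temporal dependence (as a standard bandit baseline would) can converge to a suboptimal action.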