Paper Title

Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards

Authors

Aadirupa Saha, Pierre Gaillard, Michal Valko

Abstract

In this paper, we consider the problem of sleeping bandits with stochastic action sets and adversarial rewards. In this setting, in contrast to most work in bandits, the actions may not be available at all times. For instance, some products might be out of stock in item recommendation. The best existing efficient (i.e., polynomial-time) algorithms for this problem only guarantee an $O(T^{2/3})$ upper-bound on the regret. Yet, inefficient algorithms based on EXP4 can achieve $O(\sqrt{T})$. In this paper, we provide a new computationally efficient algorithm inspired by EXP3 satisfying a regret of order $O(\sqrt{T})$ when the availabilities of each action $i \in \mathcal{A}$ are independent. We then study the most general version of the problem, where at each round the available set is generated from some unknown, arbitrary distribution (i.e., without the independence assumption), and propose an efficient algorithm with an $O(\sqrt{2^K T})$ regret guarantee. Our theoretical results are corroborated with experimental evaluations.
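To illustrate the setting described in the abstract, below is a minimal sketch of the general idea behind an EXP3-style learner with sleeping actions: at each round the exponential-weights distribution is restricted to the currently available actions, the played action's loss is importance-weighted, and the weights are updated multiplicatively. This is only an illustrative sketch, not the paper's exact algorithm or analysis; the availability probability (0.8), the random losses, and the tuning of the learning rate `eta` are placeholders chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5          # number of actions
T = 10_000     # horizon
eta = np.sqrt(np.log(K) / (K * T))  # standard EXP3-style learning rate (illustrative tuning)

weights = np.ones(K)

for t in range(T):
    # Stochastic availability: each action independently available w.p. 0.8
    # (the independent-availability case mentioned in the abstract; the
    # probability is a placeholder, not from the paper).
    available = rng.random(K) < 0.8
    if not available.any():
        continue  # no action can be played this round

    # Restrict the exponential-weights distribution to the available set.
    p = np.where(available, weights, 0.0)
    p /= p.sum()

    # Play an action and observe an adversarial loss in [0, 1]
    # (random here purely for illustration).
    action = rng.choice(K, p=p)
    loss = rng.random()

    # Importance-weighted loss estimate for the played action only.
    loss_hat = np.zeros(K)
    loss_hat[action] = loss / p[action]

    # Multiplicative-weights update, rescaled for numerical stability.
    weights *= np.exp(-eta * loss_hat)
    weights /= weights.max()
```

The key design point this sketch highlights is that only available actions receive probability mass in a round, while the importance-weighted estimates keep the weight updates unbiased for the actions that were actually playable.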
