Paper Title
Offline Contextual Bandits with Overparameterized Models
Paper Authors
Paper Abstract
Recent results in supervised learning suggest that while overparameterized models have the capacity to overfit, they in fact generalize quite well. We ask whether the same phenomenon occurs for offline contextual bandits. Our results are mixed. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this discrepancy is due to the \emph{action-stability} of their objectives. An objective is action-stable if there exists a prediction (action-value vector or action distribution) which is optimal no matter which action is observed. While value-based objectives are action-stable, policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret for policy-based algorithms. In our experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
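As a loose illustration of the action-stability notion described in the abstract (not code from the paper), the sketch below contrasts a value-based regression objective with an inverse-propensity-scored policy objective for a single logged interaction. The function names, reward values, and propensities are invented for the example; it only shows how the minimizer of the value-based loss can be independent of which action was logged, while the minimizer of the policy-based loss is not.

```python
# Minimal sketch (assumed setup, not the authors' implementation):
# one context, K actions, and a logged tuple (action a, reward r, propensity p).
import numpy as np

K = 3  # number of actions (illustrative)

def value_based_loss(q, a, r):
    """Value-based objective: squared error between the predicted value
    of the observed action and the observed reward."""
    return (q[a] - r) ** 2

def policy_based_loss(pi, a, r, p):
    """Policy-based objective: negative inverse-propensity-weighted reward
    of the learned action distribution pi."""
    return -(pi[a] / p) * r

# If the prediction q equals the true expected rewards, the value-based loss
# is minimized no matter which action was logged -- "action-stable".
true_q = np.array([0.5, 0.5, 0.5])
for a in range(K):
    print("value-based loss, observed action", a, "->",
          value_based_loss(true_q, a, r=0.5))  # 0.0 for every observed action

# The policy-based loss is minimized by putting all mass on whichever action
# happened to be observed (for positive rewards), so no single policy is
# optimal for every possible observed action -- "action-unstable".
for a in range(K):
    best = np.zeros(K)
    best[a] = 1.0  # the minimizer changes with the observed action
    print("policy-based loss, observed action", a, "->",
          policy_based_loss(best, a, r=0.5, p=1.0 / K))
```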