Paper Title
A Theory of Dynamic Benchmarks
Paper Authors
Paper Abstract
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.
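The sequential setting described in the abstract, where data collection and model fitting alternate, can be illustrated with a toy simulation. The sketch below is only an illustration under stated assumptions and is not the authors' formal model: the integer domain, the interval-shaped ground truth, the threshold model class, and the names `true_label`, `fit`, `predict`, and `simulate` are all hypothetical. In each round a model is fit on the data gathered so far, then new examples are drawn only from the points that model gets wrong. It does not try to reproduce the three-round result, label noise, or the hierarchical design.

```python
import random

def true_label(x):
    # Hypothetical ground truth: an interval, which no single threshold rule can fit.
    return 1 if 30 <= x < 70 else 0

def predict(model, x):
    # A model is (t, flip): predict 1 when x >= t, or the opposite when flip is True.
    t, flip = model
    label = 1 if x >= t else 0
    return 1 - label if flip else label

def fit(sample):
    # Empirical risk minimization over all threshold rules on the integer domain.
    candidates = [(t, flip) for t in range(101) for flip in (False, True)]
    return min(candidates, key=lambda m: sum(predict(m, x) != y for x, y in sample))

def accuracy(model, pool):
    # Fraction of the full domain that the model labels correctly.
    return sum(predict(model, x) == y for x, y in pool) / len(pool)

def simulate(rounds=5, n=200, seed=0):
    rng = random.Random(seed)
    pool = [(x, true_label(x)) for x in range(100)]
    # Round 0 data: n draws from the uniform distribution over the domain.
    train = [rng.choice(pool) for _ in range(n)]
    history = []
    for _ in range(rounds):
        model = fit(train)                      # model fitting step
        history.append(accuracy(model, pool))
        # Data collection step: simulated annotators find the current model's failures.
        failures = [(x, y) for x, y in pool if predict(model, x) != y]
        if not failures:
            break
        # Mix n adversarially collected examples into the training data.
        train.extend(rng.choice(failures) for _ in range(n))
    return history

if __name__ == "__main__":
    for i, acc in enumerate(simulate(), start=1):
        print(f"round {i}: accuracy {acc:.2f}")
```

Running the script prints the per-round accuracy of the latest model on the full domain, which gives a quick way to see whether added rounds keep helping in this toy setup. Because the model class here cannot represent the target, the setup is a convenient place to experiment with how adversarially collected data shifts the fit from round to round.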