Paper Title
Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA
Paper Authors
Paper Abstract
While research on explaining the predictions of open-domain QA (ODQA) systems to users is gaining momentum, most works have failed to evaluate the extent to which explanations improve user trust. The few works that do evaluate explanations with user studies employ settings that may deviate from end users' usage in the wild: ODQA is most ubiquitous in voice assistants, yet current research evaluates explanations only with a visual display, and may erroneously extrapolate conclusions about the most performant explanations to other modalities. To alleviate these issues, we conduct user studies that measure whether explanations help users correctly decide when to accept or reject an ODQA system's answer. Unlike prior work, we control for explanation modality, i.e., whether explanations are communicated to users through a spoken or a visual interface, and contrast their effectiveness across modalities. Our results show that explanations derived from retrieved evidence passages can outperform strong baselines (calibrated confidence) across modalities, but the best explanation strategy in fact changes with the modality. We show common failure cases of current explanations, emphasize the need for end-to-end evaluation of explanations, and caution against evaluating them in proxy modalities that differ from the deployment modality.