Paper Title

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Paper Authors

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

Paper Abstract

When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
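
As a rough illustration of the prompting setup the abstract describes (question, context, and options in; answer followed by lecture and explanation out), here is a minimal Python sketch of assembling a few-shot chain-of-thought prompt. The helper names, template strings, and example structure are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a ScienceQA-style chain-of-thought prompt, loosely
# following the QCM -> ALE setup from the abstract: question, context, and
# multiple options go in; the answer plus lecture and explanation come out.
# Helper names and template strings are illustrative assumptions, not the
# paper's verbatim implementation.

def format_example(question, context, options,
                   answer=None, lecture=None, explanation=None):
    """Render one item; gold fields are filled in for few-shot
    demonstrations and left out for the unsolved query."""
    opts = " ".join(f"({chr(97 + i)}) {o}" for i, o in enumerate(options))
    text = f"Question: {question}\nContext: {context}\nOptions: {opts}\n"
    if answer is not None:
        # CoT output order: state the answer, then the lecture and
        # explanation that justify it.
        text += (f"Answer: The answer is ({answer}). "
                 f"BECAUSE: {lecture} {explanation}\n")
    else:
        text += "Answer:"
    return text

def build_few_shot_prompt(demonstrations, test_item):
    """Concatenate solved demonstrations followed by the unsolved test
    question, ready for a text-completion language model."""
    demos = "\n".join(format_example(**d) for d in demonstrations)
    query = format_example(test_item["question"],
                           test_item["context"],
                           test_item["options"])
    return demos + "\n" + query
```

The upper-bound experiment mentioned in the abstract (feeding explanations in the input) would correspond to appending the gold lecture and explanation to the query block itself rather than asking the model to generate them.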
