Paper Title

Incivility Detection in Open Source Code Review and Issue Discussions

Authors

Ferreira, Isabella; Rafiq, Ahlaam; Cheng, Jinghui

Abstract

Given the democratic nature of open source development, code review and issue discussions may be uncivil. Incivility, defined as features of discussion that convey an unnecessarily disrespectful tone, can have negative consequences for open source communities. To prevent or minimize these negative consequences, open source platforms have included mechanisms for removing uncivil language from discussions. However, such approaches require manual inspection, which can be overwhelming given the large number of discussions. To help open source communities deal with this problem, in this paper we compare six classical machine learning models with BERT for detecting incivility in open source code review and issue discussions. Furthermore, we assess whether adding contextual information improves the models' performance and how well the models perform in a cross-platform setting. We found that BERT performs better than the classical machine learning models, with a best F1-score of 0.95. Furthermore, the classical machine learning models tend to underperform in detecting non-technical and civil discussions. Our results show that adding contextual information to BERT did not improve its performance and that none of the analyzed classifiers performed well in a cross-platform setting. Finally, we provide insights into the tones that the classifiers misclassify.
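The abstract contrasts classical machine learning classifiers with BERT for labeling discussion text as civil or uncivil. As a rough illustration of the classical side of that comparison (not the paper's actual models, features, or data), the sketch below trains a bag-of-words multinomial Naive Bayes classifier on a tiny hypothetical set of code-review comments; the `TRAIN` examples and the two-label scheme are invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy examples; the paper's real dataset of code review and
# issue discussion texts is not reproduced here.
TRAIN = [
    ("thanks for the patch, looks good to me", "civil"),
    ("great work, merging this now", "civil"),
    ("appreciate the detailed review comments", "civil"),
    ("this patch is garbage, did you even test it", "uncivil"),
    ("stop wasting everyone's time with nonsense", "uncivil"),
    ("what a stupid change, read the docs", "uncivil"),
]

def tokenize(text):
    # Naive whitespace tokenizer; real systems would normalize punctuation.
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in samples:
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        for label, count in self.class_counts.items():
            lp = math.log(count / self.total)  # log class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

clf = NaiveBayes().fit(TRAIN)
print(clf.predict("thanks, this looks good"))             # → civil
print(clf.predict("this is garbage, stop wasting time"))  # → uncivil
```

Classifiers of this kind rely purely on word frequencies, which is one plausible reason the abstract reports them underperforming on discussions whose tone depends on context rather than on individual words.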
