An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis

Paper Title

An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis

Authors

Ehsan Mashhadi, Shaiful Chowdhury, Somayeh Modaberi, Hadi Hemmati, Gias Uddin

Abstract

In the past couple of decades, significant research effort has been devoted to the prediction of software bugs (i.e., defects). In general, these works leverage a diverse set of metrics, tools, and techniques to predict which classes, methods, lines, or commits are buggy. However, most existing work in this domain treats all bugs the same, which is not the case in practice. The more severe a bug, the greater its consequences. Therefore, it is important for a defect prediction method to estimate the severity of the identified bugs, so that the higher-severity ones get immediate attention. In this paper, we provide a quantitative and qualitative study on two popular datasets (Defects4J and Bugs.jar), using 10 common source code metrics and two popular static analysis tools (SpotBugs and Infer), to analyze their capability to predict defects and their severity. We studied 3,358 buggy methods with different severity labels from 19 Java open-source projects. Results show that although code metrics are useful in predicting buggy code (Lines of Code, Maintainability Index, FanOut, and Effort are the best), they cannot estimate the severity level of the bugs. In addition, we observed that static analysis tools perform weakly both in predicting bugs (F1 scores ranging from 3.1% to 7.1%) and in predicting their severity labels (F1 score under 2%). We also manually studied the characteristics of the severe bugs to identify possible reasons behind the weak performance of code metrics and static analysis tools in estimating severity. Our categorization further shows that Security bugs have high severity in most cases, while Edge/Boundary faults have low severity. Finally, we discuss the practical implications of the results and propose new directions for future research.
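For readers less familiar with the F1 score quoted in the abstract, the following is a minimal sketch of how it is computed from true positives, false positives, and false negatives. The function name `f1_score` and the counts used below are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own evaluation setup may differ.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp: buggy methods correctly flagged
    fp: clean methods wrongly flagged
    fn: buggy methods that were missed
    Returns 0.0 when there are no true positives (F1 is otherwise undefined).
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (NOT data from the paper).
tp, fp, fn = 5, 95, 95
print(f"F1 = {f1_score(tp, fp, fn):.1%}")
```

With these made-up counts, both precision and recall are 5%, so the F1 score is also 5%. A single-digit F1 score therefore means that a tool both misses most bugs and raises mostly false alarms.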
