Paper title
Cost-Sensitive Machine Learning Classification for Mass Tuberculosis Verbal Screening
Paper authors
Abstract
Score-based algorithms for tuberculosis (TB) verbal screening perform poorly: misclassification leads to missed cases and, for false positives, to unnecessary and costly laboratory tests. We compared the score-based classification defined by clinicians with machine learning classifiers such as SVM-RBF, logistic regression, and XGBoost. We restricted our analyses to data from adults, the population most affected by TB, and investigated the difference between untuned, unweighted classifiers and cost-sensitive ones. Predictions were compared with the corresponding GeneXpert MTB/RIF results. After adjusting the weight of the positive class to 40 for XGBoost, we achieved 96.64% sensitivity and 35.06% specificity. Compared with the traditional score-based method defined by our clinicians, the sensitivity of our classifier increased by 1.26% and its specificity by 13.19% in absolute terms. Our approach further showed that only 2000 data points were sufficient for the model to converge. The results indicate that, even with limited data, a better method can be devised to identify TB suspects from verbal screening.
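
The effect of weighting the positive class can be illustrated with a minimal sketch. It is not the authors' pipeline: it swaps in a small pure-Python logistic regression for XGBoost, and the data generator, symptom rates, learning rate, and epoch count are all hypothetical. Only the positive-class weight of 40 comes from the abstract. In XGBoost the analogous setting is presumably the `scale_pos_weight` parameter, although the abstract does not say which option was used.

```python
import math
import random


def make_synthetic_screening(n, prevalence=0.05, seed=0):
    """Hypothetical data: 4 yes/no symptom answers per person, label 1 = TB-positive."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < prevalence else 0
        p_symptom = 0.8 if label == 1 else 0.3  # assumed symptom rates, not from the paper
        X.append([1.0 if rng.random() < p_symptom else 0.0 for _ in range(4)])
        y.append(label)
    return X, y


def fit_weighted_logistic(X, y, pos_weight=1.0, lr=0.5, epochs=150):
    """Batch gradient descent on a log loss where each positive counts pos_weight times."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    total_weight = sum(pos_weight if yi == 1 else 1.0 for yi in y)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient of the weighted log loss with respect to z.
            err = (pos_weight if yi == 1 else 1.0) * (p - yi)
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / total_weight
        w = [wj - lr * gj / total_weight for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, X):
    """Label 1 when the predicted probability is at least 0.5 (z >= 0)."""
    return [1 if b + sum(wj * xj for wj, xj in zip(w, xi)) >= 0.0 else 0 for xi in X]


def sensitivity_specificity(y_true, y_pred):
    """Return (TP / (TP + FN), TN / (TN + FP))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def demo():
    """Compare an unweighted fit with a cost-sensitive one (positive weight 40)."""
    X, y = make_synthetic_screening(2000)
    results = {}
    for name, weight in (("unweighted", 1.0), ("cost_sensitive", 40.0)):
        w, b = fit_weighted_logistic(X, y, pos_weight=weight)
        # Evaluated on the training data for brevity; a real study would hold out a test set.
        results[name] = sensitivity_specificity(y, predict(w, b, X))
    return results


if __name__ == "__main__":
    for name, (sens, spec) in demo().items():
        print(f"{name}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

On this synthetic data the weighted model trades specificity for sensitivity, which is the trade-off the abstract reports. The numbers it prints come from the assumed generator and are not comparable to the paper's results.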