Paper Title
SODA: A Natural Language Processing Package to Extract Social Determinants of Health for Cancer Studies
Paper Authors
Paper Abstract
Objective: We aim to develop an open-source natural language processing (NLP) package, SODA (i.e., SOcial DeterminAnts), with pre-trained transformer models to extract social determinants of health (SDoH) for cancer patients, examine the generalizability of SODA to a new disease domain (i.e., opioid use), and evaluate the extraction rate of SDoH in cancer populations. Methods: We identified SDoH categories and attributes and developed an SDoH corpus using clinical notes from a general cancer cohort. We compared four transformer-based NLP models for SDoH extraction, examined how well the models generalize to a cohort of patients prescribed opioids, and explored customization strategies to improve performance. We applied the best NLP model to extract 19 categories of SDoH from breast (n=7,971), lung (n=11,804), and colorectal (n=6,240) cancer cohorts. Results and Conclusion: We developed a corpus of 629 cancer patients' notes annotated with 13,193 SDoH concepts/attributes from 19 categories of SDoH. The Bidirectional Encoder Representations from Transformers (BERT) model achieved the best strict/lenient F1 scores: 0.9216/0.9441 for SDoH concept extraction and 0.9617/0.9626 for linking attributes to SDoH concepts. Fine-tuning the NLP models using new annotations from opioid-use patients improved the strict/lenient F1 scores from 0.8172/0.8502 to 0.8312/0.8679. The extraction rates varied greatly across the 19 categories of SDoH: 10 categories could be extracted from >70% of cancer patients, whereas 9 had a low extraction rate (<70% of cancer patients). The SODA package with pre-trained transformer models is publicly available at https://github.com/uf-hobiinformatics-lab/SDoH_SODA.
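The abstract reports strict and lenient F1 scores without defining them. The following is a minimal sketch of how such scores are commonly computed for span extraction, assuming that "strict" means an exact match of boundaries and label and "lenient" means any character overlap with the same label. This definition, the function names, and the example labels are assumptions made for illustration; they are not taken from the SODA package.

```python
from typing import List, Tuple

# A span is (start offset, end offset exclusive, label). Labels here are hypothetical.
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f1_score(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """Span-level F1. Strict: exact boundaries and label. Lenient: overlap and label."""
    if lenient:
        matched_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one exact match, one partial overlap, one miss.
gold = [(0, 7, "Tobacco"), (20, 30, "Employment"), (40, 52, "Education")]
pred = [(0, 7, "Tobacco"), (22, 30, "Employment"), (60, 65, "Alcohol")]

strict_f1 = f1_score(gold, pred)
lenient_f1 = f1_score(gold, pred, lenient=True)
print(f"strict F1 = {strict_f1:.4f}, lenient F1 = {lenient_f1:.4f}")
```

Under this definition the lenient score can never be lower than the strict one, which is consistent with every strict/lenient pair reported in the abstract.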