EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Paper Title

EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Authors

Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Jimeng Sun

Abstract

Researchers require timely access to real-world longitudinal electronic health records (EHRs) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. Health systems, in contrast, deeply value patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, because de-identified data are susceptible to re-identification and their volume is limited. Synthetic EHRs offer a potential solution. In this paper, we propose the EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250,000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate that the generated EHR sequences are realistic. We confirm that the performance of predictive models trained on the synthetic data is similar to that of models trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance, improving on the best baseline by as much as 8% in top-20 recall.
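
The abstract reports gains in top-20 recall but does not define the metric. As a rough illustration only, one common reading is: of the codes that truly occur in a patient's next encounter, what fraction appear among the 20 highest-scoring predicted codes. The Python sketch below assumes that reading; the function name `top_k_recall`, its arguments, and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_recall(scores, true_codes, k=20):
    """Fraction of the true codes found among the k highest-scoring codes.

    scores:     list of predicted scores, one per code index (assumed layout)
    true_codes: set of code indices that actually occurred (assumed layout)
    """
    if not true_codes:
        raise ValueError("true_codes must be non-empty")
    # Rank code indices by predicted score, highest first, and keep the first k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & set(true_codes)) / len(true_codes)


# Toy example with five codes and k=2 instead of 20.
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
true_codes = {1, 2}
print(top_k_recall(scores, true_codes, k=2))
```

In practice the per-patient values would be averaged over a held-out set; how the paper aggregates them is not stated in the abstract.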
