Machine learning prediction of incidence of Alzheimer's disease using large-scale administrative health data

Cited by: 62
Authors
Park, Ji Hwan [1]
Cho, Han Eol [2, 3]
Kim, Jong Hun [4]
Wall, Melanie M. [5]
Stern, Yaakov [5, 6]
Lim, Hyunsun [7]
Yoo, Shinjae [1]
Kim, Hyoung Seop [8]
Cha, Jiook [5, 9, 10, 11]
Affiliations
[1] Brookhaven Natl Lab, Computat Sci Initiat, Upton, NY 11973 USA
[2] Yonsei Univ, Gangnam Severance Hosp, Dept Rehabil Med, Coll Med, Seoul, South Korea
[3] Yonsei Univ, Coll Med, Rehabil Inst Neuromuscular Dis, Seoul, South Korea
[4] Ilsan Hosp, Dementia Ctr, Dept Neurol, Natl Hlth Insurance Serv, Goyang, South Korea
[5] Columbia Univ, Vagelos Coll Phys & Surg, Dept Psychiat, New York, NY 10025 USA
[6] Columbia Univ, Vagelos Coll Phys & Surg, Dept Neurol, New York, NY 10025 USA
[7] Ilsan Hosp, Natl Hlth Insurance Serv, Res & Anal Team, Goyang, South Korea
[8] Ilsan Hosp, Dementia Ctr, Dept Phys Med & Rehabil, Natl Hlth Insurance Serv, Goyang, South Korea
[9] Seoul Natl Univ, Dept Psychol, Seoul, South Korea
[10] Seoul Natl Univ, Dept Brain & Cognit Sci, Seoul, South Korea
[11] Seoul Natl Univ, Grad Sch Data Sci, Seoul, South Korea
Funding
National Research Foundation, Singapore;
Keywords
DEMENTIA RISK; COGNITIVE DEFICITS; OLDER PERSONS; POPULATION; DYSFUNCTION; MODELS; ANEMIA; SAMPLE; COHORT;
DOI
10.1038/s41746-020-0256-0
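Chinese Library Classification number
R19 [Health care organization and services (health services administration)];
Subject classification code
Abstract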
Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models based on individuals' history of health and healthcare, going beyond existing risk prediction models. We tested whether machine learning models can predict future incidence of Alzheimer's disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data on adults older than 65 years (N = 40,736) containing 4,894 unique clinical features, including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, we considered two operational definitions: "definite AD", with diagnostic codes and dementia medication (n = 614), and "probable AD", with diagnosis only (n = 2,026). We trained and validated random forest, support vector machine, and logistic regression models to predict incident AD in 1, 2, 3, and 4 subsequent years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction, with AUCs of 0.775 and 0.759 based on the "definite AD" and "probable AD" outcomes, respectively; in 2-year prediction, 0.730 and 0.693; in 3-year prediction, 0.677 and 0.644; and in 4-year prediction, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data for AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or earlier detection in clinical settings.
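To make the evaluation described in the abstract concrete, the following is a minimal sketch of how an AUC could be averaged over balanced bootstrap samples. It is not the authors' code: the function names (`auc`, `balanced_bootstrap`, `bootstrap_auc`), the number of rounds, and the balancing rule (drawing cases and controls with replacement up to the size of the smaller class) are all assumptions made for illustration, and the risk scores are assumed to come from some already-fitted classifier.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # A case scoring above a control counts as 1; a tie counts as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_bootstrap(labels, rng):
    """Indices of a resample with equal numbers of cases and controls.

    Assumption: each class is drawn with replacement up to the size of
    the smaller class. The abstract does not state the exact scheme.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos_idx), len(neg_idx))
    return rng.choices(pos_idx, k=k) + rng.choices(neg_idx, k=k)


def bootstrap_auc(labels, scores, n_rounds=200, seed=0):
    """Mean AUC over n_rounds balanced bootstrap resamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        idx = balanced_bootstrap(labels, rng)
        total += auc([labels[i] for i in idx], [scores[i] for i in idx])
    return total / n_rounds
```

Because AUC depends only on how cases rank relative to controls, it is largely insensitive to class balance, which is consistent with the abstract's remark that results were similar on the unbalanced samples.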
Pages: 7