Prediction of Diabetes Using Data Mining and Machine Learning Algorithms: A Cross-Sectional Study

Cited by: 2
Authors
Shojaee-Mend, Hassan [1]
Velayati, Farnia [2]
Tayefi, Batool [3]
Babaee, Ebrahim [3, 4, 5]
Affiliations
[1] Gonabad Univ Med Sci, Infect Dis Res Ctr, Gonabad, Iran
[2] Shahid Beheshti Univ Med Sci, Natl Res Inst TB & Lung Dis NRITLD, Telemed Res Ctr, Tehran, Iran
[3] Iran Univ Med Sci, Psychosocial Hlth Res Inst, Prevent Med & Publ Hlth Res Ctr, Sch Med,Dept Community & Family Med, Tehran, Iran
[4] Iran Univ Med Sci, Vaccine Res Ctr, Tehran, Iran
[5] Iran Univ Med Sci, Psychosocial Hlth Res Inst, Prevent Publ Hlth Res Ctr, POB 14665-354, Tehran 1449614535, Iran
Keywords
Diabetes Mellitus; Machine Learning; Data Mining; Decision Trees; Risk Factors;
DOI
10.4258/hir.2024.30.1.73
Chinese Library Classification (CLC) number
R-058
Abstract
Objectives: Early diagnosis and treatment of diabetes can improve outcomes and quality of life. This study aimed to develop a model that predicts fasting blood glucose status using machine learning and data mining.
Methods: This cross-sectional study analyzed data from 3376 adults over 30 years old who participated in a diabetes screening program at 16 comprehensive health service centers in Tehran, Iran. The dataset was balanced using random sampling and the synthetic minority over-sampling technique (SMOTE), then split into a training set (80%) and a test set (20%). Shapley values were calculated to select the most important features. Noise analysis was performed by adding Gaussian noise to the numerical features to evaluate the robustness of feature importance. Five machine learning algorithms (CatBoost, random forest, XGBoost, logistic regression, and an artificial neural network) were used to model the dataset. Accuracy, sensitivity, specificity, the F1-score, and the area under the curve (AUC) were used to evaluate the models.
Results: Age, waist-to-hip ratio, body mass index, and systolic blood pressure were the most important factors for predicting fasting blood glucose status. Although the models achieved similar predictive ability, the CatBoost model performed slightly better overall, with an AUC of 0.737.
Conclusions: A gradient boosted decision tree model accurately identified the most important risk factors related to diabetes. In order of importance, these were age, waist-to-hip ratio, body mass index, and systolic blood pressure. This model can support planning for diabetes management and prevention.
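The evaluation measures named in the abstract can be illustrated with a short, dependency-free sketch. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions chosen only for illustration. It computes accuracy, sensitivity, specificity, and the F1-score from confusion-matrix counts, and the AUC as the fraction of positive/negative pairs that the scores rank correctly.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold the scores (0.5 is an assumed default) and compute the metrics."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        "auc": roc_auc(y_true, y_score),
    }


# Toy data, invented for illustration only (not taken from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

On this toy data the threshold-based metrics all equal 0.75 and the AUC is 0.9375. The AUC does not depend on the threshold, which is one reason it is commonly reported alongside the other measures.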
Pages: 73-82
Number of pages: 10