Sample Size Requirements for Popular Classification Algorithms in Tabular Clinical Data: Empirical Study

Cited by: 0
Authors
Silvey, Scott [1 ]
Liu, Jinze [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Sch Publ Hlth, Dept Biostat, 830 East Main St, Richmond, VA 23219 USA
Keywords
medical informatics; machine learning; sample size; research design; decision trees; classification algorithm; clinical research; learning-curve analysis; guidelines; decision making; SELECTION; MODELS; AREA
DOI: 10.2196/60231
Chinese Library Classification: R19 [Health care organization and services (health services management)]
Abstract
Background: The performance of a classification algorithm eventually reaches a point of diminishing returns, where additional samples no longer improve the results. There is therefore a need to determine an optimal sample size that maximizes performance while accounting for computational burden or budgetary concerns. Objective: This study aimed to determine optimal sample sizes and the relationships between sample size and dataset-level characteristics across a variety of binary classification algorithms. Methods: A total of 16 large open-source datasets were collected, each containing a binary clinical outcome. In total, 4 machine learning algorithms were assessed: XGBoost (XGB), random forest (RF), logistic regression (LR), and neural networks (NNs). For each dataset, the cross-validated area under the curve (AUC) was calculated at increasing sample sizes, and learning curves were fit. Sample sizes needed to reach the observed full-dataset AUC minus 2 points (0.02) were calculated from the fitted learning curves and compared across datasets and algorithms. Dataset-level characteristics (minority class proportion, full-dataset AUC, number of features, type of features, and degree of nonlinearity) were examined. Negative binomial regression models were used to quantify relationships between these characteristics and expected sample sizes within each algorithm. A total of 4 multivariable models were constructed, each selecting the best-fitting combination of dataset-level characteristics. Results: Among the 16 datasets (full-dataset sample sizes ranging from 70,000-1,000,000), median sample sizes needed to reach AUC stability were 9960 (XGB), 3404 (RF), 696 (LR), and 12,298 (NN). For all 4 algorithms, more balanced classes (multiplier: 0.93-0.96 for a 1% increase in minority class proportion) were associated with decreased sample size.
Other characteristics varied in importance across algorithms; in general, more features, weaker features, and more complex relationships between the predictors and the response increased expected sample sizes. In multivariable analysis, the top selected predictors were minority class proportion (all 4 algorithms assessed), full-dataset AUC (XGB, RF, and NN), and dataset nonlinearity (XGB, RF, and NN). For LR, the top predictors were minority class proportion, percentage of strong linear features, and number of features. Final multivariable sample size models had high goodness-of-fit, with dataset-level predictors explaining a majority (66.5%-84.5%) of the variation in expected sample sizes. Conclusions: The sample sizes needed to reach AUC stability among 4 popular classification algorithms vary by dataset and method and are associated with dataset-level characteristics that can be influenced or estimated before the start of a research study.
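The learning-curve procedure described in the Methods can be sketched as follows: cross-validated AUC is measured at increasing sample sizes, a learning curve is fit, and the fitted curve is solved for the sample size at which AUC comes within 0.02 of its plateau. The AUC values and the inverse power-law functional form below are illustrative assumptions, not the paper's actual measurements or its exact curve family.

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed inverse power-law learning curve: AUC(n) = a - b * n**(-c),
# where a is the asymptotic (plateau) AUC as n grows.
def learning_curve(n, a, b, c):
    return a - b * n ** (-c)

# Hypothetical cross-validated AUC estimates at increasing sample sizes
# (stand-ins for the per-dataset measurements described in the abstract).
sizes = np.array([100, 250, 500, 1000, 2500, 5000, 10000, 25000], dtype=float)
aucs = np.array([0.62, 0.66, 0.69, 0.72, 0.745, 0.755, 0.762, 0.767])

params, _ = curve_fit(learning_curve, sizes, aucs, p0=[0.8, 1.0, 0.5], maxfev=10000)
a, b, c = params

# Sample size at which the curve reaches the plateau AUC minus 0.02:
# solve a - b * n**(-c) = a - 0.02 for n.
n_needed = (b / 0.02) ** (1 / c)
print(round(n_needed))
```

Here the fitted asymptote `a` stands in for the observed full-dataset AUC; in the study, the target was instead defined from the observed full-dataset AUC directly.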
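The reported multipliers come from negative binomial regression, which models the log of the expected sample size as a linear function of the predictors, so exp(beta) acts as a multiplicative factor per one-unit increase in a predictor. A minimal sketch of that interpretation, using hypothetical numbers (a 0.95 multiplier per 1% increase in minority class proportion, and a 10,000-observation baseline):

```python
import math

# Negative binomial regression: log(E[sample size]) = beta0 + beta1 * x + ...,
# so exp(beta1) is the multiplicative change in expected sample size
# per one-unit increase in the predictor x.
beta = math.log(0.95)   # hypothetical coefficient: 0.95 multiplier per 1% increase
base_n = 10_000         # hypothetical expected sample size at baseline

# Expected sample size after a 10-percentage-point rise in minority class
# proportion: multipliers compound, i.e. base_n * 0.95**10.
n_after = base_n * math.exp(beta * 10)
print(round(n_after))
```

Because the effect is multiplicative, a modest per-unit multiplier (0.93-0.96 in the paper's range) compounds into a substantial reduction in expected sample size as classes become more balanced.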
Pages: 15
Related Articles (50 total)
  • [21] Prototype Reduction Algorithms Comparison in Nearest Neighbor Classification for Sensor Data: Empirical Study
    Rosero-Montalvo, Paul
    Peluffo-Ordonez, Diego H.
    Umaquinga, Ana
    Anaya, Andres
    Serrano, Jorge
    Rosero, Edwin
    Vasquez, Carlos
    Suarez, Luis
    2017 IEEE SECOND ECUADOR TECHNICAL CHAPTERS MEETING (ETCM), 2017,
  • [22] An empirical study on sample size for the central limit theorem using Japanese firm data
    Fukuda, Kosei
    TEACHING STATISTICS, 2024, 46 (03) : 184 - 191
  • [23] Optimizing for Recall in Automatic Requirements Classification: An Empirical Study
    Winkler, Jonas Paul
    Groenberg, Jannis
    Vogelsang, Andreas
    2019 27TH IEEE INTERNATIONAL REQUIREMENTS ENGINEERING CONFERENCE (RE 2019), 2019, : 40 - 50
  • [24] Reduction in sample size requirements for clinical trials while holding study power constant.
    Fries, JF
    Bjorner, J
    Hubert, HB
    ARTHRITIS AND RHEUMATISM, 2005, 52 (09): : S266 - S266
  • [25] An Empirical Study on the Membership Inference Attack against Tabular Data Synthesis Models
    Hyeong, Jihyeon
    Kim, Jayoung
    Park, Noseong
    Jajodia, Sushil
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 4064 - 4068
  • [26] The Influence of Inconsistent Data on Cost-Sensitive Classification Using Prism Algorithms: An Empirical Study
    Hao, Zhiyong
    Yao, Li
    Wang, Yanjuan
    JOURNAL OF COMPUTERS, 2014, 9 (08) : 1880 - 1885
  • [28] Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications
    Zhang, Yiyan
    Xin, Yi
    Li, Qin
    Ma, Jianshe
    Li, Shuai
    Lv, Xiaodan
    Lv, Weiqi
    BIOMEDICAL ENGINEERING ONLINE, 2017, 16
  • [29] Sample Size Requirements for Establishing Clinical Test-Retest Standards
    McMillan, Garnett P.
    Hanson, Timothy E.
    EAR AND HEARING, 2014, 35 (02): : 283 - 286
  • [30] Sample Size Requirements for Training to a κ Agreement Criterion on Clinical Dementia Ratings
    Tractenberg, Rochelle E.
    Yumoto, Futoshi
    Jin, Shelia
    Morris, John C.
    ALZHEIMER DISEASE & ASSOCIATED DISORDERS, 2010, 24 (03): : 264 - 268