Sample Size Requirements for Popular Classification Algorithms in Tabular Clinical Data: Empirical Study

Cited: 0
Authors
Silvey, Scott [1 ]
Liu, Jinze [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Sch Publ Hlth, Dept Biostat, 830 East Main St, Richmond, VA 23219 USA
Keywords
medical informatics; machine learning; sample size; research design; decision trees; classification algorithm; clinical research; learning-curve analysis; guidelines; decision making
DOI
10.2196/60231
Chinese Library Classification
R19 [Health organization and services (health service management)]
Abstract
Background: The performance of a classification algorithm eventually reaches a point of diminishing returns, where additional samples no longer improve the results. Thus, there is a need to determine an optimal sample size that maximizes performance while accounting for computational burden or budgetary concerns.

Objective: This study aimed to determine optimal sample sizes and the relationships between sample size and dataset-level characteristics across a variety of binary classification algorithms.

Methods: A total of 16 large open-source datasets were collected, each containing a binary clinical outcome. Furthermore, 4 machine learning algorithms were assessed: XGBoost (XGB), random forest (RF), logistic regression (LR), and neural networks (NNs). For each dataset, the cross-validated area under the curve (AUC) was calculated at increasing sample sizes, and learning curves were fit. The sample size needed to reach the observed full-dataset AUC minus 2 points (0.02) was calculated from the fitted learning curves and compared across the datasets and algorithms. The following dataset-level characteristics were examined: minority class proportion, full-dataset AUC, number of features, type of features, and degree of nonlinearity. Negative binomial regression models were used to quantify relationships between these characteristics and expected sample sizes within each algorithm. A total of 4 multivariable models were constructed, each selecting the best-fitting combination of dataset-level characteristics.

Results: Among the 16 datasets (full-dataset sample sizes ranging from 70,000 to 1,000,000), the median sample sizes needed to reach AUC stability were 9960 (XGB), 3404 (RF), 696 (LR), and 12,298 (NN). For all 4 algorithms, more balanced classes (multiplier: 0.93-0.96 for a 1% increase in minority class proportion) were associated with decreased sample size. Other characteristics varied in importance across algorithms: in general, more features, weaker features, and more complex relationships between the predictors and the response increased expected sample sizes. In multivariable analysis, the top selected predictors were minority class proportion (all 4 algorithms), full-dataset AUC (XGB, RF, and NN), and dataset nonlinearity (XGB, RF, and NN). For LR, the top predictors were minority class proportion, percentage of strong linear features, and number of features. The final multivariable sample size models had high goodness-of-fit, with dataset-level predictors explaining a majority (66.5%-84.5%) of the variation in expected sample sizes.

Conclusions: The sample sizes needed to reach AUC stability among 4 popular classification algorithms vary by dataset and method and are associated with dataset-level characteristics that can be influenced or estimated before the start of a research study.
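The learning-curve procedure described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the inverse power-law curve form, the example AUC values, and the starting parameters are all assumptions made for demonstration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve data: cross-validated AUC at increasing sample sizes.
sizes = np.array([250, 500, 1000, 2000, 4000, 8000, 16000, 32000], dtype=float)
aucs = np.array([0.62, 0.66, 0.70, 0.73, 0.755, 0.77, 0.778, 0.782])

# An inverse power-law learning curve, a common assumed form: AUC(n) = a - b * n^(-c),
# where a is the asymptotic AUC and b, c control the rate of approach.
def learning_curve(n, a, b, c):
    return a - b * n ** (-c)

(a, b, c), _ = curve_fit(learning_curve, sizes, aucs, p0=(0.8, 1.0, 0.5), maxfev=10000)

# "Full-dataset AUC" here is the fitted AUC at the largest available sample size;
# the stability target is that value minus 2 points (0.02), as in the Methods.
full_auc = learning_curve(sizes[-1], a, b, c)
target_auc = full_auc - 0.02

# Solve a - b * n^(-c) = target_auc for n to get the required sample size.
n_required = (b / (a - target_auc)) ** (1.0 / c)
print(f"Estimated sample size to reach AUC stability: {n_required:.0f}")
```

Because the fitted curve is monotonically increasing in n, the solved sample size always falls below the largest sample size observed, which is what makes the extrapolated target useful for planning smaller studies.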
Pages: 15
Related Papers (50 total)
  • [1] Study on Performance of Classification Algorithms Based on the Sample Size for Crop Prediction
    Rajeshwari, I
    Shyamala, K.
    COMPUTATIONAL VISION AND BIO-INSPIRED COMPUTING, 2020, 1108 : 1051 - 1058
  • [2] Sample size algorithms in clinical trials
    Wu, Chien-Hua
    Won, Shu-Mei
    Yang, Yu-Chun
    Huang, Chiung-Yu
    DRUG INFORMATION JOURNAL, 2008, 42 (05): : 429 - 439
  • [4] The methods for handling missing data in clinical trials influence sample size requirements
    Auleley, GR
    Giraudeau, B
    Baron, G
    Maillefert, JF
    Dougados, M
    Ravaud, P
    JOURNAL OF CLINICAL EPIDEMIOLOGY, 2004, 57 (05) : 447 - 453
  • [5] Sample Size Requirements for Applying Diagnostic Classification Models
    Sen, Sedat
    Cohen, Allan S.
    FRONTIERS IN PSYCHOLOGY, 2021, 11
  • [6] Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms
    Guo, Yu
    Graber, Armin
    McBurney, Robert N.
    Balasubramanian, Raji
    BMC BIOINFORMATICS, 2010, 11
  • [8] An Empirical Comparison of Machine Learning Algorithms for Classification of Software Requirements
    Li, Law Foong
    Jin-An, Nicholas Chia
    Kasirun, Zarinah Mohd
    Piaw, Chua Yan
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2019, 10 (11) : 258 - 263
  • [9] SENSITIVITY OF HYPERSPECTRAL CLASSIFICATION ALGORITHMS TO TRAINING SAMPLE SIZE
    Lee, Matthew A.
    Prasad, Saurabh
    Bruce, Lori Mann
    West, Terrance R.
    Reynolds, Daniel
    Irby, Trent
    Kalluri, Hemanth
    2009 FIRST WORKSHOP ON HYPERSPECTRAL IMAGE AND SIGNAL PROCESSING: EVOLUTION IN REMOTE SENSING, 2009, : 235 - +
  • [10] Dimensionality, sample size, and classification error of nonparametric linear classification algorithms
    Raudys, S
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1997, 19 (06) : 667 - 671