Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study

Cited by: 0
Authors
Lee, Kyung Hyun [1]
Assassi, Shervin [2]
Mohan, Chandra [3]
Pedroza, Claudia [1]
Affiliations
[1] Univ Texas Hlth Sci Ctr Houston, Inst Clin Res & Learning Hlth Care, McGovern Med Sch, Dept Pediat, Houston, TX 77030 USA
[2] Univ Texas Hlth Sci Ctr Houston, Dept Internal Med Rheumatol, Houston, TX USA
[3] Univ Houston, Dept Biomed Engn, Houston, TX USA
Source
BMC GENOMICS | 2024 / Vol. 25 / Issue 01
Keywords
Machine learning; Proteomics data; Performance metrics; Small sample sizes; PREDICTIVE MODELS; APTAMERS;
DOI
10.1186/s12864-024-11018-2
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.
Results: In this study, we compared 9 unique analysis schemes composed of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in an exceptionally high level of performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the set of proteins selected to be discriminatory between groups demonstrated a substantial level of heterogeneity. However, despite heterogeneity in the selected proteins, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that the stability of a set of selected biomarkers improves with larger sample sizes within a scheme.
Conclusions: When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.
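The abstract's observation that selected-protein sets become more stable as sample size grows can be illustrated with a small simulation. The sketch below is not the authors' pipeline: it assumes independent Gaussian features, a hypothetical 20 truly differential proteins with a mean shift of 1.0, and a simple top-k ranking by Welch t-statistic as the selection rule. The function names (`top_k_by_t`, `jaccard`, `selection_stability`) and all parameter defaults other than the 1317 proteins and 13 subjects per group are illustrative assumptions.

```python
import numpy as np


def top_k_by_t(X, y, k):
    """Return the indices of the k features with the largest |Welch t|."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (b.mean(axis=0) - a.mean(axis=0)) / se
    return set(np.argsort(-np.abs(t))[:k].tolist())


def jaccard(s1, s2):
    """Jaccard index of two sets: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)


def selection_stability(n_per_group, n_features=1317, n_signal=20,
                        effect=1.0, k=20, n_rep=30, seed=0):
    """Mean pairwise Jaccard index of the top-k sets across simulated datasets.

    Each replicate draws a fresh dataset with n_per_group controls and
    n_per_group cases; the first n_signal features are shifted in cases.
    All of these settings are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_group)
    selected = []
    for _ in range(n_rep):
        X = rng.normal(size=(2 * n_per_group, n_features))
        X[y == 1, :n_signal] += effect  # shift the signal features in cases
        selected.append(top_k_by_t(X, y, k))
    scores = [jaccard(selected[i], selected[j])
              for i in range(n_rep) for j in range(i + 1, n_rep)]
    return float(np.mean(scores))


if __name__ == "__main__":
    for n in (13, 50, 100):
        print(n, round(selection_stability(n), 3))
```

Under these assumptions, the mean pairwise Jaccard index should rise toward 1 as the number of subjects per group increases, which mirrors the direction of the sensitivity analysis described in the abstract without reproducing its schemes or data.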
Pages: 10
Related papers
50 in total
  • [31] Addressing challenges in the production and analysis of illumina sequencing data
    Martin Kircher
    Patricia Heyn
    Janet Kelso
    BMC Genomics, 12
  • [32] A simulation study of sample size for DNA barcoding
    Luo, Arong
    Lan, Haiqiang
    Ling, Cheng
    Zhang, Aibing
    Shi, Lei
    Ho, Simon Y. W.
    Zhu, Chaodong
    ECOLOGY AND EVOLUTION, 2015, 5 (24) : S869 - S879
  • [33] Sample size and statistical power in the small-animal analgesia literature
    Hofmeister, E. H.
    King, J.
    Read, M. R.
    Budsberg, S. C.
    JOURNAL OF SMALL ANIMAL PRACTICE, 2007, 48 (02) : 76 - 79
  • [34] Sample Size and Data Monitoring for Clinical Trials With Extremely Low Incidence Rates
    Shein-Chung Chow
    Shih-Ting Chiu
    Therapeutic Innovation & Regulatory Science, 2013, 47 : 438 - 446
  • [35] Sample Size and Data Monitoring for Clinical Trials With Extremely Low Incidence Rates
    Chow, Shein-Chung
    Chiu, Shih-Ting
    THERAPEUTIC INNOVATION & REGULATORY SCIENCE, 2013, 47 (04) : 438 - 446
  • [36] THE ANALYSIS OF SMALL SAMPLE DYNAMIC DATA
    THAYER, JF
    WOOD, P
    PSYCHOPHYSIOLOGY, 1995, 32 : S2 - S2
  • [37] Design, statistical analysis and sample size calculation of dose response study of telmisartan and hydrochlorothiazide
    Horie, Yoshiharu
    Higaki, Jitsuo
    Takeuchi, Masahiro
    CONTEMPORARY CLINICAL TRIALS, 2007, 28 (05) : 647 - 653
  • [38] Sample size and predictive performance of machine learning methods with survival data: A simulation study
    Infante, Gabriele
    Miceli, Rosalba
    Ambrogi, Federico
    STATISTICS IN MEDICINE, 2023, 42 (30) : 5657 - 5675
  • [39] Weighting by inverse variance or by sample size in meta-analysis: A simulation study
    Sanchez-Meca, J
    Marin-Martinez, F
    EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1998, 58 (02) : 211 - 220
  • [40] Impact of correlation structure on sample size requirements of statistical methods for multiple binary outcomes: A simulation study
    Fuyama, Kanako
    Sakamaki, Kentaro
    Uemura, Kohei
    Yokota, Isao
    CLINICAL TRIALS, 2025,