Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology

Cited by: 16
Authors
Tkachev, Victor [1]
Sorokin, Maxim [1,2]
Borisov, Constantin [3]
Garazha, Andrew [1]
Buzdin, Anton [1,2,4,5]
Borisov, Nicolas [1,2,4]
Affiliations
[1] OmicsWayCorp, Walnut, CA 91788 USA
[2] IM Sechenov First Moscow State Med Univ, Inst Personalized Med, Moscow 119991, Russia
[3] Natl Res Univ Higher Sch Econ, Moscow 101000, Russia
[4] Moscow Inst Phys & Technol, Moscow 141701, Russia
[5] Shemyakin Ovchinnikov Inst Bioorgan Chem, Moscow 117997, Russia
Funding
Russian Foundation for Basic Research;
Keywords
bioinformatics; personalized medicine; oncology; chemotherapy; machine learning; omics profiling; COMPLETE RESPONSE; II ERROR; EXPRESSION; CLASSIFICATION; CHEMOTHERAPY; THERAPY; CANCER; BORTEZOMIB; INHIBITOR; SELECTION;
DOI
10.3390/ijms21030713
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs because of a shortage of case histories in which clinical outcomes are supplemented by high-throughput molecular data. This shortage causes overtraining and makes most ML methods highly vulnerable. Recently, we proposed a hybrid global-local approach to ML, termed the floating window projective separator (FloWPS), that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods: linear SVM, k-nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naive Bayes (BNB), adaptive boosting (ADA) and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41-235 samples per dataset) representing a total of 1778 cancer patients with known responses to chemotherapy treatments. FloWPS substantially improved classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP): the area under the receiver operating characteristic curve (ROC AUC) of the treatment response classifiers increased from the 0.61-0.88 range to the 0.70-0.94 range. We tested FloWPS-empowered methods for overtraining by examining the importance of different features for different ML methods in the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between the different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, which may be valuable for building future ML classifiers in personalized oncology.
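The data trimming idea described in the abstract (sample-specific removal of features so that the classifier does not extrapolate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `trim_features`, the threshold `m`, and the rule "keep a feature only if at least `m` training values lie on each side of the sample's value" are assumptions made for illustration.

```python
from typing import List, Sequence


def trim_features(
    train: Sequence[Sequence[float]],
    sample: Sequence[float],
    m: int = 1,
) -> List[int]:
    """Return the indices of features kept for one sample.

    Assumed rule (hypothetical, for illustration only): a feature is kept
    when at least `m` training values are <= the sample's value and at
    least `m` training values are >= it, so the sample lies inside the
    range covered by the training data along that feature.
    """
    kept = []
    for j, value in enumerate(sample):
        column = [row[j] for row in train]
        below = sum(1 for x in column if x <= value)
        above = sum(1 for x in column if x >= value)
        if below >= m and above >= m:
            kept.append(j)
    return kept


# Toy data: three training samples, three features (hypothetical values).
train = [
    [1.0, 5.0, 0.2],
    [2.0, 6.0, 0.4],
    [3.0, 7.0, 0.6],
]
sample = [2.5, 9.0, 0.3]

# Feature 1 is dropped: 9.0 lies above every training value.
kept = trim_features(train, sample, m=1)
```

After trimming, a global classifier would be trained and applied using only the kept features for that sample; a larger `m` trims more aggressively.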
Pages: 20