Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology

Cited by: 16

Authors:

Tkachev, Victor [1]

Sorokin, Maxim [1,2]

Borisov, Constantin [3]

Garazha, Andrew [1]

Buzdin, Anton [1,2,4,5]

Borisov, Nicolas [1,2,4]

Affiliations:

[1] OmicsWay Corp, Walnut, CA 91788 USA

[2] IM Sechenov First Moscow State Med Univ, Inst Personalized Med, Moscow 119991, Russia

[3] Natl Res Univ Higher Sch Econ, Moscow 101000, Russia

[4] Moscow Inst Phys & Technol, Moscow 141701, Russia

[5] Shemyakin Ovchinnikov Inst Bioorgan Chem, Moscow 117997, Russia

Source:

INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES | 2020 / Vol. 21 / Issue 03

Funding:

Russian Foundation for Basic Research;

Keywords:

bioinformatics; personalized medicine; oncology; chemotherapy; machine learning; omics profiling; COMPLETE RESPONSE; II ERROR; EXPRESSION; CLASSIFICATION; CHEMOTHERAPY; THERAPY; CANCER; BORTEZOMIB; INHIBITOR; SELECTION;

DOI:

10.3390/ijms21030713

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

(1) Background: Machine learning (ML) methods are rarely used for omics-based prescription of cancer drugs because of a shortage of case histories in which clinical outcomes are supplemented by high-throughput molecular data. This shortage causes overtraining and high vulnerability in most ML methods. Recently, we proposed a hybrid global-local approach to ML, termed the floating window projective separator (FloWPS), that avoids extrapolation in the feature space. Its core property is data trimming, i.e., sample-specific removal of irrelevant features. (2) Methods: Here, we applied FloWPS to seven popular ML methods: linear support vector machine (SVM), k nearest neighbors (kNN), random forest (RF), Tikhonov (ridge) regression (RR), binomial naive Bayes (BNB), adaptive boosting (ADA), and multi-layer perceptron (MLP). (3) Results: We performed computational experiments on 21 high-throughput gene expression datasets (41-235 samples per dataset) representing a total of 1778 cancer patients with known responses to chemotherapy treatments. FloWPS substantially improved classifier quality for all global ML methods (SVM, RF, BNB, ADA, MLP): the area under the receiver operating characteristic curve (ROC AUC) of the treatment response classifiers increased from the 0.61-0.88 range to 0.70-0.94. We tested FloWPS-empowered methods for overtraining by examining the importance of different features for different ML methods in the same model datasets. (4) Conclusions: We showed that FloWPS increases the correlation of feature importance between different ML methods, which indicates its robustness to overtraining. For all the datasets tested, the best performance of FloWPS data trimming was observed for the BNB method, which can be valuable for further building of ML classifiers in personalized oncology.
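
The abstract describes data trimming only as sample-specific removal of irrelevant features that avoids extrapolation in the feature space. The following is a minimal sketch of that idea, not the authors' implementation. The function names `trim_features` and `project`, the parameter `m`, and the keep rule itself (a feature is kept only when at least `m` training values lie on each side of the sample's value) are all assumptions made for illustration.

```python
from typing import List, Sequence


def trim_features(train: Sequence[Sequence[float]],
                  sample: Sequence[float],
                  m: int = 1) -> List[int]:
    """Return the indices of features kept for one sample.

    Assumed rule (illustrative only): feature j is kept when at least m
    training values are <= sample[j] and at least m are >= sample[j],
    so the sample is never classified by extrapolating along feature j.
    """
    kept = []
    for j, value in enumerate(sample):
        column = [row[j] for row in train]
        at_or_below = sum(1 for v in column if v <= value)
        at_or_above = sum(1 for v in column if v >= value)
        if at_or_below >= m and at_or_above >= m:
            kept.append(j)
    return kept


def project(rows: Sequence[Sequence[float]], kept: Sequence[int]) -> List[List[float]]:
    """Restrict each row to the kept feature indices."""
    return [[row[j] for j in kept] for row in rows]


# Toy data: three training samples with three features each.
train = [[1.0, 10.0, 0.2],
         [2.0, 12.0, 0.4],
         [3.0, 11.0, 0.3]]
sample = [2.5, 20.0, 0.3]  # the second feature lies outside the training range

kept = trim_features(train, sample, m=1)
trimmed_train = project(train, kept)
trimmed_sample = project([sample], kept)[0]
```

In this sketch the trimmed training set and trimmed sample would then be passed to any global classifier, so each sample gets its own feature subset.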

Pages: 20

50 records in total

[41] Pan-cancer classification of multi-omics data based on machine learning models
Cava, Claudia
Sabetian, Soudabeh
Salvatore, Christian
Castiglioni, Isabella
NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS, 2024, 13 (01)
[42] Precision Oncology beyond Targeted Therapy: Combining Omics Data with Machine Learning Matches the Majority of Cancer Cells to Effective Therapeutics
Ding, Michael Q.
Chen, Lujia
Cooper, Gregory F.
Young, Jonathan D.
Lu, Xinghua
MOLECULAR CANCER RESEARCH, 2018, 16 (02) : 269 - 278
[43] PERFORMANCE OF MACHINE LEARNING METHODS IN CLASSIFICATION MODELS WITH HIGH-DIMENSIONAL DATA
Zekic-Susac, Marijana
Pfeifer, Sanja
Sarlija, Natasa
SOR'13 PROCEEDINGS: THE 12TH INTERNATIONAL SYMPOSIUM ON OPERATIONAL RESEARCH IN SLOVENIA, 2013, : 219 - 224
[44] Performance comparison of Extreme Learning Machines and other machine learning methods on WBCD data set
Keskin, Omer Selim
Durdu, Akif
Aslan, Muhammet Fatih
Yusefi, Abdullah
29TH IEEE CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS (SIU 2021), 2021
[45] Selecting critical features for data classification based on machine learning methods
Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
Journal of Big Data, 7
[46] Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification
Ruiz-Chavez, Zoila
Salvador-Meneses, Jaime
Garcia-Rodriguez, Jose
INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2018, PT I, 2018, 11314 : 297 - 304
[47] Editorial: Machine Learning-Based Methods for RNA Data Analysis
Peng, Lihong
Yang, Jialiang
Wang, Minxian
Zhou, Liqian
FRONTIERS IN GENETICS, 2022, 13
[48] Machine learning based data governance methods for demand response databases
Wang, Yu
Tang, Bihong
JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING, 2024, 24 (02) : 907 - 920
[49] Water consumption prediction based on machine learning methods and public data
Kesornsit, Witwisit
Sirisathitkul, Yaowarat
ADVANCES IN COMPUTATIONAL DESIGN, AN INTERNATIONAL JOURNAL, 2022, 7 (02) : 113 - 128
[50] Machine Learning Methods for Credibility Assessment of Interviewees Based on Posturographic Data
Saripalle, Sashi K.
Vemulapalli, Spandana
King, Gregory W.
Burgoon, Judee K.
Derakhshani, Reza
2015 37TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2015, : 6708 - 6711
