Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

Cited by: 43

Authors:

Yasrebi, Haleh

Sperisen, Peter

Praz, Viviane

Bucher, Philipp

Affiliations:

[1] Swiss Institute for Experimental Cancer Research (ISREC), Swiss Federal Institute of Technology (EPFL), School of Life Sciences, Lausanne

[2] Swiss Institute of Bioinformatics, EPFL SV ISREC, Lausanne

Source:

PLOS ONE | 2009 / Volume 4 / Issue 10

Keywords:

BREAST-CANCER; MICROARRAY DATA; ESTROGEN-RECEPTOR; HISTOLOGIC GRADE; MARKER GENES; SIGNATURE; PLATFORM; CLASSIFICATION; CARCINOMAS; SUBTYPES;

DOI:

10.1371/journal.pone.0007431

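Chinese Library Classification (CLC) number:

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Subject classification code: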

07 ; 0710 ; 09 ;

Abstract:

Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis and prognosis, as well as the prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not at all clear whether the advantage of increased sample size outweighs the disadvantage of the higher heterogeneity of merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.

Results: Using the time-dependent Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and the hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.

Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used; (b) the heterogeneity of patient cohorts; (c) the heterogeneity of breast cancer as a disease; (d) substantial variation in time to death or relapse; and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors like CYB5D1 expression.
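The abstract notes that genes missing from some microarray platforms were not retained in the merged data sets. The following is a minimal illustrative sketch of that idea, not the authors' pipeline: it keeps only genes common to all data sets, standardizes each gene within its own data set (an assumed, simple batch adjustment), and scores a single-gene predictor with Harrell's concordance index, which is swapped in here for the Cox regression, time-dependent ROC-AUC and hazard ratio used in the paper. The gene names, expression values, survival times and function names are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_per_gene(dataset, genes):
    """Standardize each gene within one data set (assumed batch adjustment)."""
    scaled = {}
    for gene in genes:
        values = dataset[gene]
        mu, sd = mean(values), pstdev(values)
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def merge_datasets(datasets):
    """Keep only genes present in every data set, then concatenate samples."""
    common = sorted(set.intersection(*(set(d) for d in datasets)))
    merged = {gene: [] for gene in common}
    for dataset in datasets:
        scaled = zscore_per_gene(dataset, common)
        for gene in common:
            merged[gene].extend(scaled[gene])
    return merged


def concordance_index(risk, time, event):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i had an observed event
            # strictly before the follow-up time of sample j.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Hypothetical toy data: GENE_C exists on only one platform.
study_a = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [5.0, 3.0, 1.0],
    "GENE_C": [0.1, 0.2, 0.3],
}
study_b = {
    "GENE_A": [10.0, 20.0, 30.0, 40.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
}

merged = merge_datasets([study_a, study_b])

# Hypothetical follow-up times and event indicators (1 = death or relapse),
# ordered as study_a samples followed by study_b samples.
time = [9.0, 5.0, 2.0, 8.0, 6.0, 4.0, 1.0]
event = [0, 1, 1, 0, 1, 1, 1]

c_gene_a = concordance_index(merged["GENE_A"], time, event)
c_gene_b = concordance_index(merged["GENE_B"], time, event)
print(sorted(merged), c_gene_a, c_gene_b)
```

In this toy setup GENE_C is dropped by the merge because it is absent from the second data set, which mirrors the loss of platform-specific genes described in the Results. A value of C near 1 means higher expression goes with earlier events, while a value near 0 means the opposite direction.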

Pages: 14

