Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

Cited by: 43
Authors
Yasrebi, Haleh
Sperisen, Peter
Praz, Viviane
Bucher, Philipp
Affiliations
[1] Swiss Institute for Experimental Cancer Research (ISREC), Swiss Federal Institute of Technology (EPFL), School of Life Sciences, Lausanne
[2] Swiss Institute of Bioinformatics, EPFL SV ISREC, Lausanne
Source
PLOS ONE | 2009 / Volume 4 / Issue 10
Keywords
BREAST-CANCER; MICROARRAY DATA; ESTROGEN-RECEPTOR; HISTOLOGIC GRADE; MARKER GENES; SIGNATURE; PLATFORM; CLASSIFICATION; CARCINOMAS; SUBTYPES;
DOI
10.1371/journal.pone.0007431
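Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification code
07 ; 0710 ; 09 ;
Abstract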
Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, and prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not clear whether the advantage of increased sample size outweighs the disadvantage of the higher heterogeneity of merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.
Results: Using the time-dependent Receiver Operating Characteristic Area Under the Curve (ROC-AUC) and the hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.
Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer as a disease, (d) substantial variation in time to death or relapse, and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors such as CYB5D1 expression.
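As a rough illustration of the kind of single-gene Cox regression the abstract refers to, the following sketch fits a one-covariate Cox model and reports the hazard ratio. It is not the authors' pipeline: the function name `fit_cox_single_gene`, the Breslow handling of ties, and all of the data values are assumptions made for this example, and the toy expression values are invented rather than taken from any CYB5D1 measurement.

```python
import math

def fit_cox_single_gene(times, events, expr, iters=50, tol=1e-9):
    """Fit a one-covariate Cox model by Newton-Raphson on the partial likelihood.

    Hypothetical helper for illustration; ties use the Breslow approximation.
    Returns (beta, hazard_ratio), where hazard_ratio = exp(beta) is the
    relative hazard per unit increase in expression.
    """
    beta = 0.0
    for _ in range(iters):
        grad, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, expr):
            if not d_i:
                continue  # censored patients contribute only through risk sets
            # Risk set: everyone still under observation at time t_i.
            risk = [x_j for t_j, x_j in zip(times, expr) if t_j >= t_i]
            w = [math.exp(beta * x_j) for x_j in risk]
            s0 = sum(w)
            s1 = sum(w_j * x_j for w_j, x_j in zip(w, risk))
            s2 = sum(w_j * x_j * x_j for w_j, x_j in zip(w, risk))
            mean = s1 / s0
            grad += x_i - mean            # score contribution
            info += s2 / s0 - mean * mean  # observed information contribution
        if info <= 0.0:
            break  # no curvature left; stop rather than divide by zero
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta, math.exp(beta)

# Invented toy cohort: follow-up time, event flag (1 = death/relapse,
# 0 = censored), and expression level of one gene.
times = [2, 3, 5, 7, 8, 11, 12, 15]
events = [1, 1, 1, 0, 1, 1, 0, 1]
expr = [1.2, 0.4, 1.8, 0.9, 0.3, 1.5, 0.6, 1.0]

beta, hazard_ratio = fit_cox_single_gene(times, events, expr)
print(f"beta = {beta:.3f}, hazard ratio = {hazard_ratio:.3f}")
```

A hazard ratio above 1 would indicate that higher expression is associated with shorter survival in the toy cohort. In practice an established implementation (for example in a survival analysis library) would be preferable to this hand-rolled fit.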
Pages: 14
Related papers
50 items in total
  • [41] Intrinsic bias in breast cancer gene expression data sets
    Jonathan D Mosley
    Ruth A Keri
    BMC Cancer, 9
  • [42] Semi-automated clustering of gene expression data sets
    Kim, Minho
    Jung, Ho-Youl
    Chung, Myungguen
    Kim, Pora
    Park, Seon-Hee
    Park, Soo-Jun
    2007 ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY, VOLS 1-16, 2007, : 4625 - 4628
  • [43] Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data
    Nathan L Tintle
    Alexandra Sitarik
    Benjamin Boerema
    Kylie Young
    Aaron A Best
    Matthew DeJongh
    BMC Bioinformatics, 13
  • [44] Evaluating the consistency of gene sets used in the analysis of bacterial gene expression data
    Tintle, Nathan L.
    Sitarik, Alexandra
    Boerema, Benjamin
    Young, Kylie
    Best, Aaron A.
    DeJongh, Matthew
    BMC BIOINFORMATICS, 2012, 13
  • [45] Improved LLE and neighborhood rough sets-based gene selection using Lebesgue measure for cancer classification on gene expression data
    Sun, Lin
    Wang, Wei
    Xu, Jiucheng
    Zhang, Shiguang
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2019, 37 (04) : 5731 - 5742
  • [46] Test data sets and evaluation of gene prediction programs on the rice genome
    Li, H
    Liu, JS
    Xu, Z
    Jin, J
    Fang, L
    Gao, L
    Li, YD
    Xing, ZX
    Gao, SG
    Liu, T
    Li, HH
    Li, Y
    Fang, LJ
    Xie, HM
    Zheng, WM
    Hao, BL
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2005, 20 (04) : 446 - 453
  • [47] Subclassification and Individual Survival Time Prediction from Gene Expression Data of Neuroblastoma Patients by Using CASPAR
    Oberthuer, Andre
    Kaderali, Lars
    Kahlert, Yvonne
    Hero, Barbara
    Westermann, Frank
    Berthold, Frank
    Brors, Benedikt
    Eils, Roland
    Fischer, Matthias
    CLINICAL CANCER RESEARCH, 2008, 14 (20) : 6590 - 6601
  • [48] Test Data Sets and Evaluation of Gene Prediction Programs on the Rice Genome
    Heng Li
    Jin-Song Liu
    Zhao Xu
    Jiao Jin
    Lin Fang
    Lei Gao
    Yu-Dong Li
    Zi-Xing Xing
    Shao-Gen Gao
    Tao Liu
    Hai-Hong Li
    Yan Li
    Li-Jun Fang
    Hui-Min Xie
    Wei-Mou Zheng
    Bai-Lin Hao
    Journal of Computer Science and Technology, 2005, 20 : 446 - 453
  • [49] Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data
    Lopez-Garcia, Guillermo
    Jerez, Jose M.
    Franco, Leonardo
    Veredas, Francisco J.
    PLOS ONE, 2020, 15 (03):
  • [50] Classification and prediction of survival in hepatocellular carcinoma by gene expression profiling
    Lee, JS
    Chu, IS
    Heo, J
    Calvisi, DF
    Sun, ZT
    Roskams, T
    Durnez, A
    Demetris, AJ
    Thorgeirsson, SS
    HEPATOLOGY, 2004, 40 (03) : 667 - 676