A comparison of survival analysis methods for cancer gene expression RNA-Sequencing data

Cited by: 10
Authors
Raman, Pichai [1, 2, 8]
Zimmerman, Samuel [3]
Rathi, Komal S. [2, 8]
de Torrente, Laurence [3, 11]
Sarmady, Mahdi [4, 9]
Wu, Chao [4]
Leipzig, Jeremy [4, 5]
Taylor, Deanne M. [2, 10]
Tozeren, Aydin [1]
Mar, Jessica C. [3, 6, 7]
Affiliations
[1] Drexel Univ, Sch Biomed Engn Sci & Hlth Syst, Philadelphia, PA 19104 USA
[2] Childrens Hosp Philadelphia, Dept Biomed & Hlth Informat, Philadelphia, PA 19104 USA
[3] Albert Einstein Coll Med, Dept Syst & Computat Biol, Bronx, NY 10467 USA
[4] Childrens Hosp Philadelphia, Dept Pathol & Lab Med, Div Genom Diagnost, Philadelphia, PA 19104 USA
[5] Drexel Univ, Coll Comp & Informat, Philadelphia, PA 19104 USA
[6] Albert Einstein Coll Med, Dept Epidemiol & Populat Hlth, Bronx, NY 10467 USA
[7] Univ Queensland, Australian Inst Bioengn & Nanotechnol, Brisbane, Qld, Australia
[8] Childrens Hosp Philadelphia, Ctr Data Driven Discovery Biomed, Philadelphia, PA 19104 USA
[9] Univ Penn, Perelman Sch Med, Dept Pathol & Lab Med, Philadelphia, PA 19104 USA
[10] Univ Penn, Dept Pediat, Perelman Sch Med, Philadelphia, PA 19104 USA
[11] New York Genome Ctr, New York, NY USA
Funding
Australian Research Council
Keywords
Survival analysis; Kaplan-Meier; TCGA; Cancer; Gene expression; PROSTATE-CANCER; TRANSITION; NORMALITY; PROFILES; INDEX;
DOI
10.1016/j.cancergen.2019.04.004
Chinese Library Classification number
R73 [Oncology]
Discipline classification code
100214
Abstract
Identifying genetic biomarkers of patient survival remains a major goal of large-scale cancer profiling studies. Using gene expression data to predict the outcome of a patient's tumor makes biomarker discovery a compelling tool for improving patient care. As genomic technologies expand, multiple data types may serve as informative biomarkers, and bioinformatic strategies have evolved around these different applications. For categorical variables such as a gene's mutation status, identifying biomarkers that predict survival time is straightforward. However, for continuous variables like gene expression, the available methods generate highly variable results, and studies on best practices are lacking. We investigated the performance of eight methods that deal specifically with continuous data. K-means, Cox regression, concordance index (C-index), D-index, 25th-75th percentile split, median split, distribution-based splitting, and KaplanScan were applied to four RNA-sequencing (RNA-seq) datasets from The Cancer Genome Atlas (TCGA). The reliability of the eight methods was assessed by splitting each dataset into two groups and comparing the overlap of the results. Gene sets that had been identified from the literature for a specific tumor type served as positive controls to assess the accuracy of each biomarker using receiver operating characteristic (ROC) curves. Artificial RNA-seq data were generated to test the robustness of these methods under fixed levels of gene expression noise. Our results show that methods based on dichotomizing tend to have consistently poor performance, while C-index, D-index, and k-means perform well in most settings. Overall, the Cox regression method had the strongest performance based on tests of accuracy, reliability, and robustness.
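As an illustration of the contrast the abstract draws between dichotomizing methods and rank-based measures, the following minimal Python sketch scores a single gene in two ways: a median split followed by a two-group log-rank test, and Harrell's concordance index computed on the continuous expression values. It also includes a simple overlap measure of the kind that could be used to compare gene lists obtained from two halves of a dataset. This is not the authors' implementation; all function names, the toy data, and the choice of the Jaccard index as the overlap measure are assumptions made for illustration.

```python
import math
from statistics import median


def median_split(expression):
    """Dichotomize expression: 1 if above the median, else 0."""
    cut = median(expression)
    return [1 if x > cut else 0 for x in expression]


def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    obs_minus_exp = 0.0
    variance = 0.0
    # Walk through each distinct time at which an event was observed.
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [g for tt, g in zip(times, groups) if tt >= t]
        n = len(at_risk)          # patients still at risk
        n1 = sum(at_risk)         # of those, patients in group 1
        d = sum(1 for tt, e in zip(times, events) if e and tt == t)
        d1 = sum(1 for tt, e, g in zip(times, events, groups)
                 if e and tt == t and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def chi2_pvalue_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def concordance_index(times, events, scores):
    """Harrell's C: share of comparable pairs in which the patient with
    the shorter observed survival has the higher score (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if i had an event strictly before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def jaccard(genes_a, genes_b):
    """Overlap of two gene sets (assumed measure, not from the source)."""
    a, b = set(genes_a), set(genes_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical toy data: higher expression goes with shorter survival.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
times = [80, 70, 60, 50, 40, 30, 20, 10]
events = [1, 1, 1, 0, 1, 1, 1, 1]  # 0 marks a censored patient

groups = median_split(expression)
chi2 = logrank_chi2(times, events, groups)
p_value = chi2_pvalue_1df(chi2)
c_index = concordance_index(times, events, expression)
overlap = jaccard({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_B", "GENE_C", "GENE_D"})

print(f"log-rank chi2={chi2:.3f}, p={p_value:.4f}, C-index={c_index:.3f}")
print(f"overlap between two half-dataset gene lists: {overlap:.2f}")
```

The median split discards the ordering of patients within each group, whereas the concordance index uses every comparable pair, which is one intuition for why continuous measures may behave more stably than dichotomizing ones. In practice a maintained survival analysis library would be preferable to these hand-written routines.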
Pages: 1-12
Number of pages: 12