Addressing erroneous scale assumptions in microbe and gene set enrichment analysis

Cited by: 2
Authors
McGovern, Kyle C. [1]
Nixon, Michelle Pistner [2]
Silverman, Justin D. [1, 2, 3, 4, 5]
Affiliations
[1] Penn State Univ, Program Bioinformat & Genom, State Coll, PA 16801 USA
[2] Penn State Univ, Coll Informat Sci & Technol, State Coll, PA 16801 USA
[3] Penn State Univ, Dept Med, State Coll, PA 16801 USA
[4] Penn State Univ, Dept Stat, State Coll, PA 16801 USA
[5] Penn State Univ, Inst Computat & Data Sci, State Coll, PA 16801 USA
Keywords
Bacteria - Genes - Risk perception - Sensitivity analysis;
DOI
10.1371/journal.pcbi.1011659
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
By applying Differential Set Analysis (DSA) to sequence count data, researchers can determine whether groups of microbes or genes are differentially enriched. Yet sequence count data suffer from a scale limitation: these data lack information about the scale (i.e., size) of the biological system under study, leading some authors to call these data compositional (i.e., proportional). In this article, we show that commonly used DSA methods that rely on normalization make strong, implicit assumptions about the unmeasured system scale. We show that even small errors in these scale assumptions can lead to positive predictive values as low as 9%. To address this problem, we take three novel approaches. First, we introduce a sensitivity analysis framework to identify when modeling results are robust to such errors and when they are suspect. Unlike standard benchmarking studies, this framework does not require ground-truth knowledge and can therefore be applied to both simulated and real data. Second, we introduce a statistical test that provably controls Type-I error at a nominal rate despite errors in scale assumptions. Finally, we discuss how the impact of scale limitations depends on a researcher's scientific goals and provide tools that researchers can use to evaluate whether their goals are at risk from erroneous scale assumptions. Overall, the goal of this article is to catalyze future research into the impact of scale limitations in analyses of sequence count data; to illustrate that scale limitations can lead to inferential errors in practice; yet to also show that rigorous and reproducible scale reliant inference is possible if done carefully.

A common task in the analysis of DNA sequence count data is to determine whether sets of biologically related genes or microbes are differentially enriched between two experimental conditions (Differential Set Analysis; DSA). Yet DSA can be confounded by the non-biological (i.e., technical) variation in sequencing depth. To address this issue, many researchers use normalization techniques to remove this variation. The choice of normalization can dominate modeling results yet we lack tools for properly validating this decision. Here we develop statistical and computational tools that allow researchers to quantify the robustness of analytical results to the choice of normalization. These methods aim to improve the rigor and reproducibility of commonly performed set enrichment analyses.
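
The scale limitation described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' method or software; it relies only on the identity that absolute abundance equals proportion times total scale, so a log fold change in absolute abundance is the log fold change in proportions plus an assumed log fold change in scale. The function name `log2_fold_changes`, the proportions, and the swept scale shifts are all hypothetical values chosen for illustration.

```python
import math


def log2_fold_changes(props_a, props_b, log2_scale_shift):
    """Log2 fold change in absolute abundance for each taxon.

    Since abundance = proportion * scale, the fold change splits into the
    observed change in proportions plus an ASSUMED change in total scale.
    """
    return [math.log2(b / a) + log2_scale_shift
            for a, b in zip(props_a, props_b)]


# Hypothetical proportions of three taxa, one sample per condition.
props_control = [0.5, 0.3, 0.2]
props_treated = [0.4, 0.3, 0.3]

# Sweep the assumed log2 change in total microbial load between conditions.
for shift in (-1.0, 0.0, 1.0):
    lfc = log2_fold_changes(props_control, props_treated, shift)
    print(shift, [round(x, 3) for x in lfc])
```

In this toy example the first taxon appears depleted when the total scale is assumed unchanged but enriched when the scale is assumed to double, which is the kind of sensitivity to scale assumptions the abstract refers to.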
Pages: 16
Related papers
50 items in total
  • [21] Gene set enrichment analysis for multiple continuous phenotypes
    Xiaoming Wang
    Saumyadipta Pyne
    Irina Dinu
    BMC Bioinformatics, 15
  • [22] pyPAGE: A framework for Addressing biases in gene-set enrichment analysis-A case study on Alzheimer's disease
    Bakulin, Artemy
    Teyssier, Noam B.
    Kampmann, Martin
    Khoroshkin, Matvei
    Goodarzi, Hani
    PLOS COMPUTATIONAL BIOLOGY, 2024, 20 (09)
  • [23] Gene set enrichment analysis (GSEA) for interpreting gene expression profiles
    Shi, Jing
    Walker, Michael G.
    CURRENT BIOINFORMATICS, 2007, 2 (02) : 133 - 137
  • [24] The limitations of simple gene set enrichment analysis assuming gene independence
    Tamayo, Pablo
    Steinhardt, George
    Liberzon, Arthur
    Mesirov, Jill P.
    STATISTICAL METHODS IN MEDICAL RESEARCH, 2016, 25 (01) : 472 - 487
  • [25] Extensions to gene set enrichment
    Jiang, Zhen
    Gentleman, Robert
    BIOINFORMATICS, 2007, 23 (03) : 306 - 313
  • [26] Differential Gene Set Enrichment Analysis: a statistical approach to quantify the relative enrichment of two gene sets
    Joly, James H.
    Lowry, William E.
    Graham, Nicholas A.
    BIOINFORMATICS, 2021, 36 (21) : 5247 - 5254
  • [27] Differential Gene Set Enrichment Analysis: a statistical approach to quantify the relative enrichment of two gene sets
    Joly, James H.
    Lowry, William E.
    Graham, Nicholas A.
    BIOINFORMATICS, 2020, 36 (21) : 5247 - 5254
  • [28] Analysis of breast cancer recurrence using gene set enrichment analysis
    Kumar, Anupama Praveen
    Kovatich, Albert J.
    Biancotto, Angelique
    Cheung, Foo
    Davidson-Moncada, Jan K.
    Kvecher, Leonid
    Liu, Jianfang
    Ru, Yuanbin
    Kovatich, Audrey W.
    Deyarmin, Brenda
    Fantacone-Campbell, Jamie Leigh
    Hooke, Jeffrey A.
    Kumar, Praveen Kumar Raj
    Rui, Hallgeir
    Hu, Hai
    Shriver, Craig D.
    CANCER RESEARCH, 2018, 78 (04)
  • [29] ENRICHMENT OF PATIENTS WITH EHLERS DANLOS SYNDROME IN IDIOPATHIC GASTROPARESIS - A GENE SET ENRICHMENT ANALYSIS
    Smieszek, Sandra P.
    Carlin, Jesse L.
    Birznieks, Gunther
    Polymeropoulos, Mihael H.
    GASTROENTEROLOGY, 2022, 162 (07) : S709 - S709
  • [30] Ranking metrics in gene set enrichment analysis: do they matter?
    Joanna Zyla
    Michal Marczyk
    January Weiner
    Joanna Polanska
    BMC Bioinformatics, 18