The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference

Cited by: 351
Authors
Lemmon, Alan R. [1]
Brown, Jeremy M. [1]
Stanger-Hall, Kathrin [2]
Lemmon, Emily Moriarty [1]
Affiliations
[1] Univ Texas Austin, Sect Integrat Biol, Austin, TX 78712 USA
[2] Univ Georgia, Dept Plant Biol, Athens, GA 30602 USA
Funding
National Science Foundation (USA)
Keywords
Ambiguous characters; ambiguous data; Bayesian; bias; maximum likelihood; missing data; model misspecification; phylogenetics; posterior probabilities; prior; MISSING DATA; MOLECULAR PHYLOGENETICS; DNA-SEQUENCES; POSTERIOR PROBABILITY; INCOMPLETE TAXA; EVOLUTION; TREE; PARSIMONY; HETEROGENEITY; HETEROTACHY;
DOI
10.1093/sysbio/syp017
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Although an increasing number of phylogenetic data sets are incomplete, the effect of ambiguous data on phylogenetic accuracy is not well understood. We use 4-taxon simulations to study the effects of ambiguous data (i.e., missing characters or gaps) in maximum likelihood (ML) and Bayesian frameworks. By introducing ambiguous data in a way that removes confounding factors, we provide the first clear understanding of 1 mechanism by which ambiguous data can mislead phylogenetic analyses. We find that in both ML and Bayesian frameworks, among-site rate variation can interact with ambiguous data to produce misleading estimates of topology and branch lengths. Furthermore, within a Bayesian framework, priors on branch lengths and rate heterogeneity parameters can exacerbate the effects of ambiguous data, resulting in strongly misleading bipartition posterior probabilities. The magnitude and direction of the ambiguous data bias are a function of the number and taxonomic distribution of ambiguous characters, the strength of topological support, and whether or not the model is correctly specified. The results of this study have major implications for all analyses that rely on accurate estimates of topology or branch lengths, including divergence time estimation, ancestral state reconstruction, tree-dependent comparative methods, rate variation analysis, phylogenetic hypothesis testing, and phylogeographic analysis.
Pages: 130-145
Number of pages: 16