A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements

Authors
Gautam Dasarathy
Elchanan Mossel
Robert Nowak
Sebastien Roch
Affiliations
[1] Arizona State University, School of Electrical, Computer, and Energy Engineering
[2] Massachusetts Institute of Technology, Department of Mathematics and IDSS
[3] University of Wisconsin, Department of Electrical and Computer Engineering
[4] University of Wisconsin, Department of Mathematics
Keywords
Phylogenetic reconstruction; Coalescent; Gene tree/species tree; Distance methods; Data requirement
Mathematics Subject Classification: 60K35; 92D15
Abstract
Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene (the gene trees) often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard but unsatisfactory assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error, or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci $m$ needed for an accurate reconstruction and the length of the locus sequences $k$. It was shown that to reconstruct an internal branch of length $f$, one needs $m$ to be of the order of $1/[f^{2} \sqrt{k}]$. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform.
As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with $n \ge 3$ species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
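The trade-off stated in the abstract, with $m$ of the order of $1/[f^{2} \sqrt{k}]$, can be illustrated with a small numerical sketch. The function name `loci_order` and the constant `c` are assumptions introduced here for illustration only. The abstract gives the scaling, not the constant, so the values below are meaningful only as ratios.

```python
import math


def loci_order(f: float, k: int, c: float = 1.0) -> float:
    """Order-of-magnitude number of loci m ~ c / (f^2 * sqrt(k)).

    f : length of the internal branch to be reconstructed
    k : length of each locus sequence
    c : hypothetical proportionality constant (not given in the abstract)
    """
    if f <= 0 or k <= 0:
        raise ValueError("f and k must be positive")
    return c / (f ** 2 * math.sqrt(k))


if __name__ == "__main__":
    # Fix the branch length and vary the sequence length:
    # quadrupling k halves the order of m.
    for k in (100, 400, 1600):
        print(f"k={k:5d}  m ~ {loci_order(0.1, k):.2f}")

    # Fix the sequence length and vary the branch length:
    # halving f multiplies the order of m by four.
    for f in (0.1, 0.05, 0.025):
        print(f"f={f:.3f}  m ~ {loci_order(f, 100):.2f}")
```

Under this scaling, longer loci reduce the number of loci needed only at a square-root rate, while shorter internal branches increase it quadratically.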
Published in
Dasarathy, Gautam; Mossel, Elchanan; Nowak, Robert; Roch, Sebastien. A stochastic Farris transform for genetic data under the multispecies coalescent with applications to data requirements. Journal of Mathematical Biology, 2022, 84 (05).