Reference-Free Imputation of Targeted Next-Generation Sequence Datasets

Authors
Nampally, Arun [1]
Kim, Joseph [1]
Proffitt, Eric
Palovcak, Eugene [2]
Lacoste, Alix [1]
Affiliations
[1] Invitae Corp, San Francisco, CA 94103 USA
[2] Generate Biomed, Somerville, MA USA
Source
14TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS, BCB 2023 | 2023
Keywords
Imputation; next-generation sequencing; targeted sequencing; hidden Markov model; Bayesian non-parametrics
DOI
10.1145/3584371.3613047
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Due to the mainstream adoption of clinical genetic testing, labs routinely have access to millions of genomic sequences that were produced as a result of ordered tests. These sequences are primarily used for reporting on specific genes and disease conditions, yet they contain valuable population-scale genetic information, which remains underutilized. The distribution of haplotypes in the population is one such piece of information that can be gleaned from sequence datasets and can be used to power downstream applications like association studies. While large datasets of genomic sequences can be informative about haplotype distributions, the nature of the sequencing performed can confound the reconstruction of haplotypes. Specifically, phased whole genomes are better suited for this purpose than unphased targeted sequences, but the latter are more abundant. In our work, we address the specific challenges arising from the use of large datasets of targeted sequencing for recovering haplotypes. Leveraging targeted genomic sequences for genome-wide association studies requires the variants in the non-targeted regions to be imputed. This is commonly done using a variant of the Li-Stephens recombination model, which approximates the generating mechanism of chromosomes by a hidden Markov model (HMM) that produces the chromosome of an individual as a mosaic of founder chromosomes. Although this model is widely used, one limiting factor is the need to place bounds on the state space by hypothesizing about the number of founders of a population. This is particularly challenging for large datasets with samples from diverse ancestries. The second challenge in using targeted genomic sequences is that the off-target regions have sparse coverage, so the imputation model needs to account for uncertainty in the base pair calls in the off-target regions. Here, we propose a new method based on the Hidden Markov Dirichlet Process (HMDP) non-parametric model [1].
While the HMDP model helps us circumvent the problem of fixing the state space, our modification to its observation model helps us handle the sparseness and varying coverage of the targeted sequences. The learning algorithm for this model is a modification of the HMDP Gibbs sampler that reflects the changes to the observation mechanism. Inference is based on the standard HMM algorithm to compute the smoothed posterior distribution. We train the model with a dataset of approximately 1,500 targeted BAM files and validate it using downsampled whole genome files, where the task is to impute approximately 75,000 SNPs in a 2.5 Mb region of Chr22. Even though a typical downsampled file has coverage for only a small fraction of SNPs, we get good performance when comparing imputed genotypes with the true genotypes (weighted F1, precision, recall: 0.95, 0.95, 0.96). We compare our method against a recent reference panel-free imputation algorithm, STITCH, and show comparable performance without having to make any fixed state space assumptions (STITCH weighted F1, precision, recall: 0.96, 0.96, 0.97). Overall, our new method enables imputing genomes with targeted sequencing using an unbounded state space.
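The inference step mentioned in the abstract (a standard HMM computation of the smoothed posterior) can be illustrated with a small sketch. This is not the authors' HMDP model or code: it assumes a fixed, known set of haploid founder haplotypes, a Li-Stephens-style transition with a single hypothetical switch probability `switch`, and a simple read-count emission with a hypothetical per-read error rate `eps`. All function and parameter names are illustrative. The point it shows is how a site with zero coverage contributes an uninformative emission, so its imputed allele is driven by neighbouring covered sites.

```python
from typing import List, Tuple


def emission(allele: int, n_ref: int, n_alt: int, eps: float) -> float:
    """Likelihood of the observed read counts given the true allele (0 = ref, 1 = alt).

    With no reads (n_ref == n_alt == 0) this returns 1.0 for either allele,
    so uncovered sites carry no information of their own.
    """
    match, mismatch = (n_ref, n_alt) if allele == 0 else (n_alt, n_ref)
    return (1.0 - eps) ** match * eps ** mismatch


def smoothed_alt_posterior(
    founders: List[List[int]],        # assumed founder haplotypes, founders[k][t] in {0, 1}
    reads: List[Tuple[int, int]],     # per-site (n_ref, n_alt) read counts
    switch: float = 0.05,             # hypothetical per-site switch probability
    eps: float = 0.01,                # hypothetical per-read error rate
) -> List[float]:
    """Forward-backward smoothing: P(alt allele at site t | all reads)."""
    K, T = len(founders), len(reads)

    def trans(i: int, j: int) -> float:
        # Stay on the same founder with prob. (1 - switch), else jump uniformly.
        return (1.0 - switch) * (i == j) + switch / K

    e = [[emission(founders[k][t], reads[t][0], reads[t][1], eps) for k in range(K)]
         for t in range(T)]

    # Forward pass, normalised at every site to avoid numerical underflow.
    alpha = []
    cur = [e[0][k] / K for k in range(K)]
    total = sum(cur)
    alpha.append([c / total for c in cur])
    for t in range(1, T):
        cur = [e[t][j] * sum(alpha[t - 1][i] * trans(i, j) for i in range(K))
               for j in range(K)]
        total = sum(cur)
        alpha.append([c / total for c in cur])

    # Backward pass, also normalised per site.
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [sum(trans(i, j) * e[t + 1][j] * beta[t + 1][j] for j in range(K))
               for i in range(K)]
        total = sum(cur)
        beta[t] = [c / total for c in cur]

    # Combine into state posteriors, then sum over founders carrying the alt allele.
    posterior = []
    for t in range(T):
        gamma = [alpha[t][k] * beta[t][k] for k in range(K)]
        total = sum(gamma)
        posterior.append(sum(gamma[k] / total for k in range(K) if founders[k][t] == 1))
    return posterior
```

For example, with two founders (all-ref and all-alt) and reads supporting the alt allele only at the first and last of four sites, the two uncovered middle sites are imputed as alt with high posterior probability. The sketch fixes the number of founders in advance, which is exactly the assumption the HMDP approach described above is meant to avoid.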
Pages: 1