Reference-Free Imputation of Targeted Next-Generation Sequence Datasets

Authors
Nampally, Arun [1]
Kim, Joseph [1]
Proffitt, Eric
Palovcak, Eugene [2]
Lacoste, Alix [1]
Affiliations
[1] Invitae Corp, San Francisco, CA 94103 USA
[2] Generate Biomed, Somerville, MA USA
Keywords
Imputation; next-generation sequencing; targeted sequencing; hidden Markov model; Bayesian non-parametrics
DOI
10.1145/3584371.3613047
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Due to the mainstream adoption of clinical genetic testing, labs routinely have access to millions of genomic sequences that were produced as a result of ordered tests. These sequences are primarily used for reporting on specific genes and disease conditions, yet they contain valuable population-scale genetic information that remains underutilized. The distribution of haplotypes in the population is one such piece of information: it can be gleaned from sequence datasets and used to power downstream applications such as association studies. While large datasets of genomic sequences can be informative about haplotype distributions, the type of sequencing performed can confound the reconstruction of haplotypes. Specifically, phased whole genomes are better suited to this purpose than unphased targeted sequences, but the latter are more abundant. In our work, we address the specific challenges that arise when large datasets of targeted sequencing are used to recover haplotypes. Leveraging targeted genomic sequences for genome-wide association studies requires the variants in the non-targeted regions to be imputed. This is commonly done using a variant of the Li-Stephens recombination model, which approximates the generating mechanism of chromosomes by a hidden Markov model (HMM) that produces the chromosome of an individual as a mosaic of founder chromosomes. Although this approach is widely used, one limiting factor is the need to bound the state space by hypothesizing about the number of founders of a population. This is particularly challenging for large datasets with samples from diverse ancestries. The second challenge in using targeted genomic sequences is that the off-target regions have sparse coverage, so the imputation model needs to account for uncertainty in the base pair calls in those regions. Here, we propose a new method based on the Hidden Markov Dirichlet Process (HMDP) non-parametric model [1].
While the HMDP model helps us circumvent the problem of fixing the state space, our modification to its observation model helps us handle the sparseness and varying coverage of the targeted sequences. The learning algorithm for this model is a modification of the HMDP Gibbs sampler that reflects the changes to the observation mechanism. Inference is based on the standard HMM algorithm for computing the smoothed posterior distribution. We train the model with a dataset of approximately 1,500 targeted BAM files and validate it using downsampled whole-genome files, where the task is to impute approximately 75,000 SNPs in a 2.5 Mb region of Chr22. Even though a typical downsampled file has coverage for only a small fraction of SNPs, we get good performance when comparing imputed genotypes with the true genotypes (weighted F1, precision, recall: 0.95, 0.95, 0.96). We compare our method against a recent reference panel-free imputation algorithm, STITCH, and show comparable performance without having to make any fixed state space assumptions (STITCH weighted F1, precision, recall: 0.96, 0.96, 0.97). Overall, our new method enables imputing genomes from targeted sequencing using an unbounded state space.
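The abstract says inference uses the standard HMM algorithm for smoothed posteriors. The following is a minimal sketch of that generic step (forward-backward smoothing) on a toy Li-Stephens-style mosaic model, and it is not the authors' implementation. It assumes a small fixed set of founder haplotypes rather than the unbounded HMDP state space, haploid biallelic data, and a simple per-site error rate instead of the paper's modified observation model. The names `smoothed_posteriors`, `impute_alt_prob`, `switch`, and `err` are hypothetical and chosen only for illustration. Sites with no coverage are encoded as `None` and contribute a likelihood of 1 for every state.

```python
from typing import List, Optional, Sequence


def smoothed_posteriors(
    founders: Sequence[Sequence[int]],
    obs: Sequence[Optional[int]],
    switch: float = 0.1,
    err: float = 0.01,
) -> List[List[float]]:
    """Return gamma[t][k] = P(founder k is copied at site t | all observations).

    founders[k][t] is the allele (0 or 1) of founder k at site t.
    obs[t] is the observed allele, or None when the site has no coverage.
    """
    n_states, n_sites = len(founders), len(obs)

    def trans(j: int, k: int) -> float:
        # With probability `switch`, jump to a uniformly chosen founder.
        return switch / n_states + (1.0 - switch if j == k else 0.0)

    def lik(t: int, k: int) -> float:
        # An uncovered site carries no information about the hidden state.
        if obs[t] is None:
            return 1.0
        return 1.0 - err if founders[k][t] == obs[t] else err

    # Forward pass, normalised at each site to avoid numerical underflow.
    alpha: List[List[float]] = []
    for t in range(n_sites):
        row = []
        for k in range(n_states):
            if t == 0:
                prior = 1.0 / n_states
            else:
                prior = sum(alpha[t - 1][j] * trans(j, k) for j in range(n_states))
            row.append(prior * lik(t, k))
        norm = sum(row)
        alpha.append([v / norm for v in row])

    # Backward pass, also normalised at each site.
    beta = [[1.0] * n_states for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        row = [
            sum(trans(k, j) * lik(t + 1, j) * beta[t + 1][j] for j in range(n_states))
            for k in range(n_states)
        ]
        norm = sum(row)
        beta[t] = [v / norm for v in row]

    # Combine both passes into the smoothed posterior.
    gamma = []
    for t in range(n_sites):
        row = [alpha[t][k] * beta[t][k] for k in range(n_states)]
        norm = sum(row)
        gamma.append([v / norm for v in row])
    return gamma


def impute_alt_prob(
    founders: Sequence[Sequence[int]],
    gamma: Sequence[Sequence[float]],
    t: int,
    err: float = 0.01,
) -> float:
    """Probability of allele 1 at an uncovered site t under the smoothed posterior."""
    return sum(
        g * (1.0 - err if founders[k][t] == 1 else err)
        for k, g in enumerate(gamma[t])
    )


# Toy usage: two founders, five sites, coverage at sites 0, 2 and 4 only.
toy_founders = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
toy_obs = [1, None, 1, None, 1]
toy_gamma = smoothed_posteriors(toy_founders, toy_obs)
toy_prob = impute_alt_prob(toy_founders, toy_gamma, 1)
```

In this toy case the covered sites all match the second founder, so the smoothed posterior at the uncovered sites concentrates on that founder and the imputed allele follows it. The method in the abstract differs in that the number of founders is not fixed in advance and the observation model works with uncertain, low-coverage base calls.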
Pages: 1