Merging short and stranded long reads improves transcript assembly

Cited by: 6
Authors
Kainth A.S. [1]
Haddad G.A. [2]
Hall J.M. [1]
Ruthenburg A.J. [1, 2, 3]
Affiliations
[1] Department of Molecular Genetics and Cell Biology, The University of Chicago, Chicago, IL
[2] Committee on Genetics, Genomics and Systems Biology, The University of Chicago, Chicago, IL
[3] Department of Biochemistry and Molecular Biology, The University of Chicago, Chicago, IL
Funding
National Institutes of Health (NIH)
Keywords
Cell culture - Gene expression - Libraries;
DOI
10.1371/journal.pcbi.1011576
Abstract
Long-read RNA sequencing has arisen as a counterpart to short-read sequencing, with the potential to capture full-length isoforms, albeit at the cost of lower depth. Yet this potential is not fully realized due to inherent limitations of current long-read assembly methods and underdeveloped approaches to integrate short-read data. Here, we critically compare the existing methods and develop a new integrative approach to characterize a particularly challenging pool of low-abundance long noncoding RNA (lncRNA) transcripts from short- and long-read sequencing in two distinct cell lines. Our analysis reveals severe limitations in each of the sequencing platforms. For short-read assemblies, coverage declines at transcript termini resulting in ambiguous ends, and uneven low coverage results in segmentation of a single transcript into multiple transcripts. Conversely, long-read sequencing libraries lack depth and strand-of-origin information in cDNA-based methods, culminating in erroneous assembly and quantitation of transcripts. We also discover a cDNA synthesis artifact in long-read datasets that markedly impacts the identity and quantitation of assembled transcripts. Towards remediating these problems, we develop a computational pipeline to “strand” long-read cDNA libraries that rectifies inaccurate mapping and assembly of long-read transcripts. Leveraging the strengths of each platform and our computational stranding, we also present and benchmark a hybrid assembly approach that drastically increases the sensitivity and accuracy of full-length transcript assembly on the correct strand and improves detection of biological features of the transcriptome. When applied to a challenging set of under-annotated and cell-type variable lncRNA, our method resolves the segmentation problem of short-read sequencing and the depth problem of long-read sequencing, resulting in the assembly of coherent transcripts with precise 5’ and 3’ ends.
Our workflow can be applied to existing datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts. © 2023 Kainth et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.