A sensitive repeat identification framework based on short and long reads

Cited by: 0
Authors
Liao, Xingyu [1 ,2 ]
Li, Min [1 ]
Hu, Kang [1 ]
Wu, Fang-Xiang [3 ,4 ]
Gao, Xin [2 ]
Wang, Jianxin [1 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Hunan Prov Key Lab Bioinformat, Changsha 410083, Peoples R China
[2] King Abdullah Univ Sci & Technol KAUST, Computat Biosci Res Ctr CBRC, Comp Elect & Math Sci & Engn Div, Thuwal 23955, Saudi Arabia
[3] Univ Saskatchewan, Dept Mech Engn, Saskatoon, SK S7N 5A9, Canada
[4] Univ Saskatchewan, Div Biomed Engn, Saskatoon, SK S7N 5A9, Canada
Funding
National Natural Science Foundation of China;
Keywords
TRANSPOSABLE ELEMENTS; REPETITIVE DNA; SINGLE-CELL; GENOME; CLASSIFICATION; SEQUENCES; ASSEMBLER; FAMILIES; PROGRAM; SYSTEM;
DOI
10.1093/nar/gkab563
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot achieve satisfactory performance in identifying repeats in terms of both accuracy and size, since NGS reads are too short to identify long repeats whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify the repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates the repeats faster and more accurately; (iii) by using the multi-alignment unique k-mers rather than the high frequency k-mers to identify repeats in overlap sequences, it can obtain the repeats more comprehensively and stably; (iv) by applying the parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing can be greatly improved; and (v) by applying corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker can achieve more satisfactory results than the existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
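For readers unfamiliar with k-mer based repeat marking, the following is a minimal sketch of the generic high-frequency k-mer idea that the abstract contrasts with. It is not LongRepMarker's multi-alignment unique k-mer method, whose details the abstract does not give. The function names, the exact-match counting, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer to the 0-based start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def repeat_intervals(seq, k, min_count=2):
    """Return merged half-open intervals [start, end) covered by k-mers
    that occur at least min_count times (hypothetical threshold)."""
    starts = sorted(
        i
        for hits in kmer_positions(seq, k).values()
        if len(hits) >= min_count
        for i in hits
    )
    intervals = []
    for s in starts:
        if intervals and s <= intervals[-1][1]:
            # Overlapping or adjacent k-mer: extend the current interval.
            intervals[-1][1] = max(intervals[-1][1], s + k)
        else:
            intervals.append([s, s + k])
    return [tuple(iv) for iv in intervals]


if __name__ == "__main__":
    # Toy sequence with the 5-mer "ACGTA" at both ends.
    print(repeat_intervals("ACGTAGGCCACGTA", 4))
```

Exact k-mer counting of this kind is sensitive to sequencing errors and to the chosen threshold, which may be part of why the abstract describes a different k-mer selection strategy.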
Pages: 18