A sensitive repeat identification framework based on short and long reads

Authors
Liao, Xingyu [1,2]
Li, Min [1]
Hu, Kang [1]
Wu, Fang-Xiang [3,4]
Gao, Xin [2]
Wang, Jianxin [1]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Hunan Prov Key Lab Bioinformat, Changsha 410083, Peoples R China
[2] King Abdullah Univ Sci & Technol KAUST, Computat Biosci Res Ctr CBRC, Comp Elect & Math Sci & Engn Div, Thuwal 23955, Saudi Arabia
[3] Univ Saskatchewan, Dept Mech Engn, Saskatoon, SK S7N 5A9, Canada
[4] Univ Saskatchewan, Div Biomed Engn, Saskatoon, SK S7N 5A9, Canada
Funding
National Natural Science Foundation of China;
Keywords
TRANSPOSABLE ELEMENTS; REPETITIVE DNA; SINGLE-CELL; GENOME; CLASSIFICATION; SEQUENCES; ASSEMBLER; FAMILIES; PROGRAM; SYSTEM;
DOI
10.1093/nar/gkab563
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size: NGS reads are too short to resolve long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment, for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates repeats faster and more accurately; (iii) by using multi-alignment unique k-mers rather than high-frequency k-mers to identify repeats in overlap sequences, it obtains repeats more comprehensively and stably; (iv) by applying a parallel alignment model based on the multi-alignment unique k-mers, it greatly improves the efficiency of data processing; and (v) by applying corresponding identification strategies, it can identify structural variations that occur between repeats. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
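As a rough illustration of the idea in point (iii), the Python sketch below marks regions of a sequence that are covered by distinct k-mers occurring at more than one position. It is a simplified stand-in under stated assumptions, not LongRepMarker's implementation: it uses exact matching within a single sequence instead of alignment between overlap sequences, and the function names and the `k` and `min_hits` parameters are hypothetical.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each distinct k-mer to the start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def mark_repeats(seq, k, min_hits=2):
    """Return merged half-open intervals covered by k-mers seen >= min_hits times.

    Assumption: exact k-mer matches stand in for the alignment step
    described in the abstract.
    """
    covered = []
    for starts in kmer_positions(seq, k).values():
        if len(starts) >= min_hits:
            covered.extend((s, s + k) for s in starts)

    # Merge overlapping or touching intervals into candidate repeat regions.
    covered.sort()
    merged = []
    for start, end in covered:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # "ACGTA" appears twice, separated by "GGG".
    print(mark_repeats("ACGTAGGGACGTA", k=4))
```

In this toy input the two copies of `ACGTA` are reported as separate intervals; a real pipeline would also have to tolerate sequencing errors, which exact matching does not.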
Pages: 18