A sensitive repeat identification framework based on short and long reads


Authors:

Liao, Xingyu ^{[1,2]}

Li, Min ^{[1]}

Hu, Kang ^{[1]}

Wu, Fang-Xiang ^{[3,4]}

Gao, Xin ^{[2]}

Wang, Jianxin ^{[1]}

Affiliations:

[1] Cent South Univ, Sch Comp Sci & Engn, Hunan Prov Key Lab Bioinformat, Changsha 410083, Peoples R China

[2] King Abdullah Univ Sci & Technol KAUST, Computat Biosci Res Ctr CBRC, Comp Elect & Math Sci & Engn Div, Thuwal 23955, Saudi Arabia

[3] Univ Saskatchewan, Dept Mech Engn, Saskatoon, SK S7N 5A9, Canada

[4] Univ Saskatchewan, Div Biomed Engn, Saskatoon, SK S7N 5A9, Canada

Source:

NUCLEIC ACIDS RESEARCH | 2021 / Vol. 49 / Issue 17

Funding:

National Natural Science Foundation of China;

Keywords:

TRANSPOSABLE ELEMENTS; REPETITIVE DNA; SINGLE-CELL; GENOME; CLASSIFICATION; SEQUENCES; ASSEMBLER; FAMILIES; PROGRAM; SYSTEM;

DOI:

10.1093/nar/gkab563

Chinese Library Classification number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size, since NGS reads are too short to identify long repeats, whereas SMS (Single Molecule Sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates repeats faster and more accurately; (iii) by using multi-alignment unique k-mers rather than high-frequency k-mers to identify repeats in overlap sequences, it obtains repeats more comprehensively and stably; (iv) by applying a parallel alignment model based on the multi-alignment unique k-mers, the efficiency of data processing is greatly improved; and (v) by applying the corresponding identification strategies, structural variations that occur between repeats can be identified. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
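As a rough illustration of the k-mer idea in point (iii), the sketch below indexes every distinct k-mer of a sequence by its start positions and merges the spans of k-mers that occur at more than one position into candidate repeat intervals. This is a deliberately simplified toy and not LongRepMarker's algorithm: the function names `kmer_positions` and `repeat_intervals`, the exact-match criterion, and the `min_occurrences` parameter are all assumptions made for the example.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each distinct k-mer to the list of start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def repeat_intervals(seq, k, min_occurrences=2):
    """Merge spans of k-mers seen at >= min_occurrences positions (hypothetical helper).

    Returns half-open (start, end) intervals on seq, sorted by start.
    """
    positions = kmer_positions(seq, k)
    # Collect the start of every occurrence of a multiply-occurring k-mer.
    starts = sorted(
        i
        for hits in positions.values()
        if len(hits) >= min_occurrences
        for i in hits
    )
    intervals = []
    for s in starts:
        end = s + k
        if intervals and s <= intervals[-1][1]:
            # Overlapping or touching span: extend the current interval.
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([s, end])
    return [tuple(iv) for iv in intervals]


if __name__ == "__main__":
    toy = "GATTACACCGTGATTACA"  # two copies of GATTACA around a short spacer
    for start, end in repeat_intervals(toy, k=4):
        print(start, end, toy[start:end])
```

Exact matching like this would not tolerate the error rates of SMS long reads, which is one reason the framework described above relies on assembly and alignment rather than raw k-mer lookup.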


Pages: 18
