HALC: High throughput algorithm for long read error correction

Cited by: 37

Authors:

Bao, Ergude [1, 2]

Lan, Lingxiao [1]

Affiliations:

[1] Beijing Jiaotong Univ, Sch Software Engn, 3 Shangyuan Residence, Beijing 100044, Peoples R China

[2] Univ Calif Riverside, Dept Bot & Plant Sci, 900 Univ Ave, Riverside, CA 92521 USA

Source:

BMC BIOINFORMATICS | 2017 / Volume 18

Funding:

US National Science Foundation;

Keywords:

PacBio long reads; Error correction; Throughput; MOLECULE SEQUENCING READS; BASIC LOCAL ALIGNMENT; RNA-SEQ DATA; GENOME ASSEMBLIES; TOOL; ACCURATE;

DOI:

10.1186/s12859-017-1610-3

Chinese Library Classification:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.

Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.

Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
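
As a rough illustration of the "long read support based validation" idea mentioned in the abstract, the toy sketch below picks, for each long read, the candidate contig path whose contig-to-contig links are proposed by the most other long reads. The function name, the data layout (paths as tuples of contig ids) and the scoring by simple edge counts are all assumptions made for illustration; this is not HALC's actual implementation, which the abstract does not specify at this level of detail.

```python
from collections import Counter


def pick_supported_alignments(candidates):
    """Toy sketch (hypothetical, not HALC's code).

    candidates: dict mapping read id -> list of candidate paths,
                each path a tuple of contig ids in alignment order.
    Returns a dict mapping read id -> the candidate path whose
    contig-graph edges are supported by the most other reads.
    """
    # Count, for every contig-graph edge, how many distinct reads propose it.
    edge_support = Counter()
    for paths in candidates.values():
        edges = {edge for path in paths for edge in zip(path, path[1:])}
        edge_support.update(edges)

    chosen = {}
    for read_id, paths in candidates.items():
        def score(path):
            # Subtract 1 per edge so a read does not count as its own support.
            return sum(edge_support[edge] - 1 for edge in zip(path, path[1:]))

        # Ties resolve to the first candidate listed.
        chosen[read_id] = max(paths, key=score)
    return chosen


if __name__ == "__main__":
    # Made-up example: r1 and r3 each have an alternative (e.g. repeat) path.
    example = {
        "r1": [("c1", "c2"), ("c1", "c9")],
        "r2": [("c1", "c2")],
        "r3": [("c1", "c2"), ("c4", "c5")],
    }
    print(pick_supported_alignments(example))
```

In this made-up example all three reads end up on the path ("c1", "c2"), because that link is the only one proposed by more than one read.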

Pages: 12
