HALC: High throughput algorithm for long read error correction

Cited by: 37
Authors
Bao, Ergude [1, 2]
Lan, Lingxiao [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Software Engn, 3 Shangyuan Residence, Beijing 100044, Peoples R China
[2] Univ Calif Riverside, Dept Bot & Plant Sci, 900 Univ Ave, Riverside, CA 92521 USA
Source
BMC Bioinformatics | 2017 / Vol. 18
Funding
US National Science Foundation
Keywords
PacBio long reads; Error correction; Throughput; MOLECULE SEQUENCING READS; BASIC LOCAL ALIGNMENT; RNA-SEQ DATA; GENOME ASSEMBLIES; TOOL; ACCURATE;
DOI
10.1186/s12859-017-1610-3
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.

Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.

Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
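The long read support based validation idea in the abstract can be illustrated with a toy sketch. This is not HALC's implementation: the function name `pick_supported_alignments`, the input layout (each read mapped to a non-empty list of candidate contig paths), and the scoring rule (count how many other reads use the same adjacent contig pairs) are all assumptions made for illustration only.

```python
def pick_supported_alignments(candidates):
    """Toy illustration (hypothetical, not HALC's code).

    candidates: dict mapping read_id -> non-empty list of candidate
    alignments, each a tuple of contig ids in the order the read crosses them.
    Returns dict read_id -> the candidate whose contig-to-contig edges
    are shared by the most *other* reads.
    """
    # Record which reads have at least one candidate using each edge,
    # where an edge is a pair of adjacent contigs in a candidate path.
    edge_reads = {}
    for read_id, paths in candidates.items():
        for path in paths:
            for edge in zip(path, path[1:]):
                edge_reads.setdefault(edge, set()).add(read_id)

    chosen = {}
    for read_id, paths in candidates.items():
        def support(path, read_id=read_id):
            # Sum, over the path's edges, the number of other reads on that edge.
            return sum(len(edge_reads[e] - {read_id})
                       for e in zip(path, path[1:]))
        # On ties, max() keeps the first candidate listed.
        chosen[read_id] = max(paths, key=support)
    return chosen


example = {
    "r1": [("A", "C"), ("A", "B")],  # ambiguous: a repeat makes C look plausible
    "r2": [("A", "B")],
    "r3": [("A", "B")],
}
result = pick_supported_alignments(example)
```

In this example, read `r1` has two candidate placements, and the A-to-B path is selected because two other reads also traverse that edge, while no other read supports A-to-C.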
Pages: 12