HALC: High throughput algorithm for long read error correction

Cited by: 37
Authors
Bao, Ergude [1, 2]
Lan, Lingxiao [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Software Engn, 3 Shangyuan Residence, Beijing 100044, Peoples R China
[2] Univ Calif Riverside, Dept Bot & Plant Sci, 900 Univ Ave, Riverside, CA 92521 USA
Source
BMC BIOINFORMATICS | 2017 / Volume 18
Funding
National Science Foundation (USA);
Keywords
PacBio long reads; Error correction; Throughput; MOLECULE SEQUENCING READS; BASIC LOCAL ALIGNMENT; RNA-SEQ DATA; GENOME ASSEMBLIES; TOOL; ACCURATE;
DOI
10.1186/s12859-017-1610-3
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.
Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.
Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
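Illustrative sketch: the long read support based validation idea described in the abstract can be pictured with a toy example. The code below is not HALC's implementation; it is a minimal sketch under stated assumptions. Each long read is assumed to be pre-split into segments, each segment carrying a list of candidate contig regions from a permissive alignment step. The names `edge_support` and `best_path`, the region labels, and the scoring rule (counting how often other reads place adjacent segments on the same pair of contig regions) are all hypothetical simplifications.

```python
from collections import Counter
from itertools import product


def edge_support(reads, exclude=None):
    """Count, over all reads except `exclude`, how often a pair of contig
    regions is hit by two adjacent segments of the same read."""
    support = Counter()
    for read_id, segments in reads.items():
        if read_id == exclude:
            continue
        for left, right in zip(segments, segments[1:]):
            for a, b in product(left, right):
                support[(a, b)] += 1
    return support


def best_path(segments, support):
    """Pick one candidate contig region per segment so that the summed
    support of consecutive pairs is maximal (Viterbi-style dynamic
    programming). Assumes every segment has at least one candidate."""
    scores = {c: 0 for c in segments[0]}
    back_pointers = []
    for segment in segments[1:]:
        new_scores, pointers = {}, {}
        for c in segment:
            prev = max(scores, key=lambda p: scores[p] + support[(p, c)])
            new_scores[c] = scores[prev] + support[(prev, c)]
            pointers[c] = prev
        back_pointers.append(pointers)
        scores = new_scores
    # Trace back from the best-scoring final candidate.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical data: read r3 has ambiguous candidates (a repeat copy
# "c1_rep" and an unrelated region "c9"); reads r1 and r2 are unambiguous.
reads = {
    "r1": [["c1"], ["c2"]],
    "r2": [["c1"], ["c2"]],
    "r3": [["c1", "c1_rep"], ["c2", "c9"]],
}
chosen = best_path(reads["r3"], edge_support(reads, exclude="r3"))
print(chosen)
```

In this toy data the other two reads both connect `c1` to `c2`, so the ambiguous read is resolved to that pair rather than to the repeat copy. A real corrector would also have to weigh alignment identity, coordinates, and orientation, which this sketch leaves out.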
Pages: 12