HALC: High throughput algorithm for long read error correction

Cited by: 37
Authors
Bao, Ergude [1, 2]
Lan, Lingxiao [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Software Engn, 3 Shangyuan Residence, Beijing 100044, Peoples R China
[2] Univ Calif Riverside, Dept Bot & Plant Sci, 900 Univ Ave, Riverside, CA 92521 USA
Source
BMC Bioinformatics | 2017 / Vol. 18
Funding
National Science Foundation (USA)
Keywords
PacBio long reads; Error correction; Throughput; MOLECULE SEQUENCING READS; BASIC LOCAL ALIGNMENT; RNA-SEQ DATA; GENOME ASSEMBLIES; TOOL; ACCURATE;
DOI
10.1186/s12859-017-1610-3
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: The third generation PacBio SMRT long reads can effectively address the read length issue of the second generation sequencing technology, but contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus lead to low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.
Results: Here, we introduce HALC, a high throughput algorithm for long read error correction. HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement so that a long read region can be aligned to at least one contig region, including its true genome region's repeats in the contigs sufficiently similar to it (similar repeat based alignment approach). It then constructs a contig graph and, for each long read, references the other long reads' alignments to find the most accurate alignment and correct it with the aligned contig regions (long read support based validation approach). Even though some long read regions without the true genome regions in the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC was able to obtain 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.
Conclusions: The HALC software can be downloaded for free from this site: https://github.com/lanl001/halc.
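The long read support based validation described in the abstract can be illustrated with a small Python sketch. This is not the HALC implementation: the abstract does not specify the graph's edge weights or the search procedure, so the sketch assumes that each long read is split into ordered regions, that each region has at least one candidate contig (possibly a repeat), and that an edge between two contigs is weighted by the number of long reads whose consecutive regions align to that contig pair. The function names `build_edge_support` and `best_path` and the toy read and contig labels are hypothetical.

```python
from collections import Counter
from itertools import product


def build_edge_support(read_alignments):
    """Count, for each ordered contig pair, how many times long reads
    align to that pair in consecutive regions (assumed edge weight)."""
    support = Counter()
    for regions in read_alignments.values():
        for left, right in zip(regions, regions[1:]):
            for a, b in product(left, right):
                support[(a, b)] += 1
    return support


def best_path(regions, support):
    """Pick one candidate contig per region so that the summed edge
    support along the read is maximal (Viterbi-style dynamic program).
    Assumes every region has at least one candidate contig."""
    if not regions:
        return []
    score = {c: 0 for c in regions[0]}  # best score of a path ending at c
    back = []                           # back pointers, one dict per step
    for region in regions[1:]:
        new_score, pointers = {}, {}
        for c in region:
            prev = max(score, key=lambda p: score[p] + support[(p, c)])
            new_score[c] = score[prev] + support[(prev, c)]
            pointers[c] = prev
        back.append(pointers)
        score = new_score
    path = [max(score, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data: read r3 has ambiguous candidates (true region vs. repeat).
alignments = {
    "r1": [["c1"], ["c2"]],
    "r2": [["c1"], ["c2"]],
    "r3": [["c1", "c1_rep"], ["c2"], ["c3", "c3_rep"]],
    "r4": [["c2"], ["c3"]],
}
support = build_edge_support(alignments)
chosen = best_path(alignments["r3"], support)  # ['c1', 'c2', 'c3']
```

In this toy case the repeat candidates `c1_rep` and `c3_rep` lose because fewer reads support the edges through them. For simplicity the sketch counts the read's own alignments together with those of the other reads, which affects all of its candidates equally.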
Pages: 12