CIndex: compressed indexes for fast retrieval of FASTQ files

Authors
Huo, Hongwei [1]
Liu, Pengfei [1]
Wang, Chenhui [1]
Jiang, Hongbo [1]
Vitter, Jeffrey Scott [2]
Affiliations
[1] Xidian Univ, Dept Comp Sci, Xian 710071, Peoples R China
[2] Tulane Univ, Dept Comp Sci, New Orleans, LA 70118 USA
Funding
National Natural Science Foundation of China;
Keywords
LOSSY COMPRESSION; READ ALIGNMENT; SUFFIX ARRAYS; ALGORITHMS; SEQUENCES; LOSSLESS;
DOI
10.1093/bioinformatics/btab655
Chinese Library Classification
Q5 [Biochemistry];
Subject classification codes
071010 ; 081704 ;
Abstract
Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, which are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.
Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and the tables R-EF and R-gamma, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments on real, publicly available datasets from various sequencing instruments demonstrate that the proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7 to 41.66 percentage points less space and provides speedups of 70 to 167.16 times, 1.44 to 35.57 times and 1.3 to 55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86 to 14.88 percentage points less space and provides a speedup of 3.13 to 20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
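The abstract names the Burrows-Wheeler transform as the basis for count queries on reads. As a rough illustration of that general idea only, the following is a minimal sketch of a textbook BWT with backward search. It is not the CIndex implementation: the function names `bwt`, `build_tables` and `count` are hypothetical, the occurrence table is stored uncompressed (a real index would answer the same rank queries with a wavelet tree), and hybrid encoding and the R-EF and R-gamma tables are not modeled.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, with '$' as the end marker.

    Assumes '$' does not occur in text and sorts before every other symbol.
    Sorting all rotations is quadratic and only suitable for tiny inputs.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_tables(bwt_str):
    """Build the two tables that backward search needs.

    C[c]      = number of symbols in the text that are smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def count(pattern, C, occ, n):
    """Count occurrences of pattern via backward search.

    n is the length of the BWT string (text length plus one for '$').
    The half-open range [lo, hi) covers the sorted rotations that start
    with the suffix of pattern processed so far.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    read = "ACGTACGTAC"  # made-up example read
    transformed = bwt(read)
    C, occ = build_tables(transformed)
    print(count("AC", C, occ, len(transformed)))
```

Each step of the search costs two rank lookups, so the query time depends on the pattern length rather than on the text length; this is the property that compressed indexes of this family build on.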
Pages: 335-343
Page count: 9