CIndex: compressed indexes for fast retrieval of FASTQ files

Cited by: 1
Authors
Huo, Hongwei [1]
Liu, Pengfei [1]
Wang, Chenhui [1]
Jiang, Hongbo [1]
Vitter, Jeffrey Scott [2]
Affiliations
[1] Xidian Univ, Dept Comp Sci, Xian 710071, Peoples R China
[2] Tulane Univ, Dept Comp Sci, New Orleans, LA 70118 USA
Funding
National Natural Science Foundation of China
Keywords
LOSSY COMPRESSION; READ ALIGNMENT; SUFFIX ARRAYS; ALGORITHMS; SEQUENCES; LOSSLESS;
DOI
10.1093/bioinformatics/btab655
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, which are generally stored in FASTQ format. Two goals must be met simultaneously: space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.

Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables R-EF and R-gamma, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments on real, publicly available datasets from various sequencing instruments demonstrate that the proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
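The abstract names the Burrows-Wheeler transform (BWT) as a building block and lists count among the supported queries. As a rough illustration of how a count query can work on a BWT in general, the sketch below runs the standard backward search over an uncompressed BWT. It is not CIndex's implementation: the function names `build_bwt` and `count` are hypothetical, the suffix sorting is naive, and the `rank` helper scans the string directly, where a practical index would answer rank queries with a wavelet tree or another succinct structure.

```python
def build_bwt(text):
    """Return the BWT of text + '$' using naive suffix sorting (illustration only)."""
    s = text + "$"  # '$' is assumed not to occur in text and sorts before A, C, G, T
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for a suffix is the character preceding it (cyclically).
    return "".join(s[i - 1] for i in suffix_array)


def count(bwt, pattern):
    """Count occurrences of pattern in the indexed text via backward search."""
    # C[c] = number of characters in the BWT that are strictly smaller than c.
    freq = {}
    for ch in bwt:
        freq[ch] = freq.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    def rank(c, i):
        # Occurrences of c in bwt[:i]; a linear scan stands in for a wavelet tree.
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open range of suffixes matching the processed part
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = build_bwt("ACGTACGT")
    print(count(bwt, "ACG"), count(bwt, "GG"))
```

Each pattern character narrows the suffix range with two rank queries, so the query cost depends on the pattern length and the rank structure rather than on a scan of the text.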
Pages: 335-343
Page count: 9