CIndex: compressed indexes for fast retrieval of FASTQ files

Cited by: 1

Authors:

Huo, Hongwei ^{[1]}

Liu, Pengfei ^{[1]}

Wang, Chenhui ^{[1]}

Jiang, Hongbo ^{[1]}

Vitter, Jeffrey Scott ^{[2]}

Affiliations:

[1] Xidian Univ, Dept Comp Sci, Xian 710071, Peoples R China

[2] Tulane Univ, Dept Comp Sci, New Orleans, LA 70118 USA

Source:

BIOINFORMATICS | 2022 / Vol. 38 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

LOSSY COMPRESSION; READ ALIGNMENT; SUFFIX ARRAYS; ALGORITHMS; SEQUENCES; LOSSLESS;

DOI:

10.1093/bioinformatics/btab655

Chinese Library Classification:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.

Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables R-EF and R-gamma, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
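To give a rough idea of what a count query over a Burrows-Wheeler transform involves, here is a minimal Python sketch of classic backward search. It is not the CIndex implementation: the function names (`bwt`, `build_tables`, `count`) are hypothetical, the rank table is a plain array where a real index would typically use a wavelet tree, and none of the hybrid encoding or the R-EF and R-gamma tables from the abstract is modeled. It also assumes the text does not contain the `$` end marker.

```python
def bwt(text):
    """Burrows-Wheeler transform of text, with '$' appended as end marker.

    Naive construction by sorting all rotations; fine for a short demo only.
    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_tables(bwt_str):
    """Build the two tables used by backward search.

    C[c]      = number of symbols in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    freq = {}
    for ch in bwt_str:
        freq[ch] = freq.get(ch, 0) + 1

    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    occ = {ch: [0] * (len(bwt_str) + 1) for ch in freq}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count(pattern, C, occ, n):
    """Number of occurrences of pattern in the indexed text.

    Keeps a half-open interval [lo, hi) of suffix-array rows and narrows it
    while scanning the pattern from its last symbol to its first.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


reads = "ACGTACGTTACG"  # made-up sequence for illustration
transformed = bwt(reads)
C, occ = build_tables(transformed)
print(count("ACG", C, occ, len(transformed)))
```

The full rank table above costs space proportional to the text length times the alphabet size; compressed indexes replace it with succinct rank structures so that the same narrowing step runs on far less memory.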

Pages: 335-343

Page count: 9

