CIndex: compressed indexes for fast retrieval of FASTQ files

Cited by: 1
Authors
Huo, Hongwei [1]
Liu, Pengfei [1]
Wang, Chenhui [1]
Jiang, Hongbo [1]
Vitter, Jeffrey Scott [2]
Affiliations
[1] Xidian Univ, Dept Comp Sci, Xian 710071, Peoples R China
[2] Tulane Univ, Dept Comp Sci, New Orleans, LA 70118 USA
Funding
National Natural Science Foundation of China
Keywords
LOSSY COMPRESSION; READ ALIGNMENT; SUFFIX ARRAYS; ALGORITHMS; SEQUENCES; LOSSLESS;
DOI
10.1093/bioinformatics/btab655
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, which are generally stored in FASTQ format. Two goals must be met at the same time: space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.

Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and the tables R-EF and R-gamma, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments on real, publicly available datasets from various sequencing instruments demonstrate that the proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
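The count and locate queries named in the abstract can be illustrated with a textbook Burrows-Wheeler backward search. The sketch below is not the CIndex implementation: it assumes a single short DNA string instead of a FASTQ file, builds the suffix array by naive sorting, and stores plain Python occurrence tables where CIndex uses a wavelet tree, hybrid encoding and its R-EF and R-gamma tables. All function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def bwt_with_suffix_array(text):
    """Return (bwt, suffix_array) of text + '$' using naive suffix sorting."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T, N
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # s[-1] is '$' when i == 0
    return bwt, sa


def build_tables(bwt):
    """Build C (symbols smaller than ch) and Occ (prefix counts per symbol)."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    # (an uncompressed stand-in for wavelet-tree rank queries)
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix-array range [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    """Number of occurrences of pattern in the indexed text."""
    _, sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return hi - lo


def locate(pattern, index):
    """Sorted starting positions of pattern in the indexed text."""
    _, sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return sorted(sa[lo:hi])


def build_index(text):
    """Bundle the BWT, suffix array and rank tables into one tuple."""
    bwt, sa = bwt_with_suffix_array(text)
    c_table, occ = build_tables(bwt)
    return bwt, sa, c_table, occ


index = build_index("ACGTACGTTACG")
print(count("ACG", index))
print(locate("ACG", index))
```

A count query needs only the C and Occ tables, while locate also needs the suffix array (or a sample of it). That difference is one reason locate is typically the slower operation in indexes of this family.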
Pages: 335-343
Page count: 9