CIndex: compressed indexes for fast retrieval of FASTQ files

Cited by: 1
Authors
Huo, Hongwei [1]
Liu, Pengfei [1]
Wang, Chenhui [1]
Jiang, Hongbo [1]
Vitter, Jeffrey Scott [2]
Affiliations
[1] Xidian Univ, Dept Comp Sci, Xian 710071, Peoples R China
[2] Tulane Univ, Dept Comp Sci, New Orleans, LA 70118 USA
Funding
National Natural Science Foundation of China
Keywords
LOSSY COMPRESSION; READ ALIGNMENT; SUFFIX ARRAYS; ALGORITHMS; SEQUENCES; LOSSLESS;
DOI
10.1093/bioinformatics/btab655
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, which are generally stored in FASTQ format. Two goals must be met simultaneously: space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files.
Results: We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables R-EF and R-gamma, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted on real, publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.
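To make the count query named in the abstract concrete, the following is a minimal Python sketch of counting pattern occurrences by backward search over a Burrows-Wheeler transform. It is not the CIndex implementation: the function names `bwt`, `build_tables` and `count` are hypothetical, the transform is built by naive rotation sorting, and plain prefix-count tables stand in for the wavelet tree, hybrid encoding and R-EF / R-gamma tables described above. It also assumes the text does not contain the sentinel character `$`.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic, demo only)."""
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_tables(bwt_str: str):
    """Build the two tables used by backward search.

    C[c]      = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    (Assumption: uncompressed tables replace the paper's wavelet tree.)
    """
    freq = {}
    for ch in bwt_str:
        freq[ch] = freq.get(ch, 0) + 1

    C, total = {}, 0
    for ch in sorted(freq):
        C[ch] = total
        total += freq[ch]

    occ = {ch: [0] * (len(bwt_str) + 1) for ch in freq}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count(pattern: str, C: dict, occ: dict, n: int) -> int:
    """Count occurrences of pattern using backward search.

    [lo, hi) is the half-open range of sorted rotations that start
    with the suffix of the pattern processed so far.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    read = "ACGTACGTTACG"  # hypothetical read sequence
    transformed = bwt(read)
    C, occ = build_tables(transformed)
    for pat in ("ACG", "CGT", "GG"):
        print(pat, count(pat, C, occ, len(transformed)))
```

Each pattern character costs two table lookups, so the query time depends on the pattern length rather than the text length. A compressed index keeps that property while replacing the `occ` table, which here takes space proportional to alphabet size times text length, with a succinct rank structure.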
Pages: 335-343
Page count: 9