Compressed indexes for text with wildcards

Cited by: 3
Author
Thachuk, Chris [1]
Affiliation
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z4, Canada
Keywords
Compressed indexes; Text search; Wildcard matching; Dictionary matching; SEARCH;
DOI
10.1016/j.tcs.2012.08.011
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity while maintaining the query time complexity of previous approaches by giving a compressed index requiring $2nH_k(T) + o(n \log \sigma) + O(n + d \log n)$ bits for a text $T$ of length $n$ over an alphabet of size $\sigma$ containing $d$ groups of wildcards. The new index is particularly favorable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying $O(n) + O(d \log(n/d))$ bits to also support efficient dictionary matching queries. We discuss how the space can be reduced further by a number of approaches and by allowing an increase in the worst-case query time. We also present a new query algorithm for our wildcard indexes that can greatly reduce the query working space. (C) 2012 Elsevier B.V. All rights reserved.
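To make the problem statement concrete, the following is a minimal sketch of what it means for a pattern to match a text that contains wildcard positions. It is a naive quadratic-time scan, not the compressed index described in the abstract. The wildcard symbol `*`, the function names, and the reading of a "group" as a maximal run of consecutive wildcards are assumptions made for illustration only.

```python
WILDCARD = "*"  # assumed symbol marking a wildcard (e.g. SNP) position in the text


def naive_wildcard_search(text: str, pattern: str) -> list[int]:
    """Return every start position where the pattern aligns with the text.

    A text character matches a pattern character if the two are equal
    or if the text character is the wildcard symbol.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


def count_wildcard_groups(text: str) -> int:
    """Count maximal runs of consecutive wildcards (assumed meaning of d)."""
    groups = 0
    previous_was_wildcard = False
    for ch in text:
        is_wildcard = ch == WILDCARD
        if is_wildcard and not previous_was_wildcard:
            groups += 1
        previous_was_wildcard = is_wildcard
    return groups


if __name__ == "__main__":
    genome = "AC*GT**ACG"  # hypothetical text with three wildcard positions
    print(naive_wildcard_search(genome, "CAG"))
    print(naive_wildcard_search(genome, "TAGA"))
    print(count_wildcard_groups(genome))
```

The scan inspects up to m characters at each of the n text positions, which is the kind of cost an index is meant to avoid on genome-scale texts.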
Pages: 22-35
Number of pages: 14