Compressed indexes for text with wildcards

Author
Thachuk, Chris [1]
Affiliation
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z4, Canada
Keywords
Compressed indexes; Text search; Wildcard matching; Dictionary matching; Search
DOI
10.1016/j.tcs.2012.08.011
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity while maintaining the query time complexity of previous approaches by giving a compressed index requiring $2nH_k(T) + o(n \log \sigma) + O(n + d \log n)$ bits for a text $T$ of length $n$ over an alphabet of size $\sigma$ containing $d$ groups of wildcards. The new index is particularly favorable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying $O(n) + O(d \log (n/d))$ bits to also support efficient dictionary matching queries. We discuss how the space can be reduced further by a number of approaches and by allowing an increase in the worst-case query time. We also present a new query algorithm for our wildcard indexes that can greatly reduce the query working space. © 2012 Elsevier B.V. All rights reserved.
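To make the problem statement concrete, the following is a minimal sketch of what a query against a text with wildcard positions has to return. It is a naive linear scan for illustration only, not the compressed index described in the abstract. The wildcard symbol `*`, the function names, and the reading of "groups of wildcards" as maximal runs of consecutive wildcards are assumptions made for this example.

```python
WILDCARD = "*"  # assumed marker for a wildcard (e.g. SNP) position in the text


def find_matches(text: str, pattern: str) -> list[int]:
    """Return every start position where pattern aligns with text.

    A wildcard position in the text matches any pattern character.
    Naive O(n * m) scan; an index exists precisely to avoid this cost.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


def count_wildcard_groups(text: str) -> int:
    """Count maximal runs of consecutive wildcards (assumed meaning of d)."""
    groups = 0
    previous_was_wildcard = False
    for ch in text:
        is_wildcard = ch == WILDCARD
        if is_wildcard and not previous_was_wildcard:
            groups += 1
        previous_was_wildcard = is_wildcard
    return groups


if __name__ == "__main__":
    genome = "AC*TA**GT"  # toy text: n = 9 with three wildcard positions
    print(find_matches(genome, "CGT"))
    print(count_wildcard_groups(genome))
```

In this toy text the three wildcard positions form two groups, which is why space bounds stated in terms of $d$ can be much smaller than bounds stated in terms of the total number of wildcard positions.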
Pages: 22-35
Page count: 14