Compressed indexes for text with wildcards

Cited by: 3
Authors
Thachuk, Chris [1]
Affiliations
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z4, Canada
Keywords
Compressed indexes; Text search; Wildcard matching; Dictionary matching; SEARCH
DOI
10.1016/j.tcs.2012.08.011
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity while maintaining the query time complexity of previous approaches by giving a compressed index requiring 2nH_k(T) + o(n log σ) + O(n + d log n) bits for a text T of length n over an alphabet of size σ containing d groups of wildcards. The new index is particularly favorable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying O(n) + O(d log(n/d)) bits to also support efficient dictionary matching queries. We discuss how the space can be reduced further by a number of approaches and by allowing an increase in the worst-case query time. We also present a new query algorithm for our wildcard indexes that can greatly reduce the query working space. © 2012 Elsevier B.V. All rights reserved.
Pages: 22-35
Page count: 14