Succincter Text Indexing with Wildcards

Cited by: 0
Author
Thachuk, Chris [1]
Affiliation
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1W5, Canada
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giving a succinct index requiring (2 + o(1)) n log sigma + O(n) + O(d log n) + O(k log k) bits for a text of length n over an alphabet of size sigma containing d groups of k wildcards. The new index is particularly favourable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying O(n) + O(d log(n/d)) bits to also support efficient dictionary matching queries. We present a new query algorithm for our wildcard index that greatly reduces the query working space to O(dm + m log n) bits, where m is the length of the query. We note that, compared to previous results, this reduces the working space by two orders of magnitude when aligning short read data to the human genome.
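To make the matching problem in the abstract concrete, the following is a minimal brute-force sketch of what a query against a text with wildcard positions is expected to report. It is not the succinct index or the query algorithm described in the abstract; the function name `naive_wildcard_matches` and the choice of `*` as the wildcard symbol are assumptions made only for this illustration.

```python
# Brute-force illustration only: O(n * m) time, no index structure.
# The wildcard symbol "*" is an assumption for this sketch.
WILDCARD = "*"


def naive_wildcard_matches(text, pattern, wildcard=WILDCARD):
    """Return every start position where `pattern` aligns with `text`,
    treating each wildcard position in `text` as matching any character."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        # A text position matches if it is a wildcard or equals the pattern character.
        if all(t == wildcard or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # A short DNA-like text with one wildcard (SNP) position at index 2.
    print(naive_wildcard_matches("AC*TAC", "ACGT"))
```

An index such as the one summarized above aims to answer the same kind of query without scanning the whole text, using the space and working-space bounds stated in the abstract.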
Pages: 27-40
Page count: 14