Succincter Text Indexing with Wildcards

Cited by: 0
Author
Thachuk, Chris [1]
Affiliation
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1W5, Canada
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs), positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giving a succinct index requiring (2 + o(1)) n log σ + O(n) + O(d log n) + O(k log k) bits for a text of length n over an alphabet of size σ containing d groups of k wildcards. The new index is particularly favourable for larger alphabets and comparable for smaller alphabets, such as DNA. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying O(n) + O(d log(n/d)) bits to also support efficient dictionary matching queries. We present a new query algorithm for our wildcard index that greatly reduces the query working space to O(dm + m log n) bits, where m is the length of the query. We note that, compared to previous results, this reduces the working space by two orders of magnitude when aligning short read data to the human genome.
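To make the matching problem in the abstract concrete, the following is a minimal sketch of a naive, index-free baseline. It is not the succinct index or the query algorithm of the paper. It only illustrates what it means for a query to align to a text whose wildcard positions match any character. The `*` wildcard symbol and the function name `naive_wildcard_matches` are assumptions made for illustration.

```python
WILDCARD = "*"  # assumed placeholder for a wildcard (e.g. SNP) position in the text


def naive_wildcard_matches(text, pattern):
    """Return every start offset at which `pattern` aligns to `text`.

    A wildcard position in the text matches any pattern character.
    This brute-force scan takes O(n * m) time and uses no index; it is
    only a reference definition of the problem, not the paper's method.
    """
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        # Each text character must be a wildcard or equal the pattern character.
        if all(t == WILDCARD or t == p for t, p in zip(window, pattern)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # The wildcard at offset 2 lets "CGT" align at offset 1.
    print(naive_wildcard_matches("AC*TAC", "CGT"))
```

An index such as the one described in the abstract aims to answer the same queries without scanning the whole text, and within the stated space bounds.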
Pages: 27-40
Page count: 14