Efficient string matching in Huffman compressed texts

Cited by: 0
Authors
Fredriksson, K
Tarhio, J
Affiliations
[1] Univ Joensuu, Dept Comp Sci, FIN-80101 Joensuu, Finland
[2] Helsinki Univ Technol, Dept CSE, Espoo 02015, Finland
Keywords
Huffman compression; string matching; natural language
DOI
Not available
Chinese Library Classification
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $O(b / (H \log_2 \sigma))$ characters, where $H$ is the entropy of the text. Each super-character is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b 2^b)$ preprocessing time. The method can easily be augmented with auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / (H w))$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O((n \log_2 \sigma / b) \lceil (m + s - 1)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / (H \log_2 \sigma))$. The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language, taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.
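The idea of consuming a Huffman bitstream $b$ bits per step can be illustrated with a generic table-driven decoder. The sketch below is not the authors' algorithm. It assumes a table indexed by (position in the code tree, $b$-bit chunk), so its space use is larger than the $O(2^b)$ bound stated in the abstract, and the names `build_huffman`, `build_chunk_table` and `scan` are hypothetical. It only shows how one table lookup can emit a variable number of symbols.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman(freqs):
    """Return {symbol: bitstring} for a frequency map with at least 2 symbols."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def build_chunk_table(codes, b):
    """table[state][chunk] = (symbols emitted, next state).

    A state is a proper prefix of a codeword, i.e. an internal node of the
    code tree. Assumption of this sketch: one row of 2**b entries per state.
    """
    decode = {c: s for s, c in codes.items()}
    states = {c[:i] for c in codes.values() for i in range(len(c))}
    table = {}
    for state in states:
        row = []
        for chunk in range(1 << b):
            cur, out = state, []
            for bit in format(chunk, f"0{b}b"):
                cur += bit
                if cur in decode:  # reached a leaf: emit and restart at root
                    out.append(decode[cur])
                    cur = ""
            row.append(("".join(out), cur))
        table[state] = row
    return table


def scan(bits, codes, table, b):
    """Decode a bitstring, one table lookup per b-bit chunk."""
    decode = {c: s for s, c in codes.items()}
    state, out = "", []
    full = len(bits) - len(bits) % b
    for i in range(0, full, b):
        symbols, state = table[state][int(bits[i:i + b], 2)]
        out.append(symbols)
    for bit in bits[full:]:  # tail shorter than b: plain bitwise decoding
        state += bit
        if state in decode:
            out.append(decode[state])
            state = ""
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra"
    codes = build_huffman(Counter(text))
    bits = "".join(codes[ch] for ch in text)
    table = build_chunk_table(codes, 4)
    print(scan(bits, codes, table, 4) == text)
```

Each lookup returns every symbol completed inside the chunk plus the tree position reached at its end. A search automaton could be advanced over the emitted symbols in the same step, but that extension is left out of this sketch.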
Pages: 1 - 16
Number of pages: 16
Related papers
  • [31] ANALYZING THE PERFORMANCE DIFFERENCES BETWEEN PATTERN MATCHING AND COMPRESSED PATTERN MATCHING ON TEXTS
    Erdogan, Cihat
    Bulus, H. Nusret
    Diri, Banu
    2013 INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTER AND COMPUTATION (ICECCO), 2013, : 135 - 138
  • [32] Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts
    Daptardar, A
    Shapira, D
    DCC 2004: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2004, : 535 - 535
  • [33] Approximate string matching with Lempel-Ziv compressed indexes
    Russo, Luis M. S.
    Navarro, Gonzalo
    Oliveira, Arlindo L.
    STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2007, 4726 : 264 - +
  • [34] A Compressed Enhanced Suffix Array Supporting Fast String Matching
    Ohlebusch, Enno
    Gog, Simon
    STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2009, 5721 : 51 - 62
  • [35] A compressed string matching algorithm for face recognition with partial occlusion
    Bommidi, Krishnaveni
    Sundaramurthy, Sridhar
    MULTIMEDIA SYSTEMS, 2021, 27 (02) : 191 - 203
  • [36] A compressed string matching algorithm for face recognition with partial occlusion
    Krishnaveni Bommidi
    Sridhar Sundaramurthy
    Multimedia Systems, 2021, 27 : 191 - 203
  • [37] Approximate pattern matching in LZ77-compressed texts
    Gagie, Travis
    Gawrychowski, Pawel
    Puglisi, Simon J.
    JOURNAL OF DISCRETE ALGORITHMS, 2015, 32 : 64 - 68
  • [38] Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts
    Shapira, D
    Daptardar, A
    INFORMATION PROCESSING & MANAGEMENT, 2006, 42 (02) : 429 - 439
  • [39] Worst-case efficient single and multiple string matching on packed texts in the word-RAM model
    Belazzougui, Djamal
    JOURNAL OF DISCRETE ALGORITHMS, 2012, 14 : 91 - 106
  • [40] Time/space efficient compressed pattern matching
    Gasieniec, L
    Potapov, I
    FUNDAMENTA INFORMATICAE, 2003, 56 (1-2) : 137 - 154