Processing of Huffman compressed texts with a super-alphabet

Authors
Fredriksson, K
Tarhio, J
Affiliations
[1] Univ Joensuu, Dept CS, FIN-80101 Joensuu, Finland
[2] Aalto Univ, Dept CSE, FIN-02015 Espoo, Finland
Source
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2003, Vol. 2857
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $O(b / (H \log_2 \sigma))$ symbols, where $H$ is the entropy of the text. Each super-symbol is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b\,2^b)$ preprocessing time. The method can be easily augmented by auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / (H w))$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O(n \log_2 \sigma / b \, \lceil (m + s)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / (H \log_2 \sigma))$. The method can be applied to several other algorithms as well. We conclude with some experimental results.
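The table-driven scanning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a simplified reading in which each $b$-bit window of the compressed stream indexes a table of $2^b$ entries, every entry stores the symbols that lie wholly inside that window (one variable-size super-symbol) together with the number of bits they occupy, and the scanner advances by that many bits so the next window starts on a codeword boundary. The function names `huffman_code`, `build_table`, and `scan` are hypothetical, bits are held in a Python string for readability rather than packed into bytes, and the sketch assumes $b$ is at least the longest codeword length and that the alphabet has at least two single-character symbols.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bitstring} for a frequency map with >= 2 symbols."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def build_table(code, b):
    """For each b-bit window: (symbols wholly inside it, bits they occupy).

    Builds 2**b entries with O(b) work each.
    """
    if b < max(len(c) for c in code.values()):
        raise ValueError("this sketch assumes b >= longest codeword length")
    decode = {c: s for s, c in code.items()}
    table = []
    for window in range(1 << b):
        bits = format(window, f"0{b}b")
        symbols, start = [], 0
        for end in range(1, b + 1):
            sym = decode.get(bits[start:end])
            if sym is not None:  # prefix-free code: first match is a codeword
                symbols.append(sym)
                start = end
        table.append(("".join(symbols), start))
    return table


def scan(bits, code, table, b):
    """Decode a bitstring, one table lookup per super-symbol."""
    out, pos = [], 0
    while len(bits) - pos >= b:
        symbols, used = table[int(bits[pos:pos + b], 2)]
        out.append(symbols)
        pos += used  # variable-length move: next window starts on a boundary
    # Fewer than b bits remain: fall back to codeword-by-codeword decoding.
    decode = {c: s for s, c in code.items()}
    start = pos
    for end in range(pos + 1, len(bits) + 1):
        sym = decode.get(bits[start:end])
        if sym is not None:
            out.append(sym)
            start = end
    return "".join(out)


text = "abracadabra"
code = huffman_code(Counter(text))
encoded = "".join(code[ch] for ch in text)
table = build_table(code, 8)
print(scan(encoded, code, table, 8))  # round-trips to the original text
```

An auxiliary function such as pattern matching would, under the same assumptions, replace the `out.append(symbols)` step with a precomputed per-entry action, so that each super-symbol is still handled with a single lookup.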
Pages: 108-121
Number of pages: 14