Processing of Huffman compressed texts with a super-alphabet

Authors
Fredriksson, K
Tarhio, J
Affiliations
[1] Univ Joensuu, Dept CS, FIN-80101 Joensuu, Finland
[2] Aalto Univ, Dept CSE, FIN-02015 Espoo, Finland
Source
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2003 / Vol. 2857
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n log₂ σ/b) time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of O(b/H log₂ σ) symbols, where H is the entropy of the text. Each super-symbol is processed in O(1) time. The algorithm uses O(2^b) space and O(b·2^b) preprocessing time. The method can easily be augmented with auxiliary functions that can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n log₂ σ/Hw), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm that works in time O(n log₂ σ/b + t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n log₂ σ/b ⌈(m + s)/w⌉ + t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in O(1) time is O(b/H log₂ σ). The method can be applied to several other algorithms as well. We conclude with some experimental results.
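The super-alphabet idea in the abstract can be illustrated with a small table-driven decoder. The following Python is only a sketch under stated assumptions, not the authors' implementation: the names `huffman_code`, `build_table`, and `decode_bits` are hypothetical, the compressed text is held as a string of '0'/'1' characters rather than packed bytes, and b is assumed to be at least the length of the longest codeword. The table has 2^b entries, each storing the whole codewords that a b-bit window starts with and how many bits they occupy, so one lookup consumes one variable-size super-symbol.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text):
    """Return {symbol: bit string} for a Huffman code built from text."""
    freq = Counter(text)
    if len(freq) == 1:                      # degenerate one-symbol alphabet
        return {next(iter(freq)): "0"}
    tie = count()                           # tie-breaker so dicts are never compared
    heap = [(f, next(tie), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def build_table(code, b):
    """For every b-bit window, precompute the whole codewords it starts with.

    table[v] = (decoded symbols, number of bits those symbols occupy).
    """
    decode = {w: s for s, w in code.items()}
    if b < max(len(w) for w in decode):
        raise ValueError("this sketch assumes b >= longest codeword length")
    table = []
    for v in range(1 << b):
        bits = format(v, "0{}b".format(b))
        symbols, used, start = [], 0, 0
        for end in range(1, b + 1):
            sym = decode.get(bits[start:end])
            if sym is not None:             # prefix-free: first match is the codeword
                symbols.append(sym)
                start = used = end
        table.append(("".join(symbols), used))
    return table


def decode_bits(bits, table, b, n_symbols):
    """Decode a '0'/'1' string, consuming one super-symbol per table lookup."""
    padded = bits + "0" * b                 # keeps the final window b bits wide
    out, pos = [], 0
    while pos < len(bits):
        symbols, used = table[int(padded[pos:pos + b], 2)]
        out.append(symbols)
        pos += used                         # variable-length advance
    return "".join(out)[:n_symbols]         # drop symbols decoded from padding


text = "abracadabra"
code = huffman_code(text)
bits = "".join(code[c] for c in text)
table = build_table(code, b=8)
assert decode_bits(bits, table, 8, len(text)) == text
```

An auxiliary function such as pattern matching would replace the `out.append(symbols)` step with its own per-super-symbol action; the padding and the final trim exist only because this sketch works on a bit string of arbitrary length.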
Pages: 108-121
Page count: 14