Processing of Huffman compressed texts with a super-alphabet

Authors
Fredriksson, K
Tarhio, J
Affiliations
[1] Univ Joensuu, Dept CS, FIN-80101 Joensuu, Finland
[2] Aalto Univ, Dept CSE, FIN-02015 Espoo, Finland
Source
STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2003 / Vol. 2857
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $O(b / H \log_2 \sigma)$ symbols, where $H$ is the entropy of the text. Each super-symbol is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b 2^b)$ preprocessing time. The method can easily be augmented with auxiliary functions, which can, e.g., decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / Hw)$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O(n \log_2 \sigma / b \, \lceil (m + s)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / H \log_2 \sigma)$. The method can be applied to several other algorithms as well. We conclude with some experimental results.
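The idea of consuming the compressed stream b bits at a time can be illustrated with a generic table-driven Huffman decoder. The sketch below is not the authors' construction: it is a minimal illustration that assumes a text with at least two distinct symbols, represents the bit stream as a Python string of "0"/"1" characters, and keeps one lookup entry per (codeword prefix, b-bit chunk) pair, so it uses more space than the $O(2^b)$ bound stated in the abstract. The function names `huffman_code`, `build_tables`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(text):
    """Return {symbol: bitstring}; assumes at least two distinct symbols."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: ""}) for s, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]


def build_tables(code, b):
    """For every proper codeword prefix (state) and every b-bit chunk,
    precompute the symbols the chunk completes and the leftover prefix."""
    symbol_of = {bits: sym for sym, bits in code.items()}
    states = {bits[:i] for bits in code.values() for i in range(len(bits))}
    table = {}
    for state in states:
        for chunk in range(2 ** b):
            cur, out = state, []
            for bit in format(chunk, f"0{b}b"):
                cur += bit
                if cur in symbol_of:  # a full codeword has been read
                    out.append(symbol_of[cur])
                    cur = ""
            table[state, chunk] = ("".join(out), cur)
    return table


def decode(bits, table, b, n_symbols):
    """Decode a bit string (length a multiple of b) one chunk per step."""
    state, out = "", []
    for i in range(0, len(bits), b):
        symbols, state = table[state, int(bits[i:i + b], 2)]
        out.append(symbols)  # one table lookup emits a variable number of symbols
    return "".join(out)[:n_symbols]  # drop symbols produced by padding bits


# Usage: encode, pad to a multiple of b bits, then decode chunk by chunk.
text = "abracadabra"
code = huffman_code(text)
b = 4
bits = "".join(code[c] for c in text)
bits += "0" * (-len(bits) % b)
table = build_tables(code, b)
decoded = decode(bits, table, b, len(text))
```

Each loop iteration in `decode` performs a single lookup regardless of how many symbols the chunk completes, which mirrors the "variable number of symbols per step" behaviour the abstract describes. Larger b means fewer steps at the cost of exponentially larger tables.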
Pages: 108-121
Number of pages: 14