Processing of Huffman compressed texts with a super-alphabet

Cited by: 0

Authors:

Fredriksson, K

Tarhio, J

Affiliations:

[1] Univ Joensuu, Dept CS, FIN-80101 Joensuu, Finland

[2] Aalto Univ, Dept CSE, FIN-02015 Espoo, Finland

Source:

STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2003 / Vol. 2857

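Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812;

Abstract: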

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable size super-alphabet, with an average size of $O(b / H \log_2 \sigma)$ symbols, where $H$ is the entropy of the text. Each super-symbol is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b 2^b)$ preprocessing time. The method can be easily augmented by auxiliary functions, which can, e.g., decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / Hw)$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O(n \log_2 \sigma / b \, \lceil (m + s)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / H \log_2 \sigma)$. The method can be applied to several other algorithms as well. We conclude with some experimental results.
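The general idea of consuming a Huffman bit stream $b$ bits at a time, with each lookup emitting a variable number of original symbols, can be illustrated with a small table-driven decoder. This is only a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the bit stream is held as a string of '0'/'1' characters for readability, the alphabet is assumed to have at least two symbols, and the table keeps one row per internal tree node, so its size grows with the alphabet and it need not match the $O(2^b)$ space bound quoted in the abstract.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(freqs):
    """Return the tree root. Leaves are symbols, internal nodes are (left, right) tuples."""
    tiebreak = count()  # unique counter so heap entries never compare nodes directly
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0][2]


def code_table(root):
    """Map each symbol to its codeword ('0' = left branch, '1' = right branch)."""
    codes, stack = {}, [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


def internal_nodes(root):
    """List every internal node; these are the decoder states between chunks."""
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            nodes.append(node)
            stack.extend(node)
    return nodes


def build_super_table(root, b):
    """For each (state, b-bit chunk) precompute the emitted symbols and the next state."""
    table = {}
    for start in internal_nodes(root):
        for chunk in range(1 << b):
            node, out = start, []
            for i in range(b - 1, -1, -1):  # walk the chunk most significant bit first
                node = node[(chunk >> i) & 1]
                if not isinstance(node, tuple):  # reached a leaf: emit and restart at root
                    out.append(node)
                    node = root
            table[start, chunk] = ("".join(out), node)
    return table


def encode(text, codes, b):
    """Concatenate codewords and zero-pad to a multiple of b bits."""
    bits = "".join(codes[c] for c in text)
    return bits + "0" * (-len(bits) % b)


def decode(bits, root, table, b, n_symbols):
    """Decode with one table lookup per b-bit chunk; padding output is truncated."""
    state, out = root, []
    for i in range(0, len(bits), b):
        emitted, state = table[state, int(bits[i:i + b], 2)]
        out.append(emitted)
    return "".join(out)[:n_symbols]


if __name__ == "__main__":
    text = "abracadabra"
    root = build_huffman_tree(Counter(text))
    codes = code_table(root)
    table = build_super_table(root, 4)
    print(decode(encode(text, codes, 4), root, table, 4, len(text)))
```

Each entry of `table` plays the role of a super-symbol: a single lookup advances the scan by `b` bits regardless of how many codewords those bits contain. An auxiliary function such as a pattern matcher could, under the same assumption, be attached by storing its precomputed state change alongside the emitted symbols.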


Pages: 108-121

Page count: 14

50 items in total

[1] Efficient string matching in Huffman compressed texts
Fredriksson, K
Tarhio, J
FUNDAMENTA INFORMATICAE, 2004, 63 (01) : 1 - 16
[2] Huffman coding with an infinite alphabet
Kato, A
Han, TS
Nagaoka, H
IEEE TRANSACTIONS ON INFORMATION THEORY, 1996, 42 (03) : 977 - 984
[3] Processing compressed texts: A tractability border
Lifshits, Yury
Combinatorial Pattern Matching, Proceedings, 2007, 4580 : 228 - 240
[4] Huffman Redundancy for Large Alphabet Sources
Narimani, Hamed
Khosravifard, Mohammadali
IEEE TRANSACTIONS ON INFORMATION THEORY, 2014, 60 (03) : 1412 - 1427
[5] Level-compressed Huffman decoding
Chung, KL
Wu, JG
IEEE TRANSACTIONS ON COMMUNICATIONS, 1999, 47 (10) : 1455 - 1457
[6] Pattern matching in Huffman encoded texts
Klein, ST
Shapira, D
INFORMATION PROCESSING & MANAGEMENT, 2005, 41 (04) : 829 - 841
[7] Pattern matching in Huffman encoded texts
Klein, ST
Shapira, D
DCC 2001: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2001, : 449 - 458
[8] Alphabet partitioning techniques for semiadaptive Huffman coding of large alphabets
Chen, Dan
Chiang, Yi-Jen
Memon, Nasir
Wu, Xiaolin
IEEE TRANSACTIONS ON COMMUNICATIONS, 2007, 55 (03) : 436 - 443
[9] Generation of fast interpreters for Huffman compressed bytecode
Latendresse, M
Feeley, M
SCIENCE OF COMPUTER PROGRAMMING, 2005, 57 (03) : 295 - 317
[10] Rewriting Turkish texts written in English alphabet using Turkish alphabet
Okur, Burak Cagri
Takci, Hidayet
Akgul, Yusuf Sinan
2013 21ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2013,
