Processing of Huffman compressed texts with a super-alphabet

Cited by: 0

Authors:

Fredriksson, K

Tarhio, J

Affiliations:

[1] Univ Joensuu, Dept CS, FIN-80101 Joensuu, Finland

[2] Aalto Univ, Dept CSE, FIN-02015 Espoo, Finland

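Source:

STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS | 2003 / Vol. 2857

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract: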

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in O(n log_2 σ / b) time, where n is the size of the compressed text in bytes, σ is the size of the alphabet, and b is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of O(b / H log_2 σ) symbols, where H is the entropy of the text. Each super-symbol is processed in O(1) time. The algorithm uses O(2^b) space and O(b 2^b) preprocessing time. The method can be easily augmented by auxiliary functions, which can, for example, decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time O(n log_2 σ / Hw), where w is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time O(n log_2 σ / b + t), where t is the number of occurrences reported; and a shift-or string matching algorithm that works in time O(n log_2 σ / b ⌈(m + s)/w⌉ + t), where m is the length of the pattern and s depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e. it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in O(1) time is O(b / H log_2 σ). The method can be applied to several other algorithms as well. We conclude with some experimental results.
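
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of decoding through a table of 2^b entries, not the authors' algorithm. It assumes that b is at least the longest codeword length, that the compressed text is given as a string of '0'/'1' characters, and that symbols are single characters. The names `huffman_code`, `build_table` and `decode` are hypothetical. Each table entry holds the symbols whose codewords fit entirely inside a b-bit window, together with the number of bits they occupy, so one lookup emits a variable number of symbols.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {symbol: bitstring} for a Huffman code built from `text`."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate one-symbol alphabet
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def build_table(code, b):
    """Map every b-bit window to (decoded symbols, bits consumed)."""
    inv = {bits: sym for sym, bits in code.items()}
    if b < max(len(bits) for bits in inv):
        raise ValueError("b must be at least the longest codeword length")
    table = []
    for value in range(1 << b):          # 2^b entries
        window = format(value, f"0{b}b")
        symbols, consumed, cur = [], 0, ""
        for i, bit in enumerate(window):
            cur += bit
            if cur in inv:               # a complete codeword ends here
                symbols.append(inv[cur])
                consumed, cur = i + 1, ""
        table.append(("".join(symbols), consumed))
    return table


def decode(bits, code, b):
    """Decode a '0'/'1' string, emitting one super-symbol per table lookup."""
    table = build_table(code, b)
    inv = {c: s for s, c in code.items()}
    out, pos = [], 0
    while pos + b <= len(bits):
        # int(..., 2) costs O(b) here; a real scanner would use word operations.
        symbols, consumed = table[int(bits[pos:pos + b], 2)]
        if consumed == 0:
            raise ValueError("input is not a valid encoding for this code")
        out.append(symbols)
        pos += consumed                  # variable-length advance
    cur = ""
    for bit in bits[pos:]:               # tail shorter than b bits
        cur += bit
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)


text = "abracadabra"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(decode(encoded, code, 8) == text)
```

A larger b lets each lookup cover more symbols at the cost of a table that grows as 2^b, which is the trade-off the abstract's space bound reflects. The sketch omits the auxiliary pattern-matching functions entirely.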

Pages: 108-121

Number of pages: 14
