Efficient string matching in Huffman compressed texts

Cited by: 0

Authors:

Fredriksson, K

Tarhio, J

Affiliations:

[1] Univ Joensuu, Dept Comp Sci, FIN-80101 Joensuu, Finland

[2] Helsinki Univ Technol, Dept CSE, Espoo 02015, Finland

Source:

FUNDAMENTA INFORMATICAE | 2004 / Vol. 63 / Issue 01

Keywords:

Huffman compression; string matching; natural language;

DOI:

Not available

Chinese Library Classification number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

We present an efficient algorithm for scanning Huffman compressed texts. The algorithm parses the compressed text in $O(n \log_2 \sigma / b)$ time, where $n$ is the size of the compressed text in bytes, $\sigma$ is the size of the alphabet, and $b$ is a user-specified parameter. The method uses a variable-size super-alphabet, with an average size of $O(b / (H \log_2 \sigma))$ characters, where $H$ is the entropy of the text. Each super-character is processed in $O(1)$ time. The algorithm uses $O(2^b)$ space and $O(b 2^b)$ preprocessing time. The method can be easily augmented by auxiliary functions, which can, e.g., decompress the text or perform pattern matching in the compressed text. We give three example functions: decoding the text in average time $O(n \log_2 \sigma / (H w))$, where $w$ is the number of bits in a machine word; an Aho-Corasick dictionary matching algorithm, which works in time $O(n \log_2 \sigma / b + t)$, where $t$ is the number of occurrences reported; and a shift-or string matching algorithm that works in time $O((n \log_2 \sigma / b) \lceil (m + s - 1)/w \rceil + t)$, where $m$ is the length of the pattern and $s$ depends on the encoding. The Aho-Corasick algorithm uses an automaton with variable-length moves, i.e., it processes a variable number of states at each step. The shift-or algorithm makes variable-length shifts, effectively also processing a variable number of states at each step. The number of states processed in $O(1)$ time is $O(b / (H \log_2 \sigma))$. The method can be applied to several other algorithms as well. Finally, we apply the methods to natural language, taking the words (vocabulary) as the alphabet. This improves the compression ratio and allows more complex search problems to be solved efficiently. We conclude with some experimental results.
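The idea of consuming $b$ bits of Huffman-coded input per step through a precomputed table can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not attempt to meet the space or time bounds stated in the abstract: it assumes a plain lookup table indexed by (position in the code tree, $b$-bit chunk), an input with at least two distinct symbols, and a symbol count known to the decoder. The names `build_codes`, `build_table`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(text):
    """Return a {symbol: bitstring} Huffman code (needs >= 2 distinct symbols)."""
    freq = Counter(text)
    if len(freq) < 2:
        raise ValueError("this sketch assumes at least two distinct symbols")
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def build_table(codes, b):
    """Map (state, b-bit chunk) -> (symbols emitted, next state).

    A state is a proper prefix of some codeword, i.e. an internal node of
    the code tree; the empty string is the root.
    """
    leaf = {code: sym for sym, code in codes.items()}
    states = {code[:i] for code in codes.values() for i in range(len(code))}
    table = {}
    for state in states:
        for chunk in range(1 << b):
            prefix, out = state, []
            for bit in format(chunk, f"0{b}b"):
                prefix += bit
                if prefix in leaf:  # a full codeword was read
                    out.append(leaf[prefix])
                    prefix = ""
            table[state, chunk] = ("".join(out), prefix)
    return table


def encode(text, codes, b):
    """Encode text and split the bit stream into b-bit integers (zero-padded)."""
    bits = "".join(codes[ch] for ch in text)
    bits += "0" * (-len(bits) % b)
    return [int(bits[i:i + b], 2) for i in range(0, len(bits), b)]


def decode(chunks, table, n_symbols):
    """Decode with one table lookup per b-bit chunk."""
    state, out = "", []
    for chunk in chunks:
        symbols, state = table[state, chunk]
        out.append(symbols)
    # Padding bits may decode to spurious symbols, so cut at the known length.
    return "".join(out)[:n_symbols]


text = "abracadabra"
codes = build_codes(text)
table = build_table(codes, 4)
print(decode(encode(text, codes, 4), table, len(text)))
```

Each table entry emits a variable number of symbols per lookup, which loosely mirrors the variable-size super-characters described in the abstract; a pattern-matching variant would store automaton moves in the table instead of decoded symbols.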

Pages: 1-16

Page count: 16
