Fast and flexible word searching on compressed text

Cited by: 112
Authors
de Moura, ES
Navarro, G
Ziviani, N
Baeza-Yates, R
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, BR-31270010 Belo Horizonte, MG, Brazil
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
compressed pattern matching; natural language text compression; word searching; word-based Huffman coding
DOI
10.1145/348751.348754
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half the time of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, the experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
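The idea of searching the compressed text directly can be illustrated with a small Python sketch. This is not the authors' implementation. It swaps the byte-oriented Huffman code for a simpler rank-based byte code, in which more frequent tokens get shorter codewords and the high bit flags the first byte of each codeword. Treating separators as ordinary tokens is an assumption made for brevity, and the names `build_codebook`, `compress`, `decompress`, and `count_matches` are hypothetical.

```python
import re
from collections import Counter

# Words and separator runs are both treated as symbols of the model (an assumption).
TOKEN = re.compile(r"\w+|\W+")

def build_codebook(text):
    """First pass of a semistatic model: count tokens, then give the most
    frequent tokens the shortest byte codewords (rank-based, not Huffman)."""
    freq = Counter(TOKEN.findall(text))
    codebook = {}
    for rank, (token, _) in enumerate(freq.most_common()):
        digits = []
        while True:                      # write the rank in base 128
            digits.append(rank % 128)
            rank //= 128
            if rank == 0:
                break
        digits.reverse()
        digits[0] |= 0x80                # flag bit marks the start of a codeword
        codebook[token] = bytes(digits)
    return codebook

def compress(text, codebook):
    """Second pass: replace every token by its codeword."""
    return b"".join(codebook[t] for t in TOKEN.findall(text))

def decompress(data, codebook):
    """Split at flagged bytes and map each codeword back to its token."""
    inverse = {code: token for token, code in codebook.items()}
    out, start = [], 0
    for i in range(1, len(data) + 1):
        if i == len(data) or data[i] & 0x80:
            out.append(inverse[data[start:i]])
            start = i
    return "".join(out)

def count_matches(data, phrase, codebook):
    """Encode the phrase with the same codebook and scan the compressed
    bytes with an ordinary byte-string search; no decoding is needed."""
    tokens = TOKEN.findall(phrase)
    if not tokens or any(t not in codebook for t in tokens):
        return 0
    needle = b"".join(codebook[t] for t in tokens)
    hits, pos = 0, data.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # This simplified code is not prefix-free, so confirm that the match
        # ends on a codeword boundary (end of data or a flagged byte).
        if end == len(data) or data[end] & 0x80:
            hits += 1
        pos = data.find(needle, pos + 1)
    return hits

if __name__ == "__main__":
    text = "the cat sat on the mat and the cat ran"
    codebook = build_codebook(text)
    data = compress(text, codebook)
    print(len(text), len(data), count_matches(data, "the cat", codebook))
```

Because every codeword starts with a flagged byte, a match found by the byte search always begins at a token boundary, which is what lets a standard sequential matcher run on the compressed bytes. The end-of-match check is needed only because this rank-based code, unlike a Huffman code, is not prefix-free.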
Pages: 113-139
Page count: 27