Improving semistatic compression via phrase-based modeling

Cited by: 2
Authors
Brisaboa, Nieves R. [1]
Farina, Antonio [1]
Navarro, Gonzalo [2]
Parama, Jose R. [1]
Affiliations
[1] Univ A Coruna, Database Lab, Fac Informat, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
Text compression; Direct search; Algorithm
DOI
10.1016/j.ipm.2011.01.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step beyond by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms. (C) 2011 Elsevier Ltd. All rights reserved.
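The abstract does not spell out the coding details, so the following is only a minimal sketch of a byte-oriented end-tagged dense code as it is commonly described: symbols are ranked by decreasing frequency, and the high bit of a byte marks the last byte of a codeword. The function names (`etdc_encode`, `etdc_decode`, `compress`, `decompress`, `find_symbol`) are hypothetical, the way PETDC and PhETDC choose pairs or phrases is not modeled, and the search uses Python's built-in `bytes.find` in place of a Boyer-Moore variant.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for the symbol with 0-based frequency rank `rank`."""
    out = [128 + rank % 128]        # last byte: high bit set (the end tag)
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)      # earlier bytes: high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for one complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def compress(symbols):
    """Semistatic model: count first, then encode with a fixed vocabulary.

    A symbol may be a word or, as an assumption of this sketch, any
    hashable unit such as a tuple of words standing for a phrase.
    """
    freq = Counter(symbols)
    vocab = sorted(freq, key=lambda s: -freq[s])   # most frequent first
    rank = {s: i for i, s in enumerate(vocab)}
    data = b"".join(etdc_encode(rank[s]) for s in symbols)
    return vocab, data


def decompress(vocab, data):
    """Split the byte stream at end-tagged bytes and look symbols up."""
    out, start = [], 0
    for pos, b in enumerate(data):
        if b >= 128:                               # end of a codeword
            out.append(vocab[etdc_decode(data[start:pos + 1])])
            start = pos + 1
    return out


def find_symbol(data: bytes, code: bytes):
    """Byte offsets where `code` occurs as a whole codeword in `data`."""
    hits, pos = [], data.find(code)
    while pos != -1:
        # A true match starts at offset 0 or right after an end-tagged byte;
        # otherwise it is only the tail of a longer codeword.
        if pos == 0 or data[pos - 1] >= 128:
            hits.append(pos)
        pos = data.find(code, pos + 1)
    return hits


words = "to be or not to be".split()
vocab, data = compress(words)
restored = decompress(vocab, data)
be_offsets = find_symbol(data, etdc_encode(vocab.index("be")))
```

Because every codeword ends in a tagged byte, the compressed stream can be searched for a codeword directly and decoded from any codeword boundary, which is the property the abstract relies on for direct search and random access.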
Pages: 545-559
Page count: 15