A New Method for Short Text Compression

Cited by: 2
Authors
Aslanyurek, Murat [1]
Mesut, Altan [2]
Affiliations
[1] Kirklareli Univ, Pinarhisar Vocat Sch, Comp Programming Program, TR-39300 Kirklareli, Turkiye
[2] Trakya Univ, Comp Engn Dept, TR-22100 Edirne, Turkiye
Keywords
Machine learning; Text categorization; Text compression; k-means; Clustering; Language identification
DOI
10.1109/ACCESS.2023.3340436
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
General-purpose compression methods cannot compress short texts effectively, so methods designed for short texts often rely on static dictionaries. Achieving high compression ratios then depends on using a static dictionary that suits the text to be compressed, which is an important problem that must be solved. This study proposes WSDC (Word-based Static Dictionary Compression), a method that compresses short texts at a high ratio, together with a model that uses iterative clustering to create the static dictionaries WSDC uses. Because the k-Means clustering algorithm is run iteratively according to a set of rules, the number of static dictionaries created can vary. A method called DSWF (Dictionary Selection by Word Frequency) is also presented to determine which of the created dictionaries compresses the source text at the best ratio. Wikipedia article abstracts in six different languages were used as the dataset in the experiments. WSDC is compared with both general-purpose compression methods (Gzip, Bzip2, PPMd, Brotli and Zstd) and special-purpose methods for short texts (shoco, b64pack and smaz). According to the test results, although WSDC is slower than some of the other methods, it achieves the best compression ratios for short texts smaller than 200 bytes and, for short texts smaller than 1000 bytes, better ratios than every other method except Zstd.
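The general idea of word-based static dictionary coding with frequency-based dictionary selection can be illustrated with a minimal Python sketch. It is not the authors' WSDC or DSWF implementation: the function names, the one-byte code layout, the 255-word dictionary limit and the escape scheme are all assumptions made for illustration, and the k-Means clustering step is replaced here by corpora that are already grouped by hand.

```python
from collections import Counter

ESCAPE = 255  # hypothetical marker byte for words missing from the dictionary


def build_dictionary(corpus, size=255):
    """Return up to `size` most frequent words of a corpus (indices 0..254)."""
    counts = Counter(word for text in corpus for word in text.split())
    return [word for word, _ in counts.most_common(size)]


def select_dictionary(text, dictionaries):
    """Pick the dictionary that covers the most word occurrences in `text`."""
    words = text.split()

    def covered(dictionary):
        known = set(dictionary)
        return sum(1 for word in words if word in known)

    return max(range(len(dictionaries)), key=lambda i: covered(dictionaries[i]))


def compress(text, dictionaries):
    """Encode `text` as: dictionary id, then one byte per known word.

    Unknown words are written as ESCAPE, byte length, raw UTF-8 bytes.
    Words are assumed to be separated by single spaces.
    """
    dict_id = select_dictionary(text, dictionaries)
    index = {word: i for i, word in enumerate(dictionaries[dict_id])}
    out = bytearray([dict_id])
    for word in text.split(" "):
        if word in index:
            out.append(index[word])
        else:
            raw = word.encode("utf-8")
            if len(raw) > 255:
                raise ValueError("word too long for this toy format")
            out += bytes([ESCAPE, len(raw)]) + raw
    return bytes(out)


def decompress(data, dictionaries):
    """Invert `compress` using the same list of static dictionaries."""
    dictionary = dictionaries[data[0]]
    words, pos = [], 1
    while pos < len(data):
        code = data[pos]
        if code == ESCAPE:
            length = data[pos + 1]
            words.append(data[pos + 2:pos + 2 + length].decode("utf-8"))
            pos += 2 + length
        else:
            words.append(dictionary[code])
            pos += 1
    return " ".join(words)


# Usage: two tiny hand-grouped corpora stand in for clustered training data.
english = ["the cat sat on the mat", "the dog sat on the log"]
turkish = ["bir kedi ve bir köpek", "bir ev ve bir bahçe"]
dictionaries = [build_dictionary(english), build_dictionary(turkish)]

message = "the cat sat on the log"
packed = compress(message, dictionaries)
restored = decompress(packed, dictionaries)
```

In this toy format, a text whose words all appear in the selected dictionary costs one byte per word plus one byte for the dictionary id, which is why the quality of the selected dictionary dominates the compression ratio for very short inputs.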
Pages: 141022-141035
Number of pages: 14