Text Classification Using Compression-Based Dissimilarity Measures

Cited by: 11
Authors
Coutinho, David Pereira [1 ,2 ]
Figueiredo, Mario A. T. [3 ,4 ]
Affiliations
[1] Inst Politecn Lisboa, Inst Telecommun, P-1959007 Lisbon, Portugal
[2] Inst Politecn Lisboa, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal
[3] Univ Lisbon, Inst Telecommun, P-1049001 Lisbon, Portugal
[4] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
Keywords
Text classification; text similarity measures; relative entropy; Ziv-Merhav method; cross-parsing algorithm; STATISTICAL COMPARISONS; INDIVIDUAL SEQUENCES; FORMAL THEORY; AUTHORSHIP; CLASSIFIERS; SIMILARITY; INFERENCE;
DOI
10.1142/S0218001415530043
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they require no text pre-processing or feature engineering.
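The pipeline summarized in the abstract (a compression-style dissimilarity, a dissimilarity-based representation, then a classical classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a naive greedy cross-parsing of one text against another, in the spirit of the Ziv-Merhav method named in the keywords, and it uses the phrase count divided by the text length as the dissimilarity, which is a simplification chosen for illustration rather than the paper's relative entropy estimate. The function names (`cross_parse_count`, `dissimilarity`, `embed`, `classify`) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by greedily parsing z into the longest
    substrings that occur somewhere in x (naive search, illustration only)."""
    count, i = 0, 0
    while i < len(z):
        length = 0
        # Extend the current phrase while it still occurs in x.
        while i + length < len(z) and z[i:i + length + 1] in x:
            length += 1
        # A symbol that never occurs in x forms a one-symbol phrase.
        i += max(length, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Assumed normalization: phrases per symbol of z (lower means more similar)."""
    return cross_parse_count(z, x) / len(z) if z else 0.0


def embed(text: str, prototypes: Sequence[str]) -> List[float]:
    """Dissimilarity-based representation: one coordinate per prototype text."""
    return [dissimilarity(text, p) for p in prototypes]


def classify(text: str, train: Sequence[Tuple[str, str]]) -> str:
    """1-nearest-neighbor in the dissimilarity space, using the training
    texts themselves as prototypes and squared Euclidean distance."""
    prototypes = [t for t, _ in train]
    query = embed(text, prototypes)

    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best_text, best_label = min(
        train, key=lambda pair: sq_dist(query, embed(pair[0], prototypes))
    )
    return best_label


# Toy usage with made-up strings standing in for documents of two classes.
toy_train = [("aaaa", "A"), ("bbbb", "B")]
toy_prediction = classify("aaa", toy_train)
```

No tokenization, stemming, or feature design appears anywhere in the sketch: the only inputs are raw strings. The substring search is quadratic or worse, so a practical version would need an indexed or compressor-based parser.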
Pages: 19