Text Classification Using Compression-Based Dissimilarity Measures

Cited by: 11
Authors
Coutinho, David Pereira [1 ,2 ]
Figueiredo, Mario A. T. [3 ,4 ]
Affiliations
[1] Inst Politecn Lisboa, Inst Telecommun, P-1959007 Lisbon, Portugal
[2] Inst Politecn Lisboa, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal
[3] Univ Lisbon, Inst Telecommun, P-1049001 Lisbon, Portugal
[4] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
Keywords
Text classification; text similarity measures; relative entropy; Ziv-Merhav method; cross-parsing algorithm; STATISTICAL COMPARISONS; INDIVIDUAL SEQUENCES; FORMAL THEORY; AUTHORSHIP; CLASSIFIERS; SIMILARITY; INFERENCE;
DOI
10.1142/S0218001415530043
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
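
The pipeline outlined in the abstract (a dissimilarity measure, a dissimilarity-based representation, then a classical classifier) can be illustrated with a minimal sketch. It is not the paper's implementation: the dissimilarity below is a simplified phrase count inspired by cross-parsing, not the Ziv-Merhav relative entropy estimator itself, and the function names, the normalization by text length, and the toy training texts are all assumptions made for illustration.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained when z is greedily split into the
    longest substrings that also occur somewhere in x."""
    count = 0
    i = 0
    while i < len(z):
        length = 1
        # Grow the candidate phrase while it still occurs in x.
        while i + length <= len(z) and z[i:i + length] in x:
            length += 1
        # A symbol absent from x still forms a one-symbol phrase.
        i += max(length - 1, 1)
        count += 1
    return count


def dissimilarity(z: str, x: str) -> float:
    """Assumed normalization: phrases per symbol of z (fewer phrases
    means z is more similar to x)."""
    return cross_parse_count(z, x) / max(len(z), 1)


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Dissimilarity-based representation: one coordinate per prototype."""
    return [dissimilarity(text, p) for p in prototypes]


def classify(text: str, labeled: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor in the dissimilarity space (Euclidean distance)."""
    prototypes = [t for t, _ in labeled]
    query = embed(text, prototypes)
    best_label, _ = min(
        ((label, math.dist(query, embed(t, prototypes))) for t, label in labeled),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical toy data, invented for this sketch only.
TRAIN = [
    ("the film was wonderful and the acting was great", "pos"),
    ("a wonderful story with great characters", "pos"),
    ("the film was terrible and the acting was awful", "neg"),
    ("an awful story with terrible characters", "neg"),
]

if __name__ == "__main__":
    print(classify("great acting and a wonderful film", TRAIN))
```

No tokenization, stemming, or feature design is involved: raw strings go straight into the dissimilarity function. The substring search here is quadratic in the worst case, so a real system would likely rely on a suffix structure or an existing compressor.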
Pages: 19