Text Classification Using Compression-Based Dissimilarity Measures

Cited by: 11
Authors
Coutinho, David Pereira [1, 2]
Figueiredo, Mario A. T. [3, 4]
Affiliations
[1] Inst Politecn Lisboa, Inst Telecommun, P-1959007 Lisbon, Portugal
[2] Inst Politecn Lisboa, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal
[3] Univ Lisbon, Inst Telecommun, P-1049001 Lisbon, Portugal
[4] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
Keywords
Text classification; text similarity measures; relative entropy; Ziv-Merhav method; cross-parsing algorithm; STATISTICAL COMPARISONS; INDIVIDUAL SEQUENCES; FORMAL THEORY; AUTHORSHIP; CLASSIFIERS; SIMILARITY; INFERENCE
DOI
10.1142/S0218001415530043
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
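The idea of a dissimilarity-based representation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified stand-in for the Ziv-Merhav measure (the cross-parsing phrase count divided by the text length, rather than the full relative entropy estimate), uses every training text as a prototype, and classifies with a plain 1-nearest-neighbor rule. The function names `cross_parse_count`, `dissimilarity`, `embed`, and `classify` are hypothetical.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by parsing z into the longest
    substrings that occur somewhere in x (naive cross-parsing)."""
    count, i = 0, 0
    while i < len(z):
        length = 1  # a symbol absent from x still forms a one-symbol phrase
        while i + length < len(z) and z[i:i + length + 1] in x:
            length += 1
        count += 1
        i += length
    return count


def dissimilarity(z: str, x: str) -> float:
    """Assumed simplification: phrases per symbol of z when parsed against x.
    Few long phrases (z resembles x) give a small value."""
    return cross_parse_count(z, x) / len(z) if z else 0.0


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Map a text to its vector of dissimilarities to each prototype."""
    return [dissimilarity(text, p) for p in prototypes]


def classify(text: str, train: list[tuple[str, str]]) -> str:
    """1-nearest-neighbor in the dissimilarity space (Euclidean distance)."""
    prototypes = [t for t, _ in train]
    query = embed(text, prototypes)
    _, label = min(
        ((math.dist(query, embed(t, prototypes)), lab) for t, lab in train),
        key=lambda pair: pair[0],
    )
    return label


if __name__ == "__main__":
    toy_train = [("aaaaaaaa", "A"), ("bbbbbbbb", "B")]
    print(classify("aaaa", toy_train))
```

No tokenization, stemming, or feature selection appears anywhere in the sketch; the raw character strings are the only input, which is the property the abstract emphasizes. The naive substring search is quadratic or worse and is meant only for illustration.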
Pages: 19