Text Classification Using Compression-Based Dissimilarity Measures

Cited by: 11
Authors
Coutinho, David Pereira [1 ,2 ]
Figueiredo, Mario A. T. [3 ,4 ]
Affiliations
[1] Inst Politecn Lisboa, Inst Telecommun, P-1959007 Lisbon, Portugal
[2] Inst Politecn Lisboa, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal
[3] Univ Lisbon, Inst Telecommun, P-1049001 Lisbon, Portugal
[4] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
Keywords
Text classification; text similarity measures; relative entropy; Ziv-Merhav method; cross-parsing algorithm; STATISTICAL COMPARISONS; INDIVIDUAL SEQUENCES; FORMAL THEORY; AUTHORSHIP; CLASSIFIERS; SIMILARITY; INFERENCE
DOI
10.1142/S0218001415530043
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
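
The pipeline described in the abstract (map each text to a vector of dissimilarities, then apply a classical classifier) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' method: the paper's keywords point to the Ziv-Merhav cross-parsing estimate of relative entropy, whereas the code below substitutes the normalized compression distance computed with zlib as a stand-in dissimilarity. All function names (compressed_size, ncd, dissimilarity_vector, nearest_neighbor_label) are hypothetical and chosen for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    ASSUMPTION: used here as a stand-in for the paper's Ziv-Merhav
    based measure; it is a different compression-based dissimilarity.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_vector(text: str, prototypes: list[str]) -> list[float]:
    """Represent a text by its dissimilarity to each prototype text."""
    return [ncd(text, p) for p in prototypes]


def nearest_neighbor_label(text: str, train_texts: list[str],
                           train_labels: list[str]) -> str:
    """1-NN in the dissimilarity space; training texts act as prototypes."""
    train_vecs = [dissimilarity_vector(t, train_texts) for t in train_texts]
    query = dissimilarity_vector(text, train_texts)

    def squared_distance(a: list[float], b: list[float]) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best = min(range(len(train_texts)),
               key=lambda i: squared_distance(query, train_vecs[i]))
    return train_labels[best]


if __name__ == "__main__":
    # Toy data invented for illustration; not from the paper's experiments.
    texts = [
        "a wonderful, moving film with a superb cast and a great script",
        "a dull, tedious film with a weak cast and a terrible script",
    ]
    labels = ["positive", "negative"]
    print(nearest_neighbor_label("a superb and moving story", texts, labels))
```

No raw features are designed here: the only representation of a text is its vector of dissimilarities to the prototype texts, and any classifier that accepts numeric vectors (a support vector machine, for instance) could replace the nearest-neighbor rule.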
Pages: 19
Related papers (first 10 of 50)
  • [1] On compression-based text classification
    Marton, Y
    Wu, N
    Hellerstein, L
    ADVANCES IN INFORMATION RETRIEVAL, 2005, 3408 : 300 - 314
  • [2] Compression-Based Arabic Text Classification
    Ta'amneh, Haneen
    Abu Keshek, Ehsan
    Issa, Manar Bani
    Al-Ayyoub, Mahmoud
    Jararweh, Yaser
    2014 IEEE/ACS 11TH INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2014, : 594 - 600
  • [3] Statistical Compression-Based Models for Text Classification
    Saikrishna, Vidya
    Dowe, David L.
    Ray, Sid
    2016 FIFTH INTERNATIONAL CONFERENCE ON ECO-FRIENDLY COMPUTING AND COMMUNICATION SYSTEMS (ICECCS), 2016, : 1 - 6
  • [4] Adult Content Filtering through Compression-Based Text Classification
    Santos, Igor
    Galan-Garcia, Patxi
    Santamaria-Ibirika, Aitor
    Alonso-Isla, Borja
    Alabau-Sarasola, Iker
    Garcia Bringas, Pablo
    INTERNATIONAL JOINT CONFERENCE CISIS'12 - ICEUTE'12 - SOCO'12 SPECIAL SESSIONS, 2013, 189 : 281 - 288
  • [5] Improving the Accuracy and Efficiency of Compression-based Dissimilarity Measure using Information Quantity in Data Classification Problems
    Takamoto A.
    Kohara Y.
    Yoshida M.
    Umemura K.
    Transactions of the Japanese Society for Artificial Intelligence, 2023, 38 (01) : 1 - 15
  • [6] A compression-based text steganography method
    Satir, Esra
    Isik, Hakan
    JOURNAL OF SYSTEMS AND SOFTWARE, 2012, 85 (10) : 2385 - 2394
  • [7] A Compression-Based Dissimilarity Measure for Multi-task Clustering
    Nguyen Huy Thach
    Shao, Hao
    Tong, Bin
    Suzuki, Einoshin
    FOUNDATIONS OF INTELLIGENT SYSTEMS, 2011, 6804 : 123 - 132
  • [8] Comparing Medical Code Usage With the Compression-Based Dissimilarity Measure
    Rost, Thomas Brox
    Edsberg, Ole
    Grimsmo, Anders
    Nytro, Oystein
    MEDINFO 2007: PROCEEDINGS OF THE 12TH WORLD CONGRESS ON HEALTH (MEDICAL) INFORMATICS, PTS 1 AND 2: BUILDING SUSTAINABLE HEALTH SYSTEMS, 2007, 129 : 684 - +
  • [9] Infant Cry Classification using Compression-based Similarity Metric
    Radoi, Anamaria
    Burileanu, Corneliu
    2018 12TH INTERNATIONAL CONFERENCE ON COMMUNICATIONS (COMM), 2018, : 67 - 70
  • [10] Application of compression-based distance measures to protein sequence classification: a methodological study
    Kocsor, A
    Kertész-Farkas, A
    Kaján, L
    Pongor, S
    BIOINFORMATICS, 2006, 22 (04) : 407 - 412