Supervised term weighting centroid-based classifiers for text categorization

Authors
Tam T. Nguyen
Kuiyu Chang
Siu Cheung Hui
Affiliation
[1] Nanyang Technological University, School of Computer Engineering
Source
Knowledge and Information Systems, 2013, 35 (01)
Keywords
Centroid classification; Support vector machines; Kullback–Leibler divergence; Jensen–Shannon divergence
DOI
Not available
Abstract
In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted by the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence; we call the result CFC–JS. Our proposed supervised term weighting schemes were evaluated on five datasets; the KL- and JS-weighted classifiers consistently outperformed the baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which are otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
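The abstract does not give the exact weighting formulas, so the following is only a minimal sketch of the general idea: scoring each term by its pointwise contribution to a KL divergence (two classes) or to a generalized JS divergence (any number of classes) between class-conditional term distributions. The function names, the add-alpha smoothing, the uniform class priors, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import math


def class_conditional_probs(docs, labels, vocab, alpha=1.0):
    """Estimate P(term | class) from token counts.

    Add-alpha smoothing is an assumption of this sketch; it keeps every
    probability strictly positive so the logarithms below are defined.
    """
    classes = sorted(set(labels))
    counts = {c: dict.fromkeys(vocab, 0) for c in classes}
    for tokens, c in zip(docs, labels):
        for t in tokens:
            if t in counts[c]:
                counts[c][t] += 1
    probs = {}
    for c in classes:
        total = sum(counts[c].values()) + alpha * len(vocab)
        probs[c] = {t: (counts[c][t] + alpha) / total for t in vocab}
    return probs


def kl_term_weight(p, q):
    """Contribution of a single term to KL(P || Q): p * log(p / q)."""
    return p * math.log(p / q)


def js_term_weight(ps, priors):
    """Contribution of a single term to the generalized JS divergence.

    ps holds the term's probability under each class, priors the class
    weights; m is the term's probability under the prior-weighted mixture.
    """
    m = sum(w * p for w, p in zip(priors, ps))
    return sum(w * p * math.log(p / m) for w, p in zip(priors, ps) if p > 0)


# Hypothetical toy corpus: "the" occurs equally in both classes.
docs = [
    ["ball", "goal", "the"],
    ["goal", "team", "the"],
    ["vote", "law", "the"],
    ["law", "court", "the"],
]
labels = ["sport", "sport", "politics", "politics"]
vocab = sorted({t for d in docs for t in d})

probs = class_conditional_probs(docs, labels, vocab)
kl = {t: kl_term_weight(probs["sport"][t], probs["politics"][t]) for t in vocab}
priors = [0.5, 0.5]  # uniform class priors, assumed for this sketch
js = {t: js_term_weight([probs[c][t] for c in sorted(probs)], priors) for t in vocab}
```

In this toy setting a term spread evenly across classes, such as "the", receives a weight of zero under both schemes, while a class-specific term such as "goal" receives a positive weight. This graded behavior is the kind of distribution-based weighting the abstract contrasts with pruning based on presence or absence alone.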
Pages: 61-85
Page count: 24
Related papers
  • [1] Supervised term weighting centroid-based classifiers for text categorization
    Nguyen, Tam T.
    Chang, Kuiyu
    Hui, Siu Cheung
    KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 35 (01) : 61 - 85
  • [2] Semi-supervised Single-label Text Categorization using Centroid-based Classifiers
    Cardoso-Cachopo, Ana
    Oliveira, Arlindo L.
    APPLIED COMPUTING 2007, VOL 1 AND 2, 2007, : 844 - +
  • [3] Effect of term distributions on centroid-based text categorization
    Lertnattee, V
    Theeramunkong, T
    INFORMATION SCIENCES, 2004, 158 : 89 - 115
  • [4] Term-length normalization for centroid-based text categorization
    Lertnattee, V
    Theeramunkong, T
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 1, PROCEEDINGS, 2003, 2773 : 850 - 856
  • [5] Class normalization in centroid-based text categorization
    Lertnattee, Verayuth
    Theeramunkong, Thanaruk
    INFORMATION SCIENCES, 2006, 176 (12) : 1712 - 1738
  • [6] A Framework of Centroid-Based Methods for Text Categorization
    Wang, Dandan
    Chen, Qingcai
    Wang, Xiaolong
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2014, E97D (02) : 245 - 254
  • [7] A New Centroid-Based Classifier for Text Categorization
    Chen, Lifei
    Ye, Yanfang
    Jiang, Qingshan
    2008 22ND INTERNATIONAL WORKSHOPS ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS, VOLS 1-3, 2008, : 1217 - +
  • [8] Using Different Term Weighting Schemes of Centroid-based Classifiers to Classify Drug Monographs
    Lertnattee, Verayuth
    Lueviphan, Chanisara
    PROGRESS IN MECHATRONICS AND INFORMATION TECHNOLOGY, PTS 1 AND 2, 2014, 462-463 : 968 - 973
  • [9] Combining homogeneous classifiers for centroid-based text classification
    Lertnattee, V
    Theeramunkong, T
    ISCC 2002: SEVENTH INTERNATIONAL SYMPOSIUM ON COMPUTERS AND COMMUNICATIONS, PROCEEDINGS, 2002, : 1034 - 1039
  • [10] A new Centroid-Based Classification model for text categorization
    Liu, Chuan
    Wang, Wenyong
    Tu, Guanghui
    Xiang, Yu
    Wang, Siyang
    Lv, Fengmao
    KNOWLEDGE-BASED SYSTEMS, 2017, 136 : 15 - 26