Supervised term weighting centroid-based classifiers for text categorization

Authors
Tam T. Nguyen
Kuiyu Chang
Siu Cheung Hui
Affiliation
[1] Nanyang Technological University, School of Computer Engineering
Keywords
Centroid classification; Support vector machines; Kullback–Leibler divergence; Jensen–Shannon divergence
Abstract
In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence; we call the resulting classifier CFC–JS. Our proposed supervised term weighting schemes were evaluated on 5 datasets; the KL- and JS-weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
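The abstract does not give the exact weighting formulas, so the following is only a minimal sketch of the general idea of divergence-based supervised term weighting, not the authors' CFC–KL or CFC–JS definitions. It assumes each term is modelled per class as a Bernoulli "term present in a document" distribution with add-one smoothing, uses the standard KL divergence for two classes, and uses the standard generalized Jensen–Shannon divergence (entropy of the mixture minus the weighted mean of the entropies) for several classes. All function names, the smoothing parameter `alpha`, and the toy data are hypothetical.

```python
import numpy as np

def class_term_probs(X, y, n_classes, alpha=1.0):
    """Smoothed P(term present | class) for a binary doc-term matrix X.

    Assumption: Bernoulli presence model with add-alpha smoothing.
    """
    probs = np.empty((n_classes, X.shape[1]))
    for c in range(n_classes):
        docs = X[y == c]
        probs[c] = (docs.sum(axis=0) + alpha) / (docs.shape[0] + 2 * alpha)
    return probs

def bernoulli_entropy(p):
    """Entropy in bits of Bernoulli(p), elementwise (requires 0 < p < 1)."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def kl_weights(probs):
    """Per-term KL(P_0 || P_1) in bits for exactly two classes."""
    p, q = probs[0], probs[1]
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

def js_weights(probs, priors):
    """Per-term generalized JS divergence across any number of classes."""
    mixture = priors @ probs  # prior-weighted average distribution per term
    return bernoulli_entropy(mixture) - priors @ bernoulli_entropy(probs)

# Toy corpus: term 0 occurs only in class 0, term 1 in every document,
# term 2 only in class 1.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 1]])
y = np.array([0, 0, 1, 1])

probs = class_term_probs(X, y, n_classes=2)
priors = np.bincount(y) / len(y)
w_kl = kl_weights(probs)
w_js = js_weights(probs, priors)
print("KL weights:", w_kl)
print("JS weights:", w_js)
```

In this toy setup the term that appears in every document receives zero weight under both measures, while the class-specific terms receive positive weight. Unlike a hard presence/absence rule, a term that is common to all classes but much more frequent in one of them would still receive a small positive weight.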
Pages: 61-85
Page count: 24