Supervised term weighting centroid-based classifiers for text categorization

Cited: 0
Authors
Tam T. Nguyen
Kuiyu Chang
Siu Cheung Hui
Affiliations
[1] Nanyang Technological University, School of Computer Engineering
Keywords
Centroid classification; Support vector machines; Kullback–Leibler divergence; Jensen–Shannon divergence
DOI: not available
Abstract
In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's worthy design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the mere presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence, yielding CFC–JS. Our proposed supervised term weighting schemes were evaluated on five datasets, where the KL- and JS-weighted classifiers consistently outperformed the baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which would otherwise be obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improve centroid-based classifiers but also benefit SVM classifiers.
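
The abstract describes weighting terms by the KL divergence between class-conditional term probabilities (CFC–KL) and, for the multi-class case, by the JS divergence (CFC–JS). The Python sketch below only illustrates this general idea on a toy term-by-class count matrix; the smoothing constant, the symmetric form of the KL weight, and all function names are illustrative assumptions, not the exact formulation from the paper.

# Minimal sketch (not the authors' exact formulation) of divergence-based
# supervised term weighting for a centroid-style classifier.
import numpy as np

def class_conditional_probs(counts, eps=1e-9):
    """Column-normalize raw term counts into P(term | class), with smoothing."""
    counts = counts.astype(float) + eps
    return counts / counts.sum(axis=0, keepdims=True)

def kl_term_weights(p, q):
    """Per-term symmetric KL contribution between two class-conditional
    distributions p = P(t|c1) and q = P(t|c2) (binary-class case)."""
    return p * np.log(p / q) + q * np.log(q / p)

def js_term_weights(P, priors=None):
    """Per-term Jensen-Shannon divergence contribution across all classes
    (multi-class case). P has shape (n_terms, n_classes)."""
    n_terms, n_classes = P.shape
    if priors is None:
        priors = np.full(n_classes, 1.0 / n_classes)
    m = P @ priors                      # mixture distribution per term
    # sum_c pi_c * P(t|c) * log(P(t|c) / m(t)), accumulated term by term
    return np.sum(priors * P * np.log(P / m[:, None]), axis=1)

# Toy data: rows = terms, columns = classes (raw term counts per class).
counts = np.array([[30,  2],
                   [ 5, 25],
                   [10, 10],
                   [ 1, 40]])
P = class_conditional_probs(counts)

w_kl = kl_term_weights(P[:, 0], P[:, 1])   # binary-class KL weights
w_js = js_term_weights(P)                  # multi-class JS weights

# Weighted class centroids: each term's contribution is scaled by its weight,
# so class-discriminative terms dominate the prototype vectors.
centroids = P * w_js[:, None]
print("KL weights:", np.round(w_kl, 3))
print("JS weights:", np.round(w_js, 3))

In this toy example, terms whose conditional probabilities differ sharply between classes (rows 1, 2, and 4) receive large weights, while the term that is equally likely in both classes (row 3) receives a weight near zero, which mirrors the intuition behind the KL and JS weighting schemes described in the abstract.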
Pages: 61-85
Number of pages: 24
Related papers
50 items in total
  • [31] Deisy, C., Gowri, M., Baskar, S., Kalaiarasi, S. M. A., Ramraj, N.: A Novel Term Weighting Scheme MIDF for Text Categorization. Journal of Engineering Science and Technology, 2010, 5(1): 94-107
  • [32] Lan, M., Sung, S. Y., Low, H. B., Tan, C. L.: A Comparative Study on Term Weighting Schemes for Text Categorization. Proceedings of the International Joint Conference on Neural Networks (IJCNN), Vols. 1-5, 2005: 546-551
  • [33] Coban, Onder, Ozel, Selma Ayse: Utilizing Language Model for Term Weighting in Text Categorization. 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), 2018
  • [34] Altincay, Hakan, Erenel, Zafer: Analytical Evaluation of Term Weighting Schemes for Text Categorization. Pattern Recognition Letters, 2010, 31(11): 1310-1323
  • [35] Nguyen Pham Xuan, Hieu Le Quang: A New Improved Term Weighting Scheme for Text Categorization. Knowledge and Systems Engineering (KSE 2013), Vol. 1, 2014, 244: 261-270
  • [36] Xu, Hongzhi, Li, Chunping: A Novel Term Weighting Scheme for Automated Text Categorization. Proceedings of the 7th International Conference on Intelligent Systems Design and Applications, 2007: 759-764
  • [37] Hall, Peter, Pham, Tung: Optimal Properties of Centroid-Based Classifiers for Very High-Dimensional Data. Annals of Statistics, 2010, 38(2): 1071-1093
  • [38] Shanavas, Niloofer, Wang, Hui, Lin, Zhiwei, Hawe, Glenn: Structure-Based Supervised Term Weighting and Regularization for Text Classification. Natural Language Processing and Information Systems (NLDB 2019), 2019, 11608: 105-117
  • [39] Pang, Guansong, Jiang, Shengyi: A Generalized Cluster Centroid Based Classifier for Text Categorization. Information Processing & Management, 2013, 49(2): 576-586
  • [40] Erenel, Zafer, Altincay, Hakan, Varoglu, Ekrem: Explicit Use of Term Occurrence Probabilities for Term Weighting in Text Categorization. Journal of Information Science and Engineering, 2011, 27(3): 819-834