Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Cited by: 347
Authors
Lan, Man [1, 2]
Tan, Chew Lim [2]
Su, Jian [3]
Lu, Yue [1]
Affiliations
[1] E China Normal Univ, Dept Comp Sci & Technol, Shanghai 200241, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117590, Singapore
[3] Inst Infocomm Res, Singapore 119613, Singapore
Keywords
Text categorization; text representation; term weighting; SVM; kNN; RELEVANCE
DOI
10.1109/TPAMI.2008.110
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than the other term weighting methods, while most supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method does not show uniformly good performance across different data sets.
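The abstract names tf.rf but does not state its formula. The sketch below assumes the relevance-frequency definition commonly attributed to this paper, rf = log2(2 + a / max(1, c)), where a is the number of positive-category documents containing the term and c is the number of negative-category documents containing it. That formula, the function names, and the toy corpus are assumptions for illustration and are not taken from this record.

```python
import math
from collections import Counter


def rf(a, c):
    # Assumed relevance frequency: log2(2 + a / max(1, c)).
    # a = positive-category documents containing the term,
    # c = negative-category documents containing the term.
    return math.log2(2 + a / max(1, c))


def tf_rf(docs, labels, positive):
    """Weight each document's terms by raw term frequency times rf.

    docs is a list of token lists, labels the category of each document,
    and positive the category treated as the positive class.
    """
    a = Counter()  # document frequency within the positive category
    c = Counter()  # document frequency within all other categories
    for tokens, label in zip(docs, labels):
        target = a if label == positive else c
        target.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: n * rf(a[t], c[t]) for t, n in tf.items()})
    return vectors


# Hypothetical toy corpus with two categories.
docs = [
    ["wheat", "price", "wheat"],
    ["wheat", "export"],
    ["stock", "price"],
    ["stock", "market", "price"],
]
labels = ["grain", "grain", "finance", "finance"]
weights = tf_rf(docs, labels, positive="grain")
```

In this toy corpus "wheat" occurs only in the positive category, so it receives a larger factor than "price", which occurs in both categories, or "stock", which occurs only in the negative one. The resulting sparse vectors could then be passed to a classifier such as SVM or kNN.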
Pages: 721-735
Number of pages: 15