Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Cited by: 347
Authors
Lan, Man [1, 2]
Tan, Chew Lim [2]
Su, Jian [3]
Lu, Yue [1]
Affiliations
[1] E China Normal Univ, Dept Comp Sci & Technol, Shanghai 200241, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117590, Singapore
[3] Inst Infocomm Res, Singapore 119613, Singapore
Keywords
Text categorization; text representation; term weighting; SVM; kNN; RELEVANCE
DOI
10.1109/TPAMI.2008.110
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with the SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than the other term weighting methods, while most supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popular tf.idf method does not perform uniformly well across the different data sets.
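The abstract names tf.rf but does not give its formula. As a minimal sketch, assuming the relevance frequency definition commonly cited from the paper, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term, the weighting could look like the following. The function names and the toy documents are hypothetical, and a plain tf.idf is included only for contrast.

```python
import math
from collections import Counter


def tf_rf(doc_tokens, pos_docs, neg_docs):
    """Weight each term of one document by tf * rf for a single category.

    Assumption (not stated in the abstract): rf = log2(2 + a / max(1, c)),
    with a = positive-category documents containing the term and
    c = negative-category documents containing the term.
    pos_docs and neg_docs are collections of term sets.
    """
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        a = sum(1 for d in pos_docs if term in d)
        c = sum(1 for d in neg_docs if term in d)
        weights[term] = tf * math.log2(2 + a / max(1, c))
    return weights


def tf_idf(doc_tokens, all_docs):
    """Unsupervised baseline: tf * log2(N / df), ignoring category labels."""
    n = len(all_docs)
    weights = {}
    for term, tf in Counter(doc_tokens).items():
        df = sum(1 for d in all_docs if term in d)
        weights[term] = tf * math.log2(n / max(1, df))
    return weights


# Hypothetical toy collection: two documents in the category, two outside it.
pos_docs = [{"wheat", "price"}, {"wheat", "export"}]
neg_docs = [{"price", "stock"}, {"stock", "bank"}]
doc = ["wheat", "wheat", "price"]

rf_weights = tf_rf(doc, pos_docs, neg_docs)
idf_weights = tf_idf(doc, pos_docs + neg_docs)
print(rf_weights)
print(idf_weights)
```

In this toy collection "wheat" and "price" each occur in two of the four documents, so idf treats them alike, whereas rf gives the larger factor to "wheat" because it appears only in the positive category.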
Pages: 721-735
Page count: 15
Related papers
50 in total
  • [21] Explicit Use of Term Occurrence Probabilities for Term Weighting in Text Categorization
    Erenel, Zafer
    Altincay, Hakan
    Varoglu, Ekrem
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2011, 27 (03) : 819 - 834
  • [22] Automatic Evaluation of Interpretability Methods in Text Categorization
    A. Rogov
    N. Loukachevitch
    Journal of Mathematical Sciences, 2024, 285 (2) : 201 - 209
  • [23] A New Supervised Term Ranking Method for Text Categorization
    Mammadov, Musa
    Yearwood, John
    Zhao, Lei
    AI 2010: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2010, 6464 : 102 - 111
  • [24] A supervised term selection technique for effective text categorization
    Basu, Tanmay
    Murthy, C. A.
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2016, 7 (05) : 877 - 892
  • [25] A supervised term selection technique for effective text categorization
    Tanmay Basu
    C. A. Murthy
    International Journal of Machine Learning and Cybernetics, 2016, 7 : 877 - 892
  • [26] On entropy-based term weighting schemes for text categorization
    Wang, Tao
    Cai, Yi
    Leung, Ho-fung
    Lau, Raymond Y. K.
    Xie, Haoran
    Li, Qing
    KNOWLEDGE AND INFORMATION SYSTEMS, 2021, 63 (09) : 2313 - 2346
  • [27] On entropy-based term weighting schemes for text categorization
    Tao Wang
    Yi Cai
    Ho-fung Leung
    Raymond Y. K. Lau
    Haoran Xie
    Qing Li
    Knowledge and Information Systems, 2021, 63 : 2313 - 2346
  • [28] On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
    Dogan, Turgut
    Uysal, Alper Kursat
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2019, 44 (11) : 9545 - 9560
  • [29] Automated Categorization of Research Papers with MONO Supervised Term Weighting in RECApp
    Biol, Ivic Jan A.
    Depositario, Rhey Marc A.
    Noangay, Glenn Geo T.
    Melchor, Julian Michael F.
    Abalorio, Cristopher C.
    Bustillo, James Cloyd M.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (02) : 332 - 339
  • [30] On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
    Turgut Dogan
    Alper Kursat Uysal
    Arabian Journal for Science and Engineering, 2019, 44 : 9545 - 9560