Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Cited by: 347
Authors
Lan, Man [1, 2]
Tan, Chew Lim [2]
Su, Jian [3]
Lu, Yue [1]
Affiliations
[1] E China Normal Univ, Dept Comp Sci & Technol, Shanghai 200241, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117590, Singapore
[3] Inst Infocomm Res, Singapore 119613, Singapore
Keywords
Text categorization; text representation; term weighting; SVM; kNN; RELEVANCE
DOI
10.1109/TPAMI.2008.110
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, i.e., tf.rf, to improve the terms' discriminating power for the text categorization task. From the controlled experimental results, these supervised term weighting methods have mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than other term weighting methods, while most supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popularly used tf.idf method has not shown uniformly good performance across different data sets.
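The abstract names tf.rf and tf.idf without giving their formulas. As a rough illustration only, the Python sketch below assumes the relevance-frequency definition usually associated with this paper, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category training documents containing the term, and a plain logarithmic idf. The function and parameter names are hypothetical and are not taken from this record.

```python
import math


def tf_rf(tf, pos_docs_with_term, neg_docs_with_term):
    """Sketch of tf.rf, assuming rf = log2(2 + a / max(1, c)).

    a = pos_docs_with_term: positive-category documents containing the term
    c = neg_docs_with_term: negative-category documents containing the term
    """
    rf = math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))
    return tf * rf


def tf_idf(tf, n_docs, docs_with_term):
    """Sketch of an unsupervised tf.idf weight, assuming idf = log(N / df)."""
    return tf * math.log(n_docs / docs_with_term)


# A term concentrated in the positive category gets a larger rf factor
# than a term that appears only in negative-category documents.
concentrated = tf_rf(tf=2, pos_docs_with_term=6, neg_docs_with_term=0)
negative_only = tf_rf(tf=2, pos_docs_with_term=0, neg_docs_with_term=6)
```

Under this assumed formula the weight depends on which category the documents containing the term belong to, whereas the idf factor uses only the total document frequency and ignores category labels.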
Pages: 721-735
Page count: 15
Related papers
50 in total
  • [1] Supervised term weighting for automated text categorization
    Debole, F
    Sebastiani, F
    TEXT MINING AND ITS APPLICATIONS, 2004, 138 : 81 - 97
  • [2] A Supervised Term Weighting Scheme for Multi-class Text Categorization
    Gu, Yiwei
    Gu, Xiaodong
    INTELLIGENT COMPUTING METHODOLOGIES, ICIC 2017, PT III, 2017, 10363 : 436 - 447
  • [3] Supervised term weighting centroid-based classifiers for text categorization
    Tam T. Nguyen
    Kuiyu Chang
    Siu Cheung Hui
    Knowledge and Information Systems, 2013, 35 : 61 - 85
  • [4] Supervised term weighting centroid-based classifiers for text categorization
    Nguyen, Tam T.
    Chang, Kuiyu
    Hui, Siu Cheung
    KNOWLEDGE AND INFORMATION SYSTEMS, 2013, 35 (01) : 61 - 85
  • [5] Supervised and Traditional Term Weighting Methods for Sentiment Analysis
    Cetin, Mahmut
    Amasyali, M. Fatih
    2013 21ST SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2013,
  • [6] A term weighting approach for text categorization
    Lee, KC
    Kang, SS
    Hahn, KS
    INFORMATION RETRIEVAL TECHNOLOGY, PROCEEDINGS, 2005, 3689 : 673 - 678
  • [7] Inverse-Category-Frequency Based Supervised Term Weighting Schemes for Text Categorization
    Wang, Deqing
    Zhang, Hui
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2013, 29 (02) : 209 - 225
  • [8] Two novel term weighting for text categorization
    Matsunaga, L. A.
    Ebecken, N. F. F.
    DATA MINING IX: DATA MINING, PROTECTION, DETECTION AND OTHER SECURITY TECHNOLOGIES, 2008, 40 : 105 - 114
  • [9] A semantic term weighting scheme for text categorization
    Luo, Qiming
    Chen, Enhong
    Xiong, Hui
    EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (10) : 12708 - 12716
  • [10] A new document representation based on global policy for supervised term weighting schemes in text categorization
    Jia, Longjia
    Zhang, Bangzuo
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2022, 19 (05) : 5223 - 5240