Supervised and Traditional Term Weighting Methods for Automatic Text Categorization

Cited by: 347
Authors
Lan, Man [1, 2]
Tan, Chew Lim [2]
Su, Jian [3]
Lu, Yue [1]
Affiliations
[1] E China Normal Univ, Dept Comp Sci & Technol, Shanghai 200241, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Dept Comp Sci, Singapore 117590, Singapore
[3] Inst Infocomm Res, Singapore 119613, Singapore
Keywords
Text categorization; text representation; term weighting; SVM; kNN; relevance
DOI
10.1109/TPAMI.2008.110
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e., words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with the SVM and kNN algorithms. Taking into account the distribution of relevant documents in the collection, we propose a new, simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In the controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently performs better than the other term weighting methods, while most supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the widely used tf.idf method does not perform uniformly well across the different data sets.
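The abstract names tf.rf but does not state its formula. The sketch below assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a is the number of positive-category training documents containing the term and c is the number of negative-category documents containing it. The function names and the toy documents are hypothetical and only illustrate the idea; this is not the authors' implementation.

```python
import math
from collections import Counter


def rf(a: int, c: int) -> float:
    """Relevance frequency under the assumed definition log2(2 + a / max(1, c)).

    a: positive-category documents that contain the term
    c: negative-category documents that contain the term
    """
    return math.log2(2 + a / max(1, c))


def tf_rf_weights(doc_tokens, pos_docs, neg_docs):
    """Weight each term of one document by raw term frequency times rf.

    pos_docs and neg_docs are lists of token sets (hypothetical toy format).
    """
    tf = Counter(doc_tokens)
    weights = {}
    for term, count in tf.items():
        a = sum(1 for d in pos_docs if term in d)
        c = sum(1 for d in neg_docs if term in d)
        weights[term] = count * rf(a, c)
    return weights


# Toy training collection for a single category (made-up data).
pos_docs = [{"wheat", "price"}, {"wheat", "export"}]
neg_docs = [{"price", "stock"}, {"stock", "market"}]

weights = tf_rf_weights(["wheat", "wheat", "price"], pos_docs, neg_docs)
print(weights)
```

In this toy case "wheat" appears only in positive documents and so receives a larger rf factor than "price", which is spread evenly across both classes. Unlike idf, the factor depends on the category labels, which is what makes the scheme supervised.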
Pages: 721-735
Page count: 15
Related papers
50 in total
  • [31] A Symmetric Term Weighting Scheme for Text Categorization Based on Term Occurrence Probabilities
    Erenel, Zafer
    Altincay, Hakan
    Varoglu, Ekrem
    2009 FIFTH INTERNATIONAL CONFERENCE ON SOFT COMPUTING, COMPUTING WITH WORDS AND PERCEPTIONS IN SYSTEM ANALYSIS, DECISION AND CONTROL, 2010, : 215 - 218
  • [32] TERM-WEIGHTING APPROACHES IN AUTOMATIC TEXT RETRIEVAL
    SALTON, G
    BUCKLEY, C
    INFORMATION PROCESSING & MANAGEMENT, 1988, 24 (05) : 513 - 523
  • [33] An improved supervised term weighting scheme for text representation and classification
    Tang, Zhong
    Li, Wenqiang
    Li, Yan
    EXPERT SYSTEMS WITH APPLICATIONS, 2022, 189
  • [34] Text categorization methods for automatic estimation of verbal intelligence
    Fernandez-Martinez, Fernando
    Zablotskaya, Kseniya
    Minker, Wolfgang
    EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39 (10) : 9807 - 9820
  • [35] Hadoop MapReduce Implementation of A Novel scheme for Term weighting in Text Categorization
    Dalavi, Manesh
    Cheke, Shailesh
    2014 INTERNATIONAL CONFERENCE ON CONTROL, INSTRUMENTATION, COMMUNICATION AND COMPUTATIONAL TECHNOLOGIES (ICCICCT), 2014, : 994 - 999
  • [36] A Term Weighting Scheme Based on the Measure of Relevance and Distinction for Text Categorization
    Yang, Jieming
    Wang, Jing
    Liu, Zhiying
    Qu, Zhaoyang
    2015 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2015, : 63 - 68
  • [37] Term Weighting using Contextual Information for Categorization of Unstructured Text Documents
    Kulkarni, Anagha
    Tokekar, Vrinda
    Kulkarni, Parag
    2015 ANNUAL IEEE INDIA CONFERENCE (INDICON), 2015,
  • [38] Imbalanced Text Categorization Based on Positive and Negative Term Weighting Approach
    Naderalvojoud, Behzad
    Sezer, Ebru Akcapinar
    Ucan, Alaettin
    TEXT, SPEECH, AND DIALOGUE (TSD 2015), 2015, 9302 : 325 - 333
  • [39] An Arabic text categorization approach using term weighting and multiple reducts
    Qasem A. Al-Radaideh
    Mohammed A. Al-Abrat
    Soft Computing, 2019, 23 : 5849 - 5863
  • [40] A new term-weighting scheme for naive Bayes text categorization
    Mendoza, Marcelo
    INTERNATIONAL JOURNAL OF WEB INFORMATION SYSTEMS, 2012, 8 (01) : 55 - +