A new feature selection method for text classification

Cited by: 8
Authors
Uchyigit, Gulden [1]
Clark, Keith [1]
Affiliations
[1] Univ London Imperial Coll Sci Technol & Med, Dept Comp, London SW7 2AZ, England
Keywords
feature selection; text classification; statistical inference
DOI
10.1142/S0218001407005466
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text classification is the problem of assigning each document in a set to one of a pre-defined set of classes. A major difficulty in text classification is the high dimensionality of the feature space. Only a small subset of the words are feature words that can be used to determine a document's class, while the rest add noise, can make the results unreliable, and significantly increase computation time. A common approach to this problem is feature selection, in which the number of words in the feature space is significantly reduced. In this paper we present the experiments of a comparative study of feature selection methods used for text classification. Ten feature selection methods were evaluated in this study, including a new feature selection method called the GU metric. The other feature selection methods evaluated are: the Chi-Squared (χ²) statistic, NGL coefficient, GSS coefficient, Mutual Information, Information Gain, Odds Ratio, Term Frequency, Fisher Criterion, and BSS/WSS coefficient. The experimental evaluations show that the GU metric obtained the best F1 and F2 scores. The experiments were performed on the 20 Newsgroups data sets with the Naive Bayesian Probabilistic Classifier.
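To make the idea of filter-style feature selection concrete, the following is a minimal sketch of one of the baseline methods named in the abstract, the Chi-Squared statistic, computed from a 2x2 term/class contingency table of document counts. It is not the GU metric, whose definition is not given in this record, and it is not the authors' code. The function names (`chi_squared`, `select_features`, `f_beta`) and the toy documents are hypothetical, and the reading of the two reported scores as the F-beta measure with beta = 1 and beta = 2 is an assumption.

```python
from collections import Counter


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 term/class table of document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no signal
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, k):
    """Rank terms by chi-squared against one target class and keep the top k."""
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (pos_df if y == target else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_squared(a, b, n_pos - a, n_neg - b)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def f_beta(precision, recall, beta):
    """F-beta measure; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical toy corpus, for illustration only.
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["cpu", "disk", "team"],
    ["disk", "cpu", "ram"],
]
labels = ["sport", "sport", "comp", "comp"]
top_terms = select_features(docs, labels, "sport", 3)
```

Because the Chi-Squared statistic squares the association term, it scores words that indicate the absence of a class as highly as words that indicate its presence; in the toy corpus, `cpu` and `disk` tie with `goal` for the `sport` class.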
Pages: 423-438
Number of pages: 16
Related papers
50 in total
  • [21] A NEW FEATURE SELECTION METHOD BASED ON CONCEPT EXTRACTION IN AUTOMATIC CHINESE TEXT CLASSIFICATION
    Liao, Shasha
    Jiang, Minghu
    NEW MATHEMATICS AND NATURAL COMPUTATION, 2007, 3 (03) : 331 - 347
  • [22] A feature selection method to handle imbalanced data in text classification
    Chang, Fengxiang
    Guo, Jun
    Xu, Weiran
    Yao, Kejun
    Journal of Digital Information Management, 2015, 13 (03) : 169 - 175
  • [23] A hybrid method of feature selection for Chinese text sentiment classification
    Wang, Suge
    Wei, Yingjie
    Li, Deyu
    Zhang, Wu
    Li, Wei
    FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 3, PROCEEDINGS, 2007, : 435 - +
  • [24] Research on Feature Selection Method in Chinese Text Automatic Classification
    Hong, Ying
    Shao, Xiwen
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON APPLIED SCIENCE AND ENGINEERING INNOVATION, 2015, 12 : 1759 - 1763
  • [25] Research on feature selection method in Chinese text automatic classification
    Hong, Ying
    Geng, Zengmin
    ENERGY SCIENCE AND APPLIED TECHNOLOGY, 2016, : 359 - 361
  • [26] Two-stage Feature Selection Method for Text Classification
    Li Xi
    Dai Hang
    Wang Mingwen
    MINES 2009: FIRST INTERNATIONAL CONFERENCE ON MULTIMEDIA INFORMATION NETWORKING AND SECURITY, VOL 1, PROCEEDINGS, 2009, : 234 - +
  • [27] A novel filter feature selection method for text classification: Extensive Feature Selector
    Parlak, Bekir
    Uysal, Alper Kursat
    JOURNAL OF INFORMATION SCIENCE, 2023, 49 (01) : 59 - 78
  • [28] A New Big Data Feature Selection Approach for Text Classification
    Amazal, Houda
    Kissi, Mohamed
    SCIENTIFIC PROGRAMMING, 2021, 2021
  • [29] A New Method of Feature Selection for Flow Classification
    Sun, Meifeng
    Chen, Jingtao
    Zhang, Yun
    Shi, Shangzhe
    INTERNATIONAL CONFERENCE ON APPLIED PHYSICS AND INDUSTRIAL ENGINEERING 2012, PT C, 2012, 24 : 1729 - 1736
  • [30] A New Method of Feature Selection for Flow Classification
    Sun, Meifeng
    Chen, Jingtao
    Zhang, Yun
    Shi, Shangzhe
    2010 INTERNATIONAL COLLOQUIUM ON COMPUTING, COMMUNICATION, CONTROL, AND MANAGEMENT (CCCM2010), VOL I, 2010, : 299 - 302