A novel feature selection technique for enhancing performance of unbalanced text classification problem

Cited by: 0
Authors
Behera, Santosh Kumar [1]
Dash, Rajashree [1]
Affiliation
[1] Siksha O Anusandhan Deemed Univ, Dept Comp Sci & Engn, Bhubaneswar, Odisha, India
Source
INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | 2022, Vol. 16, No. 1
Keywords
Text classification; feature selection; unbalanced class distribution; Chi-square; decision tree; algorithm
DOI
10.3233/IDT-210057
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Over the last few decades, text classification (TC) has become an important research direction because of the huge number of digital text documents available on the web, which would be tedious for human experts to organize and label manually. In addition, the large number of highly sparse terms and the skewed categories present in the documents make it challenging to label unlabeled documents correctly. Feature selection is therefore an essential step in text classification: it aims to select a more concise and relevant set of features for further mining of the documents. When the documents are associated with multiple categories and the class distribution in the dataset is unbalanced, selecting suitable features becomes harder still. In this paper, a Modified Chi-Square (ModCHI) feature selection technique is proposed for improving the classification of multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting the most features from the classes with a large number of training and testing documents. Unlike Chi, in which the features with the highest Chi values are selected, the proposed technique calculates a score from the number of relevant documents in each class relative to the total number of documents in the original dataset. According to this score, features that are related to the highly relevant classes and that also have a high Chi-square value are selected for further processing. The proposed technique is evaluated with four classifiers, namely Linear SVM (LSVM), Decision Tree (DT), Multilevel KNN (MLKNN), and Random Forest (RF), on the Reuters benchmark dataset, which is multi-labeled, multi-class, and unbalanced.
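The abstract does not give the ModCHI formula, so the following Python sketch is only an illustration under a stated assumption: each term's per-class Chi-square value is scaled by that class's share of the documents, and terms are ranked by their best scaled value. The function names, the `weighted` flag, and the toy documents are hypothetical and are not taken from the paper.

```python
def chi_square(n, a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: docs with the term, in the class      b: docs with the term, not in the class
    c: docs without the term, in the class   d: docs without the term, not in the class
    """
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def rank_terms(docs, k, weighted=True):
    """Return the top-k terms by their best per-class score.

    docs is a list of (set_of_terms, set_of_labels) pairs, so a document
    may carry several labels. With weighted=False the score is plain
    Chi-square. With weighted=True it is Chi-square times the class share
    n_class / n, which is an ASSUMED stand-in for the ModCHI score.
    """
    n = len(docs)
    classes = sorted({c for _, labels in docs for c in labels})
    vocab = sorted({t for terms, _ in docs for t in terms})
    best = {t: 0.0 for t in vocab}
    for cls in classes:
        n_cls = sum(1 for _, labels in docs if cls in labels)
        weight = n_cls / n if weighted else 1.0
        for term in vocab:
            a = sum(1 for terms, labels in docs if term in terms and cls in labels)
            b = sum(1 for terms, labels in docs if term in terms and cls not in labels)
            c = n_cls - a
            d = n - n_cls - b
            best[term] = max(best[term], weight * chi_square(n, a, b, c, d))
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocab, key=lambda t: (-best[t], t))[:k]


# Toy unbalanced corpus: one class with 6 documents, two classes with 2 each.
DOCS = (
    [({"common"}, {"big"})] * 5
    + [({"misc"}, {"big"})]
    + [({"rare"}, {"small1"})] * 2
    + [({"misc"}, {"small2"})] * 2
)

print(rank_terms(DOCS, 3, weighted=False))
print(rank_terms(DOCS, 3, weighted=True))
```

In this toy corpus, plain Chi-square puts `rare` (tied to a 2-document class) first, while the class-share weighting puts `common` (tied to the 6-document class) first, which is the kind of shift toward large classes that the abstract describes.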
The effectiveness of the model is also tested by comparing it with three traditional feature selection techniques: term frequency-inverse document frequency (TF-IDF), Chi-square, and Mutual Information (MI). The experimental results show that LSVM with ModCHI produces the highest precision (0.94), recall (0.80), and F-measure (0.86) and the lowest Hamming loss (0.003) at a feature size of 1000. With LSVM, the proposed technique improves the average precision by 3.33%, 2.19%, and 16.25%, the average recall by 3.03%, 33.33%, and 21.42%, the average F-measure by 4%, 34.48%, and 14.70%, and the average Hamming loss by 14%, 37.68%, and 31.74% compared with TF-IDF, Chi, and MI, respectively. These findings indicate that the proposed feature selection technique performs better than TF-IDF, Chi, and MI on the unbalanced Reuters dataset.
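As a rough illustration of the four reported metrics, the Python sketch below computes precision, recall, F-measure, and Hamming loss for multi-label predictions. Micro-averaging over all document-label pairs is an assumption here, since the abstract does not state which averaging the paper uses, and the function names and toy label sets are hypothetical. The last line checks only that the reported F-measure follows from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def multilabel_scores(y_true, y_pred, n_labels):
    """Micro-averaged precision, recall, F-measure, and Hamming loss.

    y_true and y_pred are parallel lists of label sets, one per document.
    n_labels is the size of the full label space.
    """
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # correctly predicted labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Hamming loss: fraction of document-label slots that are wrong.
    hamming = (fp + fn) / (len(y_true) * n_labels)
    return precision, recall, f_measure(precision, recall), hamming


# Toy example with a 3-label space (illustrative data only).
Y_TRUE = [{"crude"}, {"grain", "trade"}, {"crude", "trade"}, {"grain"}]
Y_PRED = [{"crude"}, {"grain"}, {"crude", "trade"}, {"grain", "trade"}]
print(multilabel_scores(Y_TRUE, Y_PRED, n_labels=3))

# Consistency check on the figures reported in the abstract.
print(round(f_measure(0.94, 0.80), 2))
```

A precision of 0.94 and a recall of 0.80 give an F-measure of about 0.864, which agrees with the reported 0.86.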
Pages: 51-69
Page count: 19
Related papers
50 in total
  • [31] A new feature selection method for text classification
    Uchyigit, Gulden
    Clark, Keith
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21 (02) : 423 - 438
  • [32] Text feature selection method for hierarchical classification
    Zhu, Cui-Ling
    Ma, Jun
    Zhang, Dong-Mei
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2011, 24 (01): : 103 - 110
  • [33] Feature Selection Method of Text Tendency Classification
    Li, Yanling
    Dai, Guanzhong
    Li, Gang
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 34 - +
  • [34] An enhanced feature selection method for text classification
    Kang, Jinbeom
    Lee, Eunshil
    Hong, Kwanghee
    Park, Jeahyun
    Kim, Taehwan
    Park, Juyoung
    Choi, Joongmin
    Yang, Jaeyoung
    PROCEEDINGS OF THE SECOND IASTED INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE, 2006, : 36 - 41
  • [35] A novel feature selection method for text classification using association rules and clustering
    Sheydaei, Navid
    Saraee, Mohamad
    Shahgholian, Azar
    JOURNAL OF INFORMATION SCIENCE, 2015, 41 (01) : 3 - 15
  • [36] Feature selection improves text classification accuracy
    Unknown author
    IEEE INTELLIGENT SYSTEMS, 2005, 20 (06) : 75 - 75
  • [37] A new approach to feature selection in text classification
    Wang, Y
    Wang, XJ
    PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3814 - 3819
  • [38] Higher order feature selection for text classification
    Jan Bakus
    Mohamed S. Kamel
    Knowledge and Information Systems, 2006, 9 : 468 - 491
  • [39] Composite Feature Extraction and Selection for Text Classification
    Wan, Chuan
    Wang, Yuling
    Liu, Yaoze
    Ji, Jinchao
    Feng, Guozhong
    IEEE ACCESS, 2019, 7 : 35208 - 35219
  • [40] Higher order feature selection for text classification
    Bakus, J
    Kamel, MS
    KNOWLEDGE AND INFORMATION SYSTEMS, 2006, 9 (04) : 468 - 491