A novel feature selection technique for enhancing performance of unbalanced text classification problem

Cited by: 0

Authors:

Behera, Santosh Kumar [1]

Dash, Rajashree [1]

Affiliations:

[1] Siksha O Anusandhan Deemed Univ, Dept Comp Sci & Engn, Bhubaneswar, Odisha, India

Source:

INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | 2022 / Vol. 16 / Issue 01

Keywords:

Text classification; feature selection; unbalanced class distribution; Chi-square; DECISION TREE; ALGORITHM;

DOI:

10.3233/IDT-210057

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Over the last few decades, text classification (TC) has become an important research direction owing to the huge volume of digital text documents available on the web. Organizing and labeling these documents manually by human experts would be tedious. Moreover, the large number of highly sparse terms and skewed categories in the documents makes it challenging to label unlabeled documents correctly. Feature selection is therefore essential in text classification; it aims to select more concise and relevant features for further mining of the documents. When the texts in the document set are associated with multiple categories and the class distribution is unbalanced, selecting suitable features becomes even harder. In this paper, a Modified Chi-Square (ModCHI) based feature selection technique is proposed to improve the classification of multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting the most features from classes with a large number of training and testing documents. Unlike Chi, which selects the features with the top Chi values, the proposed technique computes a score from the number of relevant documents in each class relative to the total number of documents in the original dataset. Based on this score, features that relate to highly relevant classes and also have high Chi-square values are selected for further processing. The proposed technique is verified with four classifiers, namely Linear SVM (LSVM), Decision Tree (DT), Multilevel KNN (MLKNN), and Random Forest (RF), on the Reuters benchmark dataset, which is multi-labeled, multi-class, and unbalanced. Its effectiveness is also tested by comparing it with four other traditional feature selection techniques, such as term frequency-inverse document frequency (TF-IDF), Chi-square, and Mutual Information (MI). The experimental results show that LSVM with ModCHI produces the highest precision (0.94), recall (0.80), and F-measure (0.86) and the lowest Hamming loss (0.003) with a feature size of 1000. The proposed feature selection technique with LSVM improves average precision by 3.33%, 2.19%, and 16.25%; average recall by 3.03%, 33.33%, and 21.42%; average F-measure by 4%, 34.48%, and 14.70%; and average Hamming loss by 14%, 37.68%, and 31.74% compared with the TF-IDF, Chi, and MI techniques, respectively. These findings indicate that the proposed feature selection technique outperforms TF-IDF, Chi, and MI on the unbalanced Reuters dataset.
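The abstract does not give the ModCHI formula, so the following is only a minimal sketch of the general idea it describes: weighting each class's Chi-square value by that class's share of the documents. The function names `chi2_term_class` and `select_features`, and the choice of taking the maximum weighted value over classes, are assumptions made for illustration and may differ from the authors' actual method.

```python
import numpy as np

def chi2_term_class(X, Y):
    """Chi-square statistic for every (term, class) pair.

    X: (n_docs, n_terms) binary term-presence matrix.
    Y: (n_docs, n_classes) binary multi-label indicator matrix.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    a = X.T @ Y                      # term present, doc in class
    b = X.sum(axis=0)[:, None] - a   # term present, doc not in class
    c = Y.sum(axis=0)[None, :] - a   # term absent, doc in class
    d = n - a - b - c                # term absent, doc not in class
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term or the class is constant; score it 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def select_features(X, Y, k):
    """Return indices of the k terms with the highest weighted score.

    ASSUMPTION (not from the paper): score(t) = max over classes c of
    (docs in c / total docs) * chi2(t, c).
    """
    Y = np.asarray(Y, dtype=float)
    chi = chi2_term_class(X, Y)
    class_share = Y.sum(axis=0) / Y.shape[0]
    score = (chi * class_share[None, :]).max(axis=1)
    return np.argsort(-score, kind="stable")[:k]

# Toy data: class 0 has 4 documents, classes 1 and 2 have 2 each.
# Term 0 marks class 0, term 1 marks class 1, term 2 occurs everywhere.
X = [[1, 0, 1]] * 4 + [[0, 1, 1]] * 2 + [[0, 0, 1]] * 2
Y = [[1, 0, 0]] * 4 + [[0, 1, 0]] * 2 + [[0, 0, 1]] * 2
selected = select_features(X, Y, k=2)
```

In this toy data, terms 0 and 1 have the same plain Chi-square value for their own class, so plain Chi cannot separate them; the class-share weighting ranks the term tied to the larger class first, which is the behavior the abstract attributes to ModCHI.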

Citation

Pages: 51-69

Page count: 19

50 items in total

[31] A new feature selection method for text classification
Uchyigit, Gulden
Clark, Keith
INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2007, 21 (02) : 423 - 438
[32] Text feature selection method for hierarchical classification
Zhu, Cui-Ling
Ma, Jun
Zhang, Dong-Mei
Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2011, 24 (01) : 103 - 110
[33] Feature Selection Method of Text Tendency Classification
Li, Yanling
Dai, Guanzhong
Li, Gang
FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 34 - +
[34] An enhanced feature selection method for text classification
Kang, Jinbeom
Lee, Eunshil
Hong, Kwanghee
Park, Jeahyun
Kim, Taehwan
Park, Juyoung
Choi, Joongmin
Yang, Jaeyoung
PROCEEDINGS OF THE SECOND IASTED INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE, 2006, : 36 - 41
[35] A novel feature selection method for text classification using association rules and clustering
Sheydaei, Navid
Saraee, Mohamad
Shahgholian, Azar
JOURNAL OF INFORMATION SCIENCE, 2015, 41 (01) : 3 - 15
[36] Feature selection improves text classification accuracy
Unknown
IEEE INTELLIGENT SYSTEMS, 2005, 20 (06) : 75 - 75
[37] A new approach to feature selection in text classification
Wang, Y
Wang, XJ
PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9, 2005, : 3814 - 3819
[38] Higher order feature selection for text classification
Jan Bakus
Mohamed S. Kamel
Knowledge and Information Systems, 2006, 9 : 468 - 491
[39] Composite Feature Extraction and Selection for Text Classification
Wan, Chuan
Wang, Yuling
Liu, Yaoze
Ji, Jinchao
Feng, Guozhong
IEEE ACCESS, 2019, 7 : 35208 - 35219
[40] Higher order feature selection for text classification
Bakus, J
Kamel, MS
KNOWLEDGE AND INFORMATION SYSTEMS, 2006, 9 (04) : 468 - 491
