A novel feature selection technique for enhancing performance of unbalanced text classification problem

Cited by: 0
Authors
Behera, Santosh Kumar [1 ]
Dash, Rajashree [1 ]
Affiliations
[1] Siksha O Anusandhan Deemed Univ, Dept Comp Sci & Engn, Bhubaneswar, Odisha, India
Source
INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | 2022, Vol. 16, No. 01
Keywords
Text classification; feature selection; unbalanced class distribution; Chi-square; DECISION TREE; ALGORITHM;
DOI
10.3233/IDT-210057
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Over the last few decades, Text Classification (TC) has emerged as an important research direction, driven by the availability of huge amounts of digital text documents on the web. Organizing and labeling these documents manually would be tedious for human experts, and the large number of highly sparse terms and the skewed categories present in the documents pose further challenges to correctly labeling unlabeled documents. Feature selection is therefore an essential aspect of text classification: it aims to select more concise and relevant features for further mining of the documents. Moreover, when the documents are associated with multiple categories and the class distribution of the dataset is unbalanced, suitable feature selection becomes even more challenging. In this paper, a Modified Chi-Square (ModCHI) based feature selection technique is proposed to enhance the classification of multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting more features from the classes with a large number of training and testing documents. Unlike Chi, which simply selects the features with the top Chi values, the proposed technique computes a score from the number of relevant documents in each class relative to the total number of documents in the original dataset; according to this score, features that belong to highly relevant classes and also have high Chi-square values are selected for further processing. The proposed technique is verified with four different classifiers, Linear SVM (LSVM), Decision Tree (DT), Multi-Label KNN (MLKNN), and Random Forest (RF), on the Reuters benchmark multi-labeled, multi-class, unbalanced dataset.
The effectiveness of the model is also tested by comparing it with three traditional feature selection techniques: term frequency-inverse document frequency (TF-IDF), Chi-square (Chi), and Mutual Information (MI). The experimental outcomes clearly show that LSVM with ModCHI produces the highest precision (0.94), recall (0.80), and F-measure (0.86), and the lowest Hamming loss (0.003), with a feature size of 1000. Compared to TF-IDF, Chi, and MI respectively, the proposed technique with LSVM improves the average precision by 3.33%, 2.19%, and 16.25%, the average recall by 3.03%, 33.33%, and 21.42%, the average F-measure by 4%, 34.48%, and 14.70%, and the average Hamming loss by 14%, 37.68%, and 31.74%. These findings clearly indicate the better performance of the proposed feature selection technique compared to TF-IDF, Chi, and MI on the unbalanced Reuters dataset.
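The abstract describes the technique only at a high level; the sketch below illustrates the general idea of class-proportional Chi-square selection. The exact ModCHI score formula is not given in the abstract, so the proportional feature budgeting and the one-vs-rest Chi-square statistic (via scikit-learn's `chi2`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import chi2

def class_proportional_chi2_select(X, y, k):
    """Sketch of ModCHI-style selection: split a budget of k features
    across classes in proportion to each class's document count, then
    rank features within each class by a one-vs-rest Chi-square score."""
    classes, counts = np.unique(y, return_counts=True)
    # Class "score": share of documents belonging to each class (assumed).
    budget = np.maximum(1, np.floor(k * counts / counts.sum())).astype(int)
    selected = []
    for c, b in zip(classes, budget):
        # Per-class Chi-square via a one-vs-rest binary target.
        chi_c, _ = chi2(X, (y == c).astype(int))
        chi_c = np.nan_to_num(chi_c)  # all-zero features yield NaN scores
        added = 0
        for idx in np.argsort(chi_c)[::-1]:
            if added >= b:
                break
            if int(idx) not in selected:  # skip features another class took
                selected.append(int(idx))
                added += 1
    return sorted(selected)

# Toy demo on synthetic term-count data: 5 unbalanced classes, 200 terms.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), [50, 20, 10, 10, 5])
X = rng.integers(0, 4, size=(len(y), 200))
sel = class_proportional_chi2_select(X, y, k=30)
```

Because the budget is proportional to class size, the majority classes contribute most of the selected features, which matches the abstract's stated emphasis on classes with many training and testing documents.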
Pages: 51-69
Page count: 19