A novel feature selection technique for enhancing performance of unbalanced text classification problem

Cited by: 0
Authors
Behera, Santosh Kumar [1 ]
Dash, Rajashree [1 ]
Affiliations
[1] Siksha O Anusandhan Deemed Univ, Dept Comp Sci & Engn, Bhubaneswar, Odisha, India
Source
INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | 2022, Vol. 16, No. 01
Keywords
Text classification; feature selection; unbalanced class distribution; Chi-square; DECISION TREE; ALGORITHM;
DOI
10.3233/IDT-210057
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Over the last few decades, Text Classification (TC) has emerged as an important research direction, driven by the availability of huge amounts of digital text documents on the web. Organizing and labeling these documents manually would be tedious for human experts, and the large number of highly sparse terms and the skewed categories present in the documents pose further challenges to correctly labeling unlabeled documents. Feature selection is therefore an essential aspect of text classification: it aims to select more concise and relevant features for further mining of the documents. Moreover, when the documents are associated with multiple categories and the class distribution of the dataset is unbalanced, suitable feature selection becomes even more challenging. In this paper, a Modified Chi-Square (ModCHI) based feature selection technique is proposed to enhance the classification of multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting more features from the classes with a large number of training and testing documents. Unlike Chi, which simply selects the features with the top Chi values, the proposed technique computes a score from the number of relevant documents in each class relative to the total number of documents in the original dataset; according to this score, features that belong to highly relevant classes and also have high Chi-square values are selected for further processing. The proposed technique is verified with four different classifiers, Linear SVM (LSVM), Decision Tree (DT), Multi-Label KNN (MLKNN), and Random Forest (RF), on the Reuters benchmark multi-labeled, multi-class, unbalanced dataset.
The effectiveness of the model is also tested by comparing it with three traditional feature selection techniques: term frequency-inverse document frequency (TF-IDF), Chi-square (Chi), and Mutual Information (MI). The experimental outcomes clearly show that LSVM with ModCHI produces the highest precision (0.94), recall (0.80), and F-measure (0.86), and the lowest Hamming loss (0.003), with a feature size of 1000. Compared to TF-IDF, Chi, and MI respectively, the proposed technique with LSVM improves the average precision by 3.33%, 2.19%, and 16.25%, the average recall by 3.03%, 33.33%, and 21.42%, the average F-measure by 4%, 34.48%, and 14.70%, and the average Hamming loss by 14%, 37.68%, and 31.74%. These findings clearly indicate the better performance of the proposed feature selection technique compared to TF-IDF, Chi, and MI on the unbalanced Reuters dataset.
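The abstract describes the technique only at a high level; the sketch below illustrates the general idea of class-proportional Chi-square selection. The exact ModCHI score formula is not given in the abstract, so the proportional feature budgeting and the one-vs-rest Chi-square statistic (via scikit-learn's `chi2`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.feature_selection import chi2

def class_proportional_chi2_select(X, y, k):
    """Sketch of ModCHI-style selection: split a budget of k features
    across classes in proportion to each class's document count, then
    rank features within each class by a one-vs-rest Chi-square score."""
    classes, counts = np.unique(y, return_counts=True)
    # Class "score": share of documents belonging to each class (assumed).
    budget = np.maximum(1, np.floor(k * counts / counts.sum())).astype(int)
    selected = []
    for c, b in zip(classes, budget):
        # Per-class Chi-square via a one-vs-rest binary target.
        chi_c, _ = chi2(X, (y == c).astype(int))
        chi_c = np.nan_to_num(chi_c)  # all-zero features yield NaN scores
        added = 0
        for idx in np.argsort(chi_c)[::-1]:
            if added >= b:
                break
            if int(idx) not in selected:  # skip features another class took
                selected.append(int(idx))
                added += 1
    return sorted(selected)

# Toy demo on synthetic term-count data: 5 unbalanced classes, 200 terms.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(5), [50, 20, 10, 10, 5])
X = rng.integers(0, 4, size=(len(y), 200))
sel = class_proportional_chi2_select(X, y, k=30)
```

Because the budget is proportional to class size, the majority classes contribute most of the selected features, which matches the abstract's stated emphasis on classes with many training and testing documents.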
Pages: 51-69
Page count: 19