Entropy and improved k-nearest neighbor search based under-sampling (ENU) method to handle class overlap in imbalanced datasets

Authors
Kumar, Anil [1 ]
Singh, Dinesh [1 ]
Yadav, Rama Shankar [1 ]
Affiliations
[1] Motilal Nehru Natl Inst Technol Allahabad, Dept Comp Sci & Engn, Prayagraj, Uttar Pradesh, India
Keywords
imbalance; information entropy; information loss; local density; overlap; undersampling
DOI
Not available
Chinese Library Classification (CLC) number
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
Many real-world datasets, in domains such as medical diagnostics, fraud detection, biological classification, and risk analysis, suffer from class imbalance and class overlap. These problems seriously impair classifier training, because minority instances in the overlapped region are effectively invisible to the learner and its performance becomes biased towards the majority class. Undersampling-based methods are the most commonly used techniques for handling these problems. Their major drawback is excessive elimination and information loss, that is, they fail to retain potentially informative majority instances. Most existing methods of this kind improve sensitivity significantly but not many other performance measures. We propose a novel entropy and neighborhood-based undersampling method (ENU) that removes from the overlapped region only those majority instances whose informativeness (entropy) score is below a threshold entropy. ENU first computes the entropy scores of the majority instances and the threshold score, and then uses a local density-based improved KNN search to identify the overlapped majority instances. ENU defines four improved KNN-based undersampling procedures (ENUB, ENUT, ENUC, and ENUR). Compared with existing state-of-the-art methods, ENU achieved better average rankings in sensitivity, G-mean, and F1-score, with reduced information loss.
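The general idea in the abstract (score majority instances by entropy, find the ones in the overlapped region with a KNN search, and remove only those scoring below a threshold) can be illustrated with a minimal sketch. This is not the authors' ENU algorithm: the abstract does not give the entropy formula, the threshold rule, or the local density-based KNN variant, so the sketch assumes Shannon entropy of the neighbours' class labels, a plain Euclidean KNN, an "overlapped" test of "at least one minority neighbour", and a threshold equal to the mean entropy of the overlapped majority instances. The function names are hypothetical.

```python
import math
from collections import Counter

def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (Euclidean, excluding i)."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]

def label_entropy(labels):
    """Shannon entropy (in bits) of the class labels in a neighbourhood."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_undersample(X, y, majority=0, k=3):
    """Return the indices kept after dropping low-entropy overlapped majority points."""
    scores = {}
    for i, label in enumerate(y):
        if label != majority:
            continue  # minority instances are never removed
        neigh = [y[j] for j in knn_indices(X, i, k)]
        # Assumption: a majority point is "overlapped" if any neighbour is minority.
        if any(lab != majority for lab in neigh):
            scores[i] = label_entropy(neigh)
    if not scores:
        return list(range(len(y)))
    # Assumption: threshold = mean entropy of the overlapped majority points.
    threshold = sum(scores.values()) / len(scores)
    drop = {i for i, s in scores.items() if s < threshold}
    return [i for i in range(len(y)) if i not in drop]

# Toy 1-D data: four majority points far from the minority class, one on the
# border (9.0), and one buried inside the minority cluster (11.4).
X = [(0.0,), (1.0,), (2.1,), (3.3,), (9.0,), (11.4,), (10.0,), (11.0,), (12.1,)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
print(entropy_undersample(X, y, majority=0, k=3))  # → [0, 1, 2, 3, 4, 6, 7, 8]
```

In this toy run, only the majority point at 11.4 is removed: all of its neighbours are minority, so its entropy is 0 and falls below the threshold, while the border point at 9.0 has a mixed neighbourhood and is retained. Majority points outside the overlapped region are left untouched, which is the sense in which such a scheme limits information loss.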
Pages: 36
Related papers
20 in total
  • [1] Entropy and improved k-nearest neighbor search based under-sampling (ENU) method to handle class overlap in imbalanced datasets
    Kumar, Anil
    Singh, Dinesh
    Yadav, Rama Shankar
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2024, 36 (02):
  • [2] Weighted k-nearest neighbor based data complexity metrics for imbalanced datasets
    Singh, Deepika
    Gosain, Anjana
    Saha, Anju
    STATISTICAL ANALYSIS AND DATA MINING, 2020, 13 (04) : 394 - 404
  • [3] Entropy-based hybrid sampling (EHS) method to handle class overlap in highly imbalanced dataset
    Kumar, Anil
    Singh, Dinesh
    Yadav, Rama Shankar
    EXPERT SYSTEMS, 2024, 41 (11)
  • [4] A Novel Evolutionary Preprocessing Method Based on Over-sampling and Under-sampling for Imbalanced Datasets
    Wong, Ginny Y.
    Leung, Frank H. F.
    Ling, Sai-Ho
    39TH ANNUAL CONFERENCE OF THE IEEE INDUSTRIAL ELECTRONICS SOCIETY (IECON 2013), 2013, : 2354 - 2359
  • [5] Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization
    Chen, Yiheng
    Zou, Jinbai
    Liu, Lihai
    Hu, Chuanbo
    SYMMETRY-BASEL, 2024, 16 (03):
  • [6] A quality control method based on k-nearest neighbor algorithm for missing and problematic datasets
    Lu, Youmin
    Journal of Computational and Theoretical Nanoscience, 2015, 12 (11) : 4263 - 4266
  • [7] A Novel Query Method for Spatial Database Based on Improved K-Nearest Neighbor Algorithm
    Xia, Huili
    Xue, Feng
    INTERNATIONAL JOURNAL OF DECISION SUPPORT SYSTEM TECHNOLOGY, 2023, 16 (01)
  • [8] A K-means Clustering Based Under-Sampling Method for Imbalanced Dataset Classification
    Huang, Chih-Ming
    Hung, Chuan-Sheng
    Hsu, Yao-Yuan
    Zheng, You-Cheng
    Yu, Cheng-Han
    Lin, Chun-Hung Richard
    Chen, Shi-Huang
    38TH INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING, ICOIN 2024, 2024, : 708 - 713
  • [9] DBOS_US: a density-based graph under-sampling method to handle class imbalance and class overlap issues in software fault prediction
    Bhandari, Kirti
    Kumar, Kuldeep
    Sangal, Amrit Lal
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (15) : 22682 - 22725
  • [10] Prediction of carbamylated lysine sites based on the one-class k-nearest neighbor method
    Huang, Guohua
    Zhou, You
    Zhang, Yuchao
    Li, Bi-Qing
    Zhang, Ning
    Cai, Yu-Dong
    MOLECULAR BIOSYSTEMS, 2013, 9 (11) : 2729 - 2740