A GEV-Based Classification Algorithm for Imbalanced Data

Authors
Fu J. [1]
Liu G. [1]
Affiliation
[1] School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai
Funding
National Natural Science Foundation of China
Keywords
Classification; Extreme value distribution; Imbalanced data; Linear model; Probability estimation;
DOI
10.7544/issn1000-1239.2018.20170514
Abstract
Binary classification with imbalanced data arises in many fields and is not yet fully solved. Beyond predicting the class label directly, many applications also need the probability that a sample belongs to a given class. However, most existing research focuses on classification performance and neglects probability estimation. This paper aims to improve class probability estimation (CPE) while preserving classification performance. A new regression approach is proposed that adopts the generalized linear model as the basic framework and uses the calibration loss function as the optimization objective. Because the generalized extreme value (GEV) distribution is asymmetric and flexible, it is used to formulate the link function, which benefits binary classification with imbalanced data. For model estimation, because the shape parameter has a significant influence on modeling precision, two methods for estimating the shape parameter of the GEV distribution are proposed. Experiments on synthetic datasets confirm the accuracy of the shape parameter estimation. Experimental results on real data further suggest that the proposed approach, compared with three other commonly used regression algorithms, performs well on both classification and CPE. The proposed algorithm also outperforms other optimization algorithms in computational efficiency. © 2018, Science Press. All rights reserved.
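The abstract does not give the exact form of the GEV link, so the following is only a minimal sketch of the general idea, assuming the standard GEV cumulative distribution function (location 0, scale 1) is used as the inverse link that maps a linear predictor to a class probability. The function names `gev_cdf` and `logistic` and the example shape value `xi = 0.3` are illustrative assumptions, not taken from the paper. The sketch contrasts the asymmetric GEV response curve with the symmetric logistic curve.

```python
import math


def gev_cdf(x: float, xi: float) -> float:
    """CDF of the standard GEV distribution (location 0, scale 1), shape xi.

    Used here as a hypothetical inverse link: p = gev_cdf(eta, xi),
    where eta is the linear predictor of a generalized linear model.
    """
    if abs(xi) < 1e-12:
        # The limit xi -> 0 is the Gumbel distribution.
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        # Outside the support: 0 below the lower bound (xi > 0),
        # 1 above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-(t ** (-1.0 / xi)))


def logistic(eta: float) -> float:
    """Symmetric logistic inverse link, shown for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    xi = 0.3  # assumed shape value, chosen only for illustration
    for eta in (-1.5, 0.0, 1.5):
        print(f"eta={eta:+.1f}  logistic={logistic(eta):.4f}  "
              f"gev={gev_cdf(eta, xi):.4f}")
```

The logistic curve satisfies p(eta) + p(-eta) = 1, so it approaches 0 and 1 at the same rate. The GEV curve does not, and its skew is controlled by the shape parameter, which is consistent with the abstract's emphasis on estimating that parameter accurately.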
Pages: 2361-2371
Page count: 10