Online AUC Optimization for Sparse High-Dimensional Datasets

Cited by: 5
Authors
Zhou, Baojian [1 ]
Ying, Yiming [2 ]
Skiena, Steven [1 ]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Albany, Dept Math & Stat, Albany, NY 12222 USA
Keywords
online learning; Follow-The-Regularized-Leader; sparsity; AUC optimization; AREA;
DOI
10.1109/ICDM50108.2020.00097
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such settings, each d-dimensional sample has only k non-zero features with k << d, and data arrives sequentially as a stream. Existing online AUC optimization algorithms have a high per-iteration cost of O(d) and generally produce non-sparse solutions, making them ill-suited to this data challenge. In this paper, we directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. The proposed algorithm processes data in an online fashion at a much cheaper per-iteration cost of O(k), making it amenable to high-dimensional sparse streaming data analysis. Our algorithmic design critically depends on a novel reformulation of the U-statistics AUC objective as an empirical saddle-point problem, and on a "lazy update" rule that reduces the per-iteration complexity from O(d) to O(k). Furthermore, FTRL-AUC captures sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. A comparison with online learning under the logistic loss shows that FTRL-AUC achieves higher AUC scores, especially on imbalanced datasets.
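To make the O(k)-per-iteration idea concrete, the following is a minimal, generic sketch of lazy per-coordinate FTRL-Proximal updates with L1 regularization. It is NOT the paper's FTRL-AUC algorithm (which optimizes the saddle-point AUC objective described above); the class name, hyperparameters, and update schedule are illustrative assumptions. It only shows how keeping per-coordinate state and touching just the k active features of a sparse sample keeps each step O(k) while L1 thresholding yields sparse weights.

```python
import math

class SparseFTRL:
    """Illustrative FTRL-Proximal learner with lazy, per-coordinate state.

    Samples are sparse dicts {feature_index: value}; each predict/update
    touches only the k non-zero coordinates, never all d of them.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated adjusted gradients
        self.n = {}  # per-coordinate sum of squared gradients

    def weight(self, i):
        """Closed-form weight; exactly zero when |z_i| <= l1 (sparsity)."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / denom

    def predict(self, x):
        """Score a sparse sample x = {index: value}; cost O(k)."""
        return sum(self.weight(i) * v for i, v in x.items())

    def update(self, x, grad_scale):
        """Apply the gradient grad_scale * x, touching only active coords."""
        for i, v in x.items():
            g = grad_scale * v
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_new
```

Because state is stored in dictionaries keyed by feature index, coordinates that never appear in any sample consume no memory and are never visited, which is the essence of the "lazy update" strategy the abstract refers to.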
Pages: 881-890 (10 pages)
Related Papers
50 records in total
  • [21] High-Dimensional Sparse Linear Bandits
    Hao, Botao
    Lattimore, Tor
    Wang, Mengdi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [22] Sparse High-Dimensional Models in Economics
    Fan, Jianqing
    Lv, Jinchi
    Qi, Lei
    ANNUAL REVIEW OF ECONOMICS, VOL 3, 2011, 3 : 291 - 317
  • [23] High-Dimensional Indexing by Sparse Approximation
    Borges, Pedro
    Mourao, Andre
    Magalhaes, Joao
    ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2015, : 163 - 170
  • [24] Classification with High-Dimensional Sparse Samples
    Huang, Dayu
    Meyn, Sean
    2012 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY PROCEEDINGS (ISIT), 2012,
  • [25] Interpolation of sparse high-dimensional data
    Thomas C. H. Lux
    Layne T. Watson
    Tyler H. Chang
    Yili Hong
    Kirk Cameron
    Numerical Algorithms, 2021, 88 : 281 - 313
  • [26] High-dimensional sparse Fourier algorithms
    Choi, Bosu
    Christlieb, Andrew
    Wang, Yang
    NUMERICAL ALGORITHMS, 2021, 87 (01) : 161 - 186
  • [28] Pattern discovery for high-dimensional binary datasets
    Snasel, Vaclav
    Moravec, Pavel
    Husek, Dusan
    Frolov, Alexander
    Rezankova, Hana
    Polyakov, Pavel
    NEURAL INFORMATION PROCESSING, PART I, 2008, 4984 : 861 - +
  • [29] Visual terrain analysis of high-dimensional datasets
    Li, W
    Ong, KL
    Ng, WK
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 593 - 600
  • [30] Quantifying and comparing features in high-dimensional datasets
    Piringer, Harald
    Berger, Wolfgang
    Hauser, Helwig
    PROCEEDINGS OF THE 12TH INTERNATIONAL INFORMATION VISUALISATION, 2008, : 240 - 245