Online AUC Optimization for Sparse High-Dimensional Datasets

Cited by: 5
Authors
Zhou, Baojian [1]
Ying, Yiming [2]
Skiena, Steven [1]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Albany, Dept Math & Stat, Albany, NY 12222 USA
Keywords
online learning; Follow-The-Regularized-Leader; sparsity; AUC optimization; AREA
DOI
10.1109/ICDM50108.2020.00097
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such cases, each d-dimensional sample has only k non-zero features with k << d, and data arrives sequentially as a stream. Current online AUC optimization algorithms have a high per-iteration cost of O(d) and usually produce non-sparse solutions, and hence are not suitable for this setting. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. The proposed algorithm processes data in an online fashion with a much cheaper per-iteration cost of O(k), making it amenable to high-dimensional sparse streaming data analysis. The algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle-point problem, and on the introduction of a "lazy update" rule that reduces the per-iteration complexity from O(d) to O(k). Furthermore, FTRL-AUC inherently captures sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. A comparison with an online learning method for the logistic loss demonstrates that FTRL-AUC achieves higher AUC scores, especially when datasets are imbalanced.
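To make the O(k) "lazy update" idea concrete, the following is a minimal sketch of the standard per-coordinate FTRL-Proximal update for the logistic loss on sparse samples. It is not the FTRL-AUC algorithm of the paper, whose saddle-point objective is not reproduced in this record; it only illustrates how an FTRL-style learner can touch just the k non-zero coordinates of each sample and keep untouched weights at exactly zero. The class name `SparseFTRLProximal`, the dict-based sample format, and the hyperparameter defaults are assumptions made for illustration.

```python
import math
from collections import defaultdict


class SparseFTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic loss (illustrative only).

    Samples are dicts {feature_index: value} holding the k non-zero
    features, so each step costs O(k) rather than O(d).
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated (shifted) gradients
        self.n = defaultdict(float)  # accumulated squared gradients

    def weight(self, i):
        # Weights are never stored; they are derived lazily from z and n.
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 thresholding gives exact zeros (sparsity)
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def _score(self, x, w=None):
        s = sum((w[i] if w else self.weight(i)) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clip to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))

    def predict(self, x):
        return self._score(x)

    def update(self, x, y):
        """One online step on sample x with label y in {0, 1}."""
        w = {i: self.weight(i) for i in x}  # only the active coordinates
        p = self._score(x, w)
        for i, v in x.items():
            g = (p - y) * v  # logistic-loss gradient for coordinate i
            n_old = self.n[i]
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] = n_old + g * g
        return p
```

In this sketch the state grows only with the features actually observed, so a sample with two non-zero entries in a space of millions of dimensions updates exactly two coordinates.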
Pages: 881-890
Number of pages: 10