Online AUC Optimization for Sparse High-Dimensional Datasets

Cited by: 5
Authors
Zhou, Baojian [1]
Ying, Yiming [2]
Skiena, Steven [1]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Albany, Dept Math & Stat, Albany, NY 12222 USA
Keywords
online learning; Follow-The-Regularized-Leader; sparsity; AUC optimization; AREA
DOI
10.1109/ICDM50108.2020.00097
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such cases, each d-dimensional sample has only k non-zero features with k << d, and data arrives sequentially as a stream. Current online AUC optimization algorithms have a high per-iteration cost of O(d) and usually produce non-sparse solutions, and hence are not suitable for this setting. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. The proposed algorithm processes data in an online fashion with a much cheaper per-iteration cost of O(k), making it amenable to high-dimensional sparse streaming data analysis. The algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle point problem, and on the introduction of a "lazy update" rule that reduces the per-iteration complexity from O(d) to O(k). Furthermore, FTRL-AUC inherently captures sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. A comparison with an online learning method for logistic loss demonstrates that FTRL-AUC achieves higher AUC scores, especially when datasets are imbalanced.
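To make the O(k) "lazy update" idea and the role of FTRL in producing sparse models more concrete, here is a minimal Python sketch. It is not the FTRL-AUC algorithm from the paper: it uses the generic per-coordinate FTRL-Proximal update with logistic loss (the kind of logistic-loss online learner the abstract compares against), and the class name, hyperparameters, and toy stream are all assumptions made for illustration. The pairwise AUC helper evaluates the U-statistic form of AUC directly, which is only practical for small samples.

```python
import math


class SparseFTRLProximal:
    """Generic FTRL-Proximal with L1/L2 regularization and logistic loss.

    Illustrative sketch only, NOT the paper's FTRL-AUC. All state is kept in
    dicts keyed by feature index, so one step on a sample with k non-zero
    features costs O(k), independent of the ambient dimension d.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated (shifted) gradients
        self.n = {}  # per-feature accumulated squared gradients

    def weight(self, i):
        """Lazily derive weight i from (z, n); exactly 0 when |z_i| <= l1."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / denom

    def score(self, x):
        """Linear score of a sparse sample x given as {index: value}."""
        return sum(self.weight(i) * v for i, v in x.items())

    def update(self, x, y):
        """One online step on sparse sample x with label y in {0, 1}."""
        w = {i: self.weight(i) for i in x}  # only the k active weights
        margin = sum(w[i] * v for i, v in x.items())
        margin = max(-35.0, min(35.0, margin))  # avoid overflow in exp
        p = 1.0 / (1.0 + math.exp(-margin))
        for i, v in x.items():
            g = (p - y) * v  # gradient of the logistic loss w.r.t. w_i
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * w[i]
            self.n[i] = n_new


def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced stream: 1 positive per 10 samples. Feature indices
# are arbitrary; the conceptual dimension d could be in the millions.
stream = []
for t in range(40):
    if t % 10 == 0:
        stream.append(({3: 1.0}, 1))
    elif t == 5:
        stream.append(({7: 1.0, 500000: 1.0}, 0))  # rare feature, seen once
    else:
        stream.append(({7: 1.0}, 0))

model = SparseFTRLProximal()
for x, y in stream:
    model.update(x, y)

auc = pairwise_auc([model.score(x) for x, _ in stream], [y for _, y in stream])
```

Only the features that actually appear in the stream ever get state, and the feature seen once keeps a weight of exactly zero because its accumulated gradient never exceeds the L1 threshold. That thresholding is the mechanism by which FTRL-style methods yield sparse solutions.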
Pages: 881-890
Page count: 10