Online AUC Optimization for Sparse High-Dimensional Datasets

Cited by: 5

Authors:

Zhou, Baojian [1]

Ying, Yiming [2]

Skiena, Steven [1]

Affiliations:

[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA

[2] SUNY Albany, Dept Math & Stat, Albany, NY 12222 USA

Source:

20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020) | 2020

Keywords:

online learning; Follow-The-Regularized-Leader; sparsity; AUC optimization; AREA;

DOI:

10.1109/ICDM50108.2020.00097

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such cases, each d-dimensional sample has only k non-zero features with k << d, and data arrives sequentially as a stream. Current online AUC optimization algorithms have a high per-iteration cost of O(d) and usually produce non-sparse solutions, and hence are not suitable for this setting. In this paper, we aim to directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. The proposed algorithm processes data in an online fashion with a much cheaper per-iteration cost of O(k), making it amenable to high-dimensional sparse streaming data analysis. The algorithmic design depends critically on a novel reformulation of the U-statistics AUC objective function as an empirical saddle point problem, and on a "lazy update" rule that reduces the per-iteration complexity from O(d) to O(k). Furthermore, FTRL-AUC inherently captures sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. A comparison with an online learning method for logistic loss demonstrates that FTRL-AUC achieves higher AUC scores, especially when datasets are imbalanced.
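The two ideas the abstract leans on, the pairwise (U-statistic) form of AUC and an FTRL update that touches only the k non-zero coordinates of each sample, can be sketched in Python. This is a minimal illustration under stated assumptions, not the FTRL-AUC algorithm itself: the class `SparseFTRL`, the constant step size `eta`, the L1 weight `l1`, and the use of logistic loss are all hypothetical choices made for the example, and the saddle point reformulation from the paper is not reproduced here.

```python
import math
from collections import defaultdict


def auc_u_statistic(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are assumed to be 0 or 1.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


class SparseFTRL:
    """Illustrative FTRL learner with L1 plus a fixed quadratic regularizer.

    Assumption: logistic loss and a constant step size; this is NOT the
    update rule of FTRL-AUC, only a generic FTRL-style sketch.
    """

    def __init__(self, eta=0.5, l1=0.1):
        self.eta = eta
        self.l1 = l1
        # Accumulated gradients; only coordinates seen so far are stored.
        self.z = defaultdict(float)

    def weight(self, i):
        # Lazy closed form of argmin_w  z_i*w + l1*|w| + w^2 / (2*eta).
        # Computed on demand, so untouched coordinates cost nothing.
        return -self.eta * soft_threshold(self.z.get(i, 0.0), self.l1)

    def predict(self, x):
        # x is a dict {feature_index: value} holding the k non-zero entries.
        score = sum(self.weight(i) * v for i, v in x.items())
        score = max(-30.0, min(30.0, score))  # guard against overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        # Logistic-loss gradient is (p - y) * x, non-zero only on x's support,
        # so one step costs O(k) rather than O(d).
        p = self.predict(x)
        for i, v in x.items():
            self.z[i] += (p - y) * v
        return p
```

In this sketch a sample is passed as a dictionary of its non-zero features, for example `model.update({3: 1.0, 17: 0.5}, 1)`, and any coordinate whose accumulated gradient stays within `l1` in magnitude keeps a weight of exactly zero, which is where the model sparsity comes from.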


Pages: 881-890

Page count: 10

