Online AUC Optimization for Sparse High-Dimensional Datasets

Cited by: 5
Authors
Zhou, Baojian [1 ]
Ying, Yiming [2 ]
Skiena, Steven [1 ]
Affiliations
[1] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA
[2] SUNY Albany, Dept Math & Stat, Albany, NY 12222 USA
Keywords
online learning; Follow-The-Regularized-Leader; sparsity; AUC optimization; AREA;
DOI
10.1109/ICDM50108.2020.00097
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Area Under the ROC Curve (AUC) is a widely used performance measure for imbalanced classification, which arises in many application domains where high-dimensional sparse data is abundant. In such settings, each d-dimensional sample has only k non-zero features with k << d, and data arrives sequentially as a stream. Existing online AUC optimization algorithms have a high per-iteration cost of O(d) and generally produce non-sparse solutions, making them ill-suited to this data challenge. In this paper, we directly optimize the AUC score for high-dimensional sparse datasets in the online learning setting and propose a new algorithm, FTRL-AUC. The proposed algorithm processes data in an online fashion at a much cheaper per-iteration cost of O(k), making it amenable to high-dimensional sparse streaming data analysis. Our algorithmic design critically depends on a novel reformulation of the U-statistics AUC objective as an empirical saddle-point problem, and on a "lazy update" rule that reduces the per-iteration complexity from O(d) to O(k). Furthermore, FTRL-AUC captures sparsity more effectively by applying a generalized Follow-The-Regularized-Leader (FTRL) framework. Experiments on real-world datasets demonstrate that FTRL-AUC significantly improves both run time and model sparsity while achieving AUC scores competitive with state-of-the-art methods. A comparison with online learning under the logistic loss shows that FTRL-AUC achieves higher AUC scores, especially on imbalanced datasets.
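To make the O(k)-per-iteration idea concrete, the following is a minimal, generic sketch of lazy per-coordinate FTRL-Proximal updates with L1 regularization. It is NOT the paper's FTRL-AUC algorithm (which optimizes the saddle-point AUC objective described above); the class name, hyperparameters, and update schedule are illustrative assumptions. It only shows how keeping per-coordinate state and touching just the k active features of a sparse sample keeps each step O(k) while L1 thresholding yields sparse weights.

```python
import math

class SparseFTRL:
    """Illustrative FTRL-Proximal learner with lazy, per-coordinate state.

    Samples are sparse dicts {feature_index: value}; each predict/update
    touches only the k non-zero coordinates, never all d of them.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated adjusted gradients
        self.n = {}  # per-coordinate sum of squared gradients

    def weight(self, i):
        """Closed-form weight; exactly zero when |z_i| <= l1 (sparsity)."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / denom

    def predict(self, x):
        """Score a sparse sample x = {index: value}; cost O(k)."""
        return sum(self.weight(i) * v for i, v in x.items())

    def update(self, x, grad_scale):
        """Apply the gradient grad_scale * x, touching only active coords."""
        for i, v in x.items():
            g = grad_scale * v
            n_old = self.n.get(i, 0.0)
            n_new = n_old + g * g
            sigma = (math.sqrt(n_new) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self.weight(i)
            self.n[i] = n_new
```

Because state is stored in dictionaries keyed by feature index, coordinates that never appear in any sample consume no memory and are never visited, which is the essence of the "lazy update" strategy the abstract refers to.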
Pages: 881-890 (10 pages)
Related Papers
50 records in total
  • [21] High-Dimensional Sparse Linear Bandits
    Hao, Botao
    Lattimore, Tor
    Wang, Mengdi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [22] Sparse High-Dimensional Models in Economics
    Fan, Jianqing
    Lv, Jinchi
    Qi, Lei
    ANNUAL REVIEW OF ECONOMICS, VOL 3, 2011, 3 : 291 - 317
  • [23] High-Dimensional Indexing by Sparse Approximation
    Borges, Pedro
    Mourao, Andre
    Magalhaes, Joao
    ICMR'15: PROCEEDINGS OF THE 2015 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2015, : 163 - 170
  • [24] Classification with High-Dimensional Sparse Samples
    Huang, Dayu
    Meyn, Sean
    2012 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY PROCEEDINGS (ISIT), 2012,
  • [25] Interpolation of sparse high-dimensional data
    Thomas C. H. Lux
    Layne T. Watson
    Tyler H. Chang
    Yili Hong
    Kirk Cameron
    Numerical Algorithms, 2021, 88 : 281 - 313
  • [26] High-dimensional sparse Fourier algorithms
    Choi, Bosu
    Christlieb, Andrew
    Wang, Yang
    NUMERICAL ALGORITHMS, 2021, 87 (01) : 161 - 186
  • [28] Pattern discovery for high-dimensional binary datasets
    Snasel, Vaclav
    Moravec, Pavel
    Husek, Dusan
    Frolov, Alexander
    Rezankova, Hana
    Polyakov, Pavel
    NEURAL INFORMATION PROCESSING, PART I, 2008, 4984 : 861 - +
  • [29] Visual terrain analysis of high-dimensional datasets
    Li, W
    Ong, KL
    Ng, WK
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 593 - 600
  • [30] Quantifying and comparing features in high-dimensional datasets
    Piringer, Harald
    Berger, Wolfgang
    Hauser, Helwig
    PROCEEDINGS OF THE 12TH INTERNATIONAL INFORMATION VISUALISATION, 2008, : 240 - 245