Efficient Coreset Selection with Cluster-based Methods

Cited by: 7

Authors:

Chai, Chengliang [1]

Wang, Jiayi [2]

Tang, Nan [3]

Yuan, Ye [1]

Liu, Jiabin [1]

Deng, Yuhao [1]

Wang, Guoren [1]

Affiliations:

[1] Beijing Inst Technol, Beijing, Peoples R China

[2] Tsinghua Univ, Beijing, Peoples R China

[3] HKUST GZ, Guangzhou, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023 | 2023

Funding:

National Key Research and Development Program of China;

Keywords:

Coreset selection; Data-efficient ML; Product quantization; NEAREST-NEIGHBOR;

DOI:

10.1145/3580305.3599326

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Coreset selection is a technique for efficient machine learning that selects a subset of the training data to achieve model performance similar to that obtained with the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates the data items in the coreset, is time-consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow on large training datasets because it requires multiple iterations and pairwise distance computations in each iteration. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e., gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA gradient approximation approaches that do not train machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items that are close in feature space (measured by Euclidean distance). Our framework further demonstrates that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, allowing for more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we improve efficiency by 3-10x compared with SOTA with almost no loss of accuracy.
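
The abstract refers to a maximum feature distance between an item and a cluster. The following is a minimal sketch of one generic way such a quantity can be upper-bounded without scanning every cluster member, using only the triangle inequality. It is not the authors' algorithm and does not use product quantization. The function names (`kmeans`, `cluster_radius`, `max_distance_upper_bound`), the toy data, and the plain Lloyd-style clustering are all assumptions made for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def assign(points, centers):
    """Group points by their nearest center (Euclidean distance)."""
    groups = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        groups[j].append(p)
    return groups


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd iterations (illustrative, not tuned for speed)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = assign(points, centers)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean_point(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, assign(points, centers)


def cluster_radius(center, members):
    """Largest distance from the center to any cluster member."""
    return max(dist(center, m) for m in members)


def max_distance_upper_bound(x, center, radius):
    """Triangle inequality: for every member m of the cluster,
    d(x, m) <= d(x, center) + d(center, m) <= d(x, center) + radius."""
    return dist(x, center) + radius


# Toy data (assumption): 200 random points in 4 dimensions.
_rng = random.Random(42)
data = [[_rng.random() for _ in range(4)] for _ in range(200)]
centers, groups = kmeans(data, k=5)
radii = [cluster_radius(c, g) if g else 0.0 for c, g in zip(centers, groups)]
```

With the radii precomputed once per cluster, the bound for a given item costs one distance computation per cluster instead of one per member, which is the kind of saving that iterating over clusters rather than individual items is meant to provide.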


Pages: 167-178

Page count: 12

