Efficient Coreset Selection with Cluster-based Methods

Cited by: 7
Authors
Chai, Chengliang [1]
Wang, Jiayi [2]
Tang, Nan [3]
Yuan, Ye [1]
Liu, Jiabin [1]
Deng, Yuhao [1]
Wang, Guoren [1]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Tsinghua Univ, Beijing, Peoples R China
[3] HKUST GZ, Guangzhou, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Coreset selection; Data-efficient ML; Product quantization; NEAREST-NEIGHBOR;
DOI
10.1145/3580305.3599326
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Coreset selection is a technique for efficient machine learning: it selects a subset of the training data that achieves model performance similar to training on the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates the data items in the coreset, is time consuming. Coreset selection without training selects the coreset before training begins. Gradient approximation is the typical method of this kind, but it can also be slow on large training datasets because it requires multiple iterations, each with pairwise distance computations. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e., gradient approximation. In this paper, we aim to significantly improve the efficiency of coreset selection while preserving good effectiveness, by improving the SOTA gradient-based approaches that require no model training. Specifically, we present a highly efficient coreset selection framework that uses an approximation of the gradient. The framework divides the entire training set into multiple clusters, each of which contains items with similar feature distances (calculated using the Euclidean distance). We further show that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, which allows more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we can improve efficiency by 3-10× compared with SOTA, with almost no loss of accuracy.
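The abstract mentions product quantization as the tool for estimating feature distances cheaply. As a rough illustration of that general technique only, the sketch below shows textbook product quantization with asymmetric distance computation in plain Python. It is not the paper's estimator of the maximum item-to-cluster distance, and every name in it (`kmeans`, `split`, `pq_train`, `pq_encode`, `pq_distance`) is a hypothetical helper introduced here. It assumes the vector dimension is divisible by the number of subspaces `m`.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct positions
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[j].append(p)
        # Recompute each centroid as its bucket mean; keep the old one if empty.
        centroids = [
            tuple(sum(col) / len(b) for col in zip(*b)) if b else centroids[j]
            for j, b in enumerate(buckets)
        ]
    return centroids


def split(vec, m):
    """Cut a vector into m contiguous sub-vectors (assumes len(vec) % m == 0)."""
    step = len(vec) // m
    return [tuple(vec[i * step:(i + 1) * step]) for i in range(m)]


def pq_train(data, m, k):
    """Learn one codebook of k centroids for each of the m subspaces."""
    return [kmeans([split(v, m)[s] for v in data], k, seed=s) for s in range(m)]


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    m = len(codebooks)
    return [
        min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
        for sub, cb in zip(split(vec, m), codebooks)
    ]


def pq_distance(query, code, codebooks):
    """Approximate Euclidean distance: exact query vs. quantized item."""
    m = len(codebooks)
    total = sum(
        sq_dist(sub, cb[c])
        for sub, cb, c in zip(split(query, m), codebooks, code)
    )
    return math.sqrt(total)
```

In this sketch each item is stored as `m` small integers instead of a full vector, and a query-to-item distance costs `m` sub-vector comparisons against the codebooks. In practice one would typically precompute a per-query lookup table of query-to-centroid distances so that each item needs only `m` table lookups.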
Pages: 167-178
Number of pages: 12