In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Cited by: 4
Authors
Xu, Lijie [1 ,2 ]
Qiu, Shuang [3 ]
Yuan, Binhang [1 ]
Jiang, Jiawei [1 ]
Renggli, Cedric [1 ]
Gan, Shaoduo [1 ]
Kara, Kaan [1 ]
Li, Guoliang [4 ]
Liu, Ji [5 ]
Wu, Wentao [6 ]
Ye, Jieping [7 ]
Zhang, Ce [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing, Peoples R China
[3] Univ Chicago, Chicago, IL 60637 USA
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Kwai Inc, Beijing, Peoples R China
[6] Microsoft Res, Redmond, WA USA
[7] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
Swiss National Science Foundation; EU Horizon 2020;
Keywords
In-database machine learning; Stochastic Gradient Descent; Shuffle; ALGEBRA; SYSTEM;
DOI
10.1145/3514221.3526150
Chinese Library Classification (CLC) code
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all of them have room for improvement: they suffer either in I/O performance or in convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD while running 1.6x-12.8x faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Pages: 1286-1300 (15 pages)
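Illustrative sketch
The abstract describes CorgiPile as a hierarchical shuffling strategy that avoids a full data shuffle yet converges like fully shuffled SGD. As a minimal, non-authoritative sketch of that idea, assuming the two-level scheme of first shuffling the order of on-disk blocks and then shuffling tuples inside an in-memory buffer of blocks (the block representation, buffer_size parameter, and grad_step callback below are illustrative assumptions, not the paper's actual PostgreSQL operators), one SGD epoch could look like this in Python:

    import random

    def corgipile_style_epoch(blocks, buffer_size, grad_step):
        """One SGD epoch over block-stored data using a two-level shuffle.

        blocks      -- list of blocks; each block is a list of training tuples
                       stored contiguously, so reading a block is sequential I/O
        buffer_size -- number of blocks held in memory at a time (illustrative)
        grad_step   -- callback applying one SGD update to a single tuple
        """
        order = list(range(len(blocks)))
        random.shuffle(order)                      # level 1: shuffle block order (cheap, metadata only)
        for start in range(0, len(order), buffer_size):
            buffer = []
            for block_id in order[start:start + buffer_size]:
                buffer.extend(blocks[block_id])    # sequential read of each selected block
            random.shuffle(buffer)                 # level 2: shuffle tuples inside the buffer
            for example in buffer:
                grad_step(example)                 # SGD update on the doubly-shuffled stream

In this sketch, setting buffer_size to the total number of blocks degenerates to a full shuffle, while buffer_size = 1 yields purely block-sequential reads; the buffer size is thus the knob that trades I/O sequentiality against randomness.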