In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Cited by: 4
Authors
Xu, Lijie [1 ,2 ]
Qiu, Shuang [3 ]
Yuan, Binhang [1 ]
Jiang, Jiawei [1 ]
Renggli, Cedric [1 ]
Gan, Shaoduo [1 ]
Kara, Kaan [1 ]
Li, Guoliang [4 ]
Liu, Ji [5 ]
Wu, Wentao [6 ]
Ye, Jieping [7 ]
Zhang, Ce [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing, Peoples R China
[3] Univ Chicago, Chicago, IL 60637 USA
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Kwai Inc, Beijing, Peoples R China
[6] Microsoft Res, Redmond, WA USA
[7] Univ Michigan, Ann Arbor, MI 48109 USA
Source
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22) | 2022
Funding
Swiss National Science Foundation; European Union Horizon 2020
Keywords
In-database machine learning; Stochastic Gradient Descent; Shuffle; ALGEBRA; SYSTEM;
DOI
10.1145/3514221.3526150
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of the convergence behavior of CorgiPile. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6x-12.8x faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Pages: 1286-1300
Number of pages: 15
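As a rough illustration of the hierarchical strategy described in the abstract, the sketch below shuffles the order of contiguous blocks, fills a small in-memory buffer with a few blocks at a time, shuffles the tuples inside that buffer, and then runs plain SGD over them. This is a minimal reading of the abstract, not the authors' PostgreSQL implementation; the function names, block size, buffer size, and the least-squares objective are illustrative assumptions.

# Illustrative sketch only (assumed names and parameters, not the paper's code).
# Level 1 shuffles the order of contiguous blocks read sequentially from storage;
# level 2 shuffles tuples inside a small in-memory buffer before running SGD.
import random
import numpy as np

def block_ranges(n_rows, block_size):
    # Split row indices [0, n_rows) into contiguous blocks (sequential I/O units).
    return [range(s, min(s + block_size, n_rows)) for s in range(0, n_rows, block_size)]

def two_level_shuffle_epoch(X, y, w, lr, block_size=4096, blocks_per_buffer=8):
    # One SGD epoch over (X, y) with block-level + tuple-level shuffling,
    # using a least-squares objective purely as an example.
    blocks = block_ranges(len(X), block_size)
    random.shuffle(blocks)                                   # level 1: shuffle block order
    for i in range(0, len(blocks), blocks_per_buffer):
        buffered = [r for blk in blocks[i:i + blocks_per_buffer] for r in blk]
        random.shuffle(buffered)                             # level 2: shuffle tuples in the buffer
        for j in buffered:
            grad = (X[j] @ w - y[j]) * X[j]                  # gradient of 0.5 * (x.w - y)^2
            w = w - lr * grad
    return w

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10)
w = np.zeros(10)
for _ in range(3):
    w = two_level_shuffle_epoch(X, y, w, lr=0.01)

In this reading, the buffer stands in for a bounded amount of working memory, so I/O stays largely sequential (whole blocks are read in order within the shuffled block schedule) while the two levels of shuffling inject enough randomness for SGD to converge, which is the trade-off the abstract describes.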
Related papers
50 in total
  • [21] Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
    Li, Yuanzhi
    Liang, Yingyu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [22] Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
    Nacson, Mor Shpigel
    Srebro, Nathan
    Soudry, Daniel
    22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
  • [23] From big data to smart data: a sample gradient descent approach for machine learning
    Ganie, Aadil Gani
    Dadvandipour, Samad
    JOURNAL OF BIG DATA, 2023, 10 (01)
  • [24] SQML: Large-scale in-database machine learning with pure SQL
    Syed, Umar
    Vassilvitskii, Sergei
    PROCEEDINGS OF THE 2017 SYMPOSIUM ON CLOUD COMPUTING (SOCC '17), 2017, : 659 - 659
  • [25] From big data to smart data: a sample gradient descent approach for machine learning
    Aadil Gani Ganie
    Samad Dadvandipour
    Journal of Big Data, 10
  • [26] In-database Distributed Machine Learning: Demonstration using Teradata SQL Engine
    Sandha, Sandeep Singh
    Cabrera, Wellington
    Al-Kateb, Mohammed
    Nair, Sanjay
    Srivastava, Mani
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (12): : 1854 - 1857
  • [27] Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent
    Pu, Shi
    Olshevsky, Alex
    Paschalidis, Ioannis Ch.
    IEEE SIGNAL PROCESSING MAGAZINE, 2020, 37 (03) : 114 - 122
  • [28] Cost-Based Lightweight Storage Automatic Decision for In-Database Machine Learning
    Cui, Shuangshuang
    Wang, Hongzhi
    Gu, Haiyao
    Xie, Yuntian
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT I, 2021, 13080 : 119 - 126
  • [29] Fast Stochastic Kalman Gradient Descent for Reinforcement Learning
    Totaro, Simone
    Jonsson, Anders
    LEARNING FOR DYNAMICS AND CONTROL, VOL 144, 2021, 144
  • [30] Towards Learning Stochastic Population Models by Gradient Descent
    Kreikemeyer, Justin N.
    Andelfinger, Philipp
    Uhrmacher, Adelinde M.
    PROCEEDINGS OF THE 38TH ACM SIGSIM INTERNATIONAL CONFERENCE ON PRINCIPLES OF ADVANCED DISCRETE SIMULATION, ACM SIGSIM-PADS 2024, 2024, : 88 - 92