In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Cited by: 4
Authors
Xu, Lijie [1 ,2 ]
Qiu, Shuang [3 ]
Yuan, Binhang [1 ]
Jiang, Jiawei [1 ]
Renggli, Cedric [1 ]
Gan, Shaoduo [1 ]
Kara, Kaan [1 ]
Li, Guoliang [4 ]
Liu, Ji [5 ]
Wu, Wentao [6 ]
Ye, Jieping [7 ]
Zhang, Ce [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing, Peoples R China
[3] Univ Chicago, Chicago, IL 60637 USA
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Kwai Inc, Beijing, Peoples R China
[6] Microsoft Res, Redmond, WA USA
[7] Univ Michigan, Ann Arbor, MI 48109 USA
Source
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22) | 2022
Funding
Swiss National Science Foundation; European Union Horizon 2020
Keywords
In-database machine learning; Stochastic Gradient Descent; Shuffle; ALGEBRA; SYSTEM;
DOI
10.1145/3514221.3526150
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all existing strategies have room for improvement: they suffer in terms of either I/O performance or convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of the convergence behavior of CorgiPile. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD while being 1.6x-12.8x faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Pages: 1286-1300
Number of pages: 15
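As a rough illustration of the hierarchical strategy described in the abstract, the sketch below shuffles the order of contiguous blocks, fills a small in-memory buffer with a few blocks at a time, shuffles the tuples inside that buffer, and then runs plain SGD over them. This is a minimal reading of the abstract, not the authors' PostgreSQL implementation; the function names, block size, buffer size, and the least-squares objective are illustrative assumptions.

# Illustrative sketch only (assumed names and parameters, not the paper's code).
# Level 1 shuffles the order of contiguous blocks read sequentially from storage;
# level 2 shuffles tuples inside a small in-memory buffer before running SGD.
import random
import numpy as np

def block_ranges(n_rows, block_size):
    # Split row indices [0, n_rows) into contiguous blocks (sequential I/O units).
    return [range(s, min(s + block_size, n_rows)) for s in range(0, n_rows, block_size)]

def two_level_shuffle_epoch(X, y, w, lr, block_size=4096, blocks_per_buffer=8):
    # One SGD epoch over (X, y) with block-level + tuple-level shuffling,
    # using a least-squares objective purely as an example.
    blocks = block_ranges(len(X), block_size)
    random.shuffle(blocks)                                   # level 1: shuffle block order
    for i in range(0, len(blocks), blocks_per_buffer):
        buffered = [r for blk in blocks[i:i + blocks_per_buffer] for r in blk]
        random.shuffle(buffered)                             # level 2: shuffle tuples in the buffer
        for j in buffered:
            grad = (X[j] @ w - y[j]) * X[j]                  # gradient of 0.5 * (x.w - y)^2
            w = w - lr * grad
    return w

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10)
w = np.zeros(10)
for _ in range(3):
    w = two_level_shuffle_epoch(X, y, w, lr=0.01)

In this reading, the buffer stands in for a bounded amount of working memory, so I/O stays largely sequential (whole blocks are read in order within the shuffled block schedule) while the two levels of shuffling inject enough randomness for SGD to converge, which is the trade-off the abstract describes.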
Related papers
50 in total
  • [21] Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data
    Li, Yuanzhi
    Liang, Yingyu
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [22] Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
    Nacson, Mor Shpigel
    Srebro, Nathan
    Soudry, Daniel
    22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
  • [23] From big data to smart data: a sample gradient descent approach for machine learning
    Ganie, Aadil Gani
    Dadvandipour, Samad
    JOURNAL OF BIG DATA, 2023, 10 (01)
  • [24] SQML: Large-scale in-database machine learning with pure SQL
    Syed, Umar
    Vassilvitskii, Sergei
    PROCEEDINGS OF THE 2017 SYMPOSIUM ON CLOUD COMPUTING (SOCC '17), 2017, : 659 - 659
  • [25] From big data to smart data: a sample gradient descent approach for machine learning
    Aadil Gani Ganie
    Samad Dadvandipour
    Journal of Big Data, 10
  • [26] In-database Distributed Machine Learning: Demonstration using Teradata SQL Engine
    Sandha, Sandeep Singh
    Cabrera, Wellington
    Al-Kateb, Mohammed
    Nair, Sanjay
    Srivastava, Mani
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (12): : 1854 - 1857
  • [27] Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning: Examining Distributed and Centralized Stochastic Gradient Descent
    Pu, Shi
    Olshevsky, Alex
    Paschalidis, Ioannis Ch.
    IEEE SIGNAL PROCESSING MAGAZINE, 2020, 37 (03) : 114 - 122
  • [28] Cost-Based Lightweight Storage Automatic Decision for In-Database Machine Learning
    Cui, Shuangshuang
    Wang, Hongzhi
    Gu, Haiyao
    Xie, Yuntian
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2021, PT I, 2021, 13080 : 119 - 126
  • [29] Fast Stochastic Kalman Gradient Descent for Reinforcement Learning
    Totaro, Simone
    Jonsson, Anders
    LEARNING FOR DYNAMICS AND CONTROL, VOL 144, 2021, 144
  • [30] Towards Learning Stochastic Population Models by Gradient Descent
    Kreikemeyer, Justin N.
    Andelfinger, Philipp
    Uhrmacher, Adelinde M.
    PROCEEDINGS OF THE 38TH ACM SIGSIM INTERNATIONAL CONFERENCE ON PRINCIPLES OF ADVANCED DISCRETE SIMULATION, ACM SIGSIM-PADS 2024, 2024, : 88 - 92