In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Cited by: 4
Authors
Xu, Lijie [1 ,2 ]
Qiu, Shuang [3 ]
Yuan, Binhang [1 ]
Jiang, Jiawei [1 ]
Renggli, Cedric [1 ]
Gan, Shaoduo [1 ]
Kara, Kaan [1 ]
Li, Guoliang [4 ]
Liu, Ji [5 ]
Wu, Wentao [6 ]
Ye, Jieping [7 ]
Zhang, Ce [1 ]
Affiliations
[1] Swiss Fed Inst Technol, Zurich, Switzerland
[2] Chinese Acad Sci, State Key Lab Comp Sci, Inst Software, Beijing, Peoples R China
[3] Univ Chicago, Chicago, IL 60637 USA
[4] Tsinghua Univ, Beijing, Peoples R China
[5] Kwai Inc, Beijing, Peoples R China
[6] Microsoft Res, Redmond, WA USA
[7] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
Swiss National Science Foundation; EU Horizon 2020;
Keywords
In-database machine learning; Stochastic Gradient Descent; Shuffle; ALGEBRA; SYSTEM;
DOI
10.1145/3514221.3526150
Chinese Library Classification (CLC) code
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Stochastic gradient descent (SGD) is the cornerstone of modern ML systems. Despite its computational efficiency, SGD requires random data access that is inherently inefficient when implemented in systems that rely on block-addressable secondary storage such as HDD and SSD, e.g., in-DB ML systems and TensorFlow/PyTorch over large files. To address this impedance mismatch, various data shuffling strategies have been proposed to balance the convergence rate of SGD (which favors randomness) and its I/O performance (which favors sequential access). In this paper, we first conduct a systematic empirical study of existing data shuffling strategies, which reveals that all of them have room for improvement: they suffer either in I/O performance or in convergence rate. With this in mind, we propose a simple but novel hierarchical data shuffling strategy, CorgiPile. Compared with existing strategies, CorgiPile avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We provide a non-trivial theoretical analysis of CorgiPile's convergence behavior. We further integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. Our experimental results show that CorgiPile achieves a convergence rate comparable to full-shuffle SGD while running 1.6x-12.8x faster than two state-of-the-art in-DB ML systems, Apache MADlib and Bismarck, on both HDD and SSD.
Pages: 1286-1300 (15 pages)
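Illustrative sketch
The abstract describes CorgiPile as a hierarchical shuffling strategy that avoids a full data shuffle yet converges like fully shuffled SGD. As a minimal, non-authoritative sketch of that idea, assuming the two-level scheme of first shuffling the order of on-disk blocks and then shuffling tuples inside an in-memory buffer of blocks (the block representation, buffer_size parameter, and grad_step callback below are illustrative assumptions, not the paper's actual PostgreSQL operators), one SGD epoch could look like this in Python:

    import random

    def corgipile_style_epoch(blocks, buffer_size, grad_step):
        """One SGD epoch over block-stored data using a two-level shuffle.

        blocks      -- list of blocks; each block is a list of training tuples
                       stored contiguously, so reading a block is sequential I/O
        buffer_size -- number of blocks held in memory at a time (illustrative)
        grad_step   -- callback applying one SGD update to a single tuple
        """
        order = list(range(len(blocks)))
        random.shuffle(order)                      # level 1: shuffle block order (cheap, metadata only)
        for start in range(0, len(order), buffer_size):
            buffer = []
            for block_id in order[start:start + buffer_size]:
                buffer.extend(blocks[block_id])    # sequential read of each selected block
            random.shuffle(buffer)                 # level 2: shuffle tuples inside the buffer
            for example in buffer:
                grad_step(example)                 # SGD update on the doubly-shuffled stream

In this sketch, setting buffer_size to the total number of blocks degenerates to a full shuffle, while buffer_size = 1 yields purely block-sequential reads; the buffer size is thus the knob that trades I/O sequentiality against randomness.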