SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1]
Han, Zhenhua [2]
Yang, Zhi [1]
Zhang, Quanlu [2]
Li, Mingxia [3]
Yang, Fan [2]
Zhang, Qianxi [2]
Li, Binyang [4]
Yang, Yuqing [2]
Qiu, Lili [2]
Zhang, Lintao [5]
Zhou, Lidong [2]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] USTC, Hefei, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] BaseBit Technol, Hong Kong, Peoples R China
Keywords
Machine learning systems; cloud computing; cache systems
DOI
10.1145/3552326.3567499
CLC number
TP3 [Computing technology, computer technology]
Discipline code
0812
Abstract
Deep learning training on cloud platforms typically follows the tradition of separating storage and compute: training runs on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential IO bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote IO to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to account for the diverse caching effects across training jobs, which can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies into a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache/remote-IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers operating independently.
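To make the idea of a closed-form cache/IO performance model concrete, the following is a minimal illustrative sketch, not the paper's actual estimator: the function name, parameters, and the uniform hit-ratio assumption (each epoch reads the whole dataset once in a shuffled order, so the hit ratio is roughly the cached fraction) are ours.

```python
# Hypothetical sketch in the spirit of SiloD's closed-form model.
# All names and the uniform-access assumption are ours, not the paper's.

def estimated_throughput(cache_frac, cache_bw, remote_bw, compute_tput):
    """Estimate end-to-end training throughput (samples/s) for one job.

    cache_frac   -- fraction of the job's dataset resident in the local cache
    cache_bw     -- read bandwidth from the local cache (samples/s)
    remote_bw    -- read bandwidth from remote storage (samples/s)
    compute_tput -- GPU/TPU-bound throughput ceiling (samples/s)
    """
    # Each epoch reads the full dataset in random order, so under a uniform
    # access pattern the cache hit ratio is roughly the cached fraction.
    io_bw = cache_frac * cache_bw + (1.0 - cache_frac) * remote_bw
    # The job runs at the slower of the data pipeline and the accelerators.
    return min(compute_tput, io_bw)
```

For example, with cache_bw=1000, remote_bw=100, and compute_tput=400, caching 25% of the dataset yields 325 samples/s (IO-bound), while caching 50% yields 400 samples/s (compute-bound); beyond that point extra cache is wasted on this job. A cache-aware scheduler can exploit exactly this kind of diminishing return when dividing cache across jobs.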
Pages: 883-898
Page count: 16