SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1 ]
Han, Zhenhua [2 ]
Yang, Zhi [1 ]
Zhang, Quanlu [2 ]
Li, Mingxia [3 ]
Yang, Fan [2 ]
Zhang, Qianxi [2 ]
Li, Binyang [4 ]
Yang, Yuqing [2 ]
Qiu, Lili [2 ]
Zhang, Lintao [5 ]
Zhou, Lidong [2 ]
Affiliations
[1] Peking University, Beijing, People's Republic of China
[2] Microsoft Research, Beijing, People's Republic of China
[3] USTC, Hefei, People's Republic of China
[4] Microsoft, Beijing, People's Republic of China
[5] BaseBit Technologies, Hong Kong, People's Republic of China
Keywords
Machine learning systems; cloud computing; cache systems
DOI
10.1145/3552326.3567499
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning training on cloud platforms usually follows the tradition of separating storage from compute. Training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster that hosts the storage service. To alleviate the potential IO bottleneck, a training cluster usually uses its local storage as a cache to reduce remote IO from the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs. This can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache and remote IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
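The abstract does not give the form of the analytic model, so the following is only an illustrative sketch of how an estimator could combine compute, cache, and remote IO. It assumes each epoch reads every sample once in uniformly random order, so the cache hit ratio is approximated by the cached fraction of the dataset. The function name `estimated_throughput` and all of its parameters are hypothetical and are not taken from the paper.

```python
def estimated_throughput(compute_tput, remote_bw, cache_size, dataset_size):
    """Toy estimate of a job's achievable data rate (same unit as compute_tput).

    Assumption (not from the source): uniform random access over the dataset,
    so hit ratio ~= cache_size / dataset_size, capped at 1.
    """
    hit_ratio = min(1.0, cache_size / dataset_size)
    if hit_ratio >= 1.0:
        # Whole dataset is cached: the job is limited by compute only.
        return compute_tput
    # Only cache misses consume remote bandwidth, so the sustainable
    # data-loading rate is remote_bw / (1 - hit_ratio).
    io_tput = remote_bw / (1.0 - hit_ratio)
    # The job runs at the slower of the compute stage and the IO stage.
    return min(compute_tput, io_tput)


# A job that can consume 100 MB/s, with 20 MB/s of remote bandwidth
# and half of its dataset cached, is IO-bound in this toy model.
print(estimated_throughput(100.0, 20.0, 50.0, 100.0))
```

Under this simplification, giving a job more cache or more remote bandwidth helps only until its IO rate reaches its compute rate, which is the kind of per-job difference a scheduler that manages storage would need to account for.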
Pages: 883-898
Number of pages: 16