SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1]
Han, Zhenhua [2]
Yang, Zhi [1]
Zhang, Quanlu [2]
Li, Mingxia [3]
Yang, Fan [2]
Zhang, Qianxi [2]
Li, Binyang [4]
Yang, Yuqing [2]
Qiu, Lili [2]
Zhang, Lintao [5]
Zhou, Lidong [2]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] USTC, Hefei, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] BaseBit Technol, Hong Kong, Peoples R China
Keywords
Machine learning systems; cloud computing; cache systems
DOI
10.1145/3552326.3567499
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Deep learning training on cloud platforms usually follows the tradition of separating storage from compute. Training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster that hosts the storage service. To alleviate the potential IO bottleneck, a training cluster usually uses its local storage as a cache to reduce remote IO from the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystems for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to develop a closed-form analytic model that captures the diverse cache and remote IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
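The abstract does not state the closed-form model itself. As a rough illustration of how cache size and remote IO bandwidth could jointly bound a job's speed, the sketch below assumes that each epoch reads every sample exactly once in random order, so the cache hit ratio is approximately the cached fraction of the dataset. The function name, parameters, and the formula are illustrative assumptions, not the paper's actual estimator.

```python
def estimated_throughput(compute_tput: float,
                         remote_io_bw: float,
                         cache_size: float,
                         dataset_size: float) -> float:
    """Hypothetical estimate of a training job's data throughput.

    compute_tput and remote_io_bw share one rate unit (e.g. MB/s);
    cache_size and dataset_size share one size unit (e.g. GB).
    Assumption: uniform random access, one pass over the data per epoch.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")

    # Under the assumed access pattern, hit ratio equals the cached fraction.
    hit_ratio = min(max(cache_size, 0.0) / dataset_size, 1.0)
    miss_ratio = 1.0 - hit_ratio

    # A fully cached dataset needs no remote IO, so compute is the only limit.
    if miss_ratio == 0.0:
        return compute_tput

    # Only cache misses consume remote bandwidth, so the sustainable
    # data rate is the remote bandwidth divided by the miss ratio.
    io_tput = remote_io_bw / miss_ratio
    return min(compute_tput, io_tput)
```

Under these assumptions, a job that can consume 100 MB/s with 20 MB/s of remote bandwidth and half of its dataset cached would be IO-bound at 40 MB/s, which suggests why a scheduler that ignores cache allocation could misjudge job speed.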
Pages: 883-898
Page count: 16