SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1 ]
Han, Zhenhua [2 ]
Yang, Zhi [1 ]
Zhang, Quanlu [2 ]
Li, Mingxia [3 ]
Yang, Fan [2 ]
Zhang, Qianxi [2 ]
Li, Binyang [4 ]
Yang, Yuqing [2 ]
Qiu, Lili [2 ]
Zhang, Lintao [5 ]
Zhou, Lidong [2 ]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] USTC, Hefei, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] BaseBit Technol, Hong Kong, Peoples R China
Keywords
Machine learning systems; cloud computing; cache systems
DOI
10.1145/3552326.3567499
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
Deep learning training on cloud platforms usually follows the tradition of separating storage and computing: training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential IO bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote IO to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to derive a closed-form analytic model that captures the diverse cache/remote IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
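The abstract describes a closed-form throughput estimator built on the data access pattern of deep learning training (each epoch reads the whole dataset once in random order, so the expected cache hit ratio is roughly the allocated cache size divided by the dataset size). The following is a minimal illustrative sketch of that kind of estimator, not the paper's exact formulation; the class, function names, and the min(compute, IO) model are assumptions made for illustration.

# Illustrative sketch (assumed model, not SiloD's published equations) of a
# closed-form estimator for training throughput under a given cache and
# remote-IO allocation.

from dataclasses import dataclass


@dataclass
class JobProfile:
    dataset_size_gb: float      # total size of the training dataset
    compute_tput_gbps: float    # data consumption rate when IO is not the bottleneck


def estimated_throughput(job: JobProfile,
                         cache_alloc_gb: float,
                         remote_io_gbps: float) -> float:
    """Estimate the data throughput (GB/s) a job can sustain.

    Assumed model: with random per-epoch access, hit ratio ~ cache / dataset.
    Misses must be fetched over the allocated remote-IO bandwidth, so the IO
    subsystem can feed the job at remote_io_gbps / miss_ratio, and the job
    runs at the slower of compute and IO.
    """
    hit_ratio = min(cache_alloc_gb / job.dataset_size_gb, 1.0)
    miss_ratio = 1.0 - hit_ratio
    if miss_ratio == 0.0:
        io_tput = float("inf")   # fully cached: remote IO never limits the job
    else:
        io_tput = remote_io_gbps / miss_ratio
    return min(job.compute_tput_gbps, io_tput)


# A scheduler could call such an estimator to compare candidate allocations:
job = JobProfile(dataset_size_gb=500.0, compute_tput_gbps=2.0)
print(estimated_throughput(job, cache_alloc_gb=250.0, remote_io_gbps=0.5))  # IO-bound: 1.0
print(estimated_throughput(job, cache_alloc_gb=450.0, remote_io_gbps=0.5))  # compute-bound: 2.0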
Pages: 883-898
Page count: 16