SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters

Cited by: 4
Authors
Zhao, Hanyu [1 ]
Han, Zhenhua [2 ]
Yang, Zhi [1 ]
Zhang, Quanlu [2 ]
Li, Mingxia [3 ]
Yang, Fan [2 ]
Zhang, Qianxi [2 ]
Li, Binyang [4 ]
Yang, Yuqing [2 ]
Qiu, Lili [2 ]
Zhang, Lintao [5 ]
Zhou, Lidong [2 ]
Affiliations
[1] Peking Univ, Beijing, Peoples R China
[2] Microsoft Res, Beijing, Peoples R China
[3] USTC, Hefei, Peoples R China
[4] Microsoft, Beijing, Peoples R China
[5] BaseBit Technol, Hong Kong, Peoples R China
Keywords
Machine learning systems; cloud computing; cache systems
DOI
10.1145/3552326.3567499
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Subject Classification Code
0812
Abstract
Deep learning training on cloud platforms usually follows the tradition of separating storage and computing: training executes on a compute cluster equipped with GPUs/TPUs while reading data from a separate cluster hosting the storage service. To alleviate the potential IO bottleneck, a training cluster usually leverages its local storage as a cache to reduce remote IO to the storage cluster. However, existing deep learning schedulers do not manage storage resources and thus fail to consider the diverse caching effects across different training jobs, which can degrade scheduling quality significantly. To address this issue, we present SiloD, a scheduling framework that co-designs the cluster scheduler and the cache subsystem for deep learning training. SiloD treats cache and remote IO as first-class resources and can integrate different state-of-the-art deep learning scheduling policies in a unified scheduling framework. To achieve this, SiloD develops an enhanced job performance estimator that helps different schedulers jointly consider the impact of storage and compute resource allocation while preserving their respective scheduling objectives. The SiloD-enhanced performance estimator leverages the unique data access pattern of deep learning training to derive a closed-form analytic model that captures the diverse cache/remote IO requirements of different training jobs. Evaluations show that SiloD improves average job completion time, cluster utilization, and fairness by up to 7.4x, 2.57x, and 1.89x, respectively, compared to different combinations of cache systems and cluster schedulers that operate independently.
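The abstract describes a closed-form throughput estimator built on the data access pattern of deep learning training (each epoch reads the whole dataset once in random order, so the expected cache hit ratio is roughly the allocated cache size divided by the dataset size). The following is a minimal illustrative sketch of that kind of estimator, not the paper's exact formulation; the class, function names, and the min(compute, IO) model are assumptions made for illustration.

# Illustrative sketch (assumed model, not SiloD's published equations) of a
# closed-form estimator for training throughput under a given cache and
# remote-IO allocation.

from dataclasses import dataclass


@dataclass
class JobProfile:
    dataset_size_gb: float      # total size of the training dataset
    compute_tput_gbps: float    # data consumption rate when IO is not the bottleneck


def estimated_throughput(job: JobProfile,
                         cache_alloc_gb: float,
                         remote_io_gbps: float) -> float:
    """Estimate the data throughput (GB/s) a job can sustain.

    Assumed model: with random per-epoch access, hit ratio ~ cache / dataset.
    Misses must be fetched over the allocated remote-IO bandwidth, so the IO
    subsystem can feed the job at remote_io_gbps / miss_ratio, and the job
    runs at the slower of compute and IO.
    """
    hit_ratio = min(cache_alloc_gb / job.dataset_size_gb, 1.0)
    miss_ratio = 1.0 - hit_ratio
    if miss_ratio == 0.0:
        io_tput = float("inf")   # fully cached: remote IO never limits the job
    else:
        io_tput = remote_io_gbps / miss_ratio
    return min(job.compute_tput_gbps, io_tput)


# A scheduler could call such an estimator to compare candidate allocations:
job = JobProfile(dataset_size_gb=500.0, compute_tput_gbps=2.0)
print(estimated_throughput(job, cache_alloc_gb=250.0, remote_io_gbps=0.5))  # IO-bound: 1.0
print(estimated_throughput(job, cache_alloc_gb=450.0, remote_io_gbps=0.5))  # compute-bound: 2.0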
Pages: 883-898
Page count: 16