MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Cited by: 0
Authors
Choudhury, Arnab [1]
Wang, Yang [1, 2]
Pelkonen, Tuomas [1]
Srinivasan, Kutta [1]
Jain, Abha [1]
Lin, Shenghao [1]
David, Delia [1]
Soleimanifard, Siavash [1]
Chen, Michael [1]
Yadav, Abhishek [1]
Tijoriwala, Ritesh [1]
Samoylov, Denis [1]
Tang, Chunqiang [1]
Affiliations
[1] Meta Platforms, Menlo Park, CA 94025, USA
[2] Ohio State University, Columbus, OH 43210, USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In public clouds, users must manually select a data-center region to upload their ML training data and launch ML training workloads in the same region to ensure data and computation colocation. Unfortunately, isolated decisions by individual users can lead to a mismatch between workload demand and hardware supply across regions, hurting the cloud provider's hardware utilization and profitability. To address this problem in Meta's hyperscale private cloud, we provide a global-scheduling abstraction to all ML training workloads. Users simply submit their training workloads to MAST, our global scheduler, and rely on it to intelligently place both data and training workloads to different regions. We describe three design principles that enable MAST to schedule complex ML training workloads at a global scale: temporal decoupling, scope decoupling, and exhaustive search. MAST successfully balances the load across global regions. Before MAST, the most overloaded region had a GPU demand-to-supply ratio of 2.63 for high-priority workloads. With MAST, this ratio has been reduced to 0.98, effectively eliminating the overload.
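The GPU demand-to-supply ratio quoted in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the region names, the GPU counts (chosen only so that the resulting ratios echo the 2.63 and 0.98 figures above), and the `spread_proportionally` helper, which is a deliberately naive stand-in for global placement and not MAST's actual algorithm.

```python
def demand_to_supply(demand, supply):
    """Return the GPU demand-to-supply ratio for each region."""
    return {region: demand[region] / supply[region] for region in supply}


def spread_proportionally(demand, supply):
    """Redistribute total demand across regions in proportion to supply.

    Hypothetical helper: a naive stand-in for global placement,
    not the scheduler described in the abstract.
    """
    total_demand = sum(demand.values())
    total_supply = sum(supply.values())
    return {r: total_demand * supply[r] / total_supply for r in supply}


# Hypothetical GPU counts, invented for illustration only.
supply = {"region_a": 100, "region_b": 100, "region_c": 100}
user_chosen_demand = {"region_a": 263, "region_b": 20, "region_c": 11}

# Ratios when each user picks a region independently.
before = demand_to_supply(user_chosen_demand, supply)

# Ratios after demand is spread globally in proportion to supply.
after = demand_to_supply(spread_proportionally(user_chosen_demand, supply), supply)

worst_before = max(before, key=before.get)
print(worst_before, round(before[worst_before], 2))
print(round(max(after.values()), 2))
```

With these made-up inputs, the most overloaded region has a ratio above 1 before redistribution and every region falls below 1 afterwards, which is the sense in which a ratio under 1 means the overload is eliminated.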
Pages: 563-580
Page count: 18