Carbon-Aware and Fault-Tolerant Migration of Deep Learning Workloads in the Geo-distributed Cloud

Cited by: 0
Authors
Park, Jeonghyeon [1 ]
Kim, Daero [1 ]
Kim, Jiseon [1 ]
Han, Jungkyu [1 ]
Chun, Sejin [1 ]
Affiliations
[1] Dong A Univ, Dept Comp Engn & Artificial Intelligence, Busan, South Korea
Source
2024 IEEE 17TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, CLOUD 2024 | 2024
Funding
National Research Foundation, Singapore
Keywords
carbon-aware; fault-tolerant; geo-distributed cloud; deep learning; task migration; OPTIMIZATION; FOOTPRINT; POWER;
DOI
10.1109/CLOUD62652.2024.00062
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Recently, many deep learning models have been trained in geographically distributed data centers. The carbon emissions produced by training these models may contribute significantly to climate change, for example through rising temperatures. Existing studies struggle to shift the training workload to a data center with low carbon emissions. As a result, they fail to keep the workload's emissions low during training, especially for long-running workloads such as Large Language Models (LLMs). To address this problem, we propose a method that shifts the workload to a cloud with low carbon emissions while tolerating a shortage of computational resources. Specifically, we define a task scheduler whose states and transitions migrate mini-batches dynamically. Next, we present a fault-tolerant control that optimizes GPU frequency to adapt to workload variations of the training models while guaranteeing its power consumption. Finally, we conduct extensive experiments on real-world data, comparing carbon emissions, transfer time, and power consumption against state-of-the-art methods.
Pages: 494 - 501
Number of pages: 8