Carbon-Aware and Fault-Tolerant Migration of Deep Learning Workloads in the Geo-distributed Cloud

Cited by: 0
Authors
Park, Jeonghyeon [1]
Kim, Daero [1]
Kim, Jiseon [1]
Han, Jungkyu [1]
Chun, Sejin [1]
Affiliations
[1] Dong A Univ, Dept Comp Engn & Artificial Intelligence, Busan, South Korea
Funding
National Research Foundation, Singapore;
Keywords
carbon-aware; fault-tolerant; geo-distributed cloud; deep learning; task migration; OPTIMIZATION; FOOTPRINT; POWER;
DOI
10.1109/CLOUD62652.2024.00062
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Recently, many deep learning models have been trained in geographically distributed data centers. The carbon emissions produced by training these models may contribute significantly to climate change, for example through rising temperatures. Existing studies struggle to shift the training workload to a data center with low carbon emissions. As a result, they fail to keep the workload's emissions low during training, especially for long-running workloads such as Large Language Models (LLMs). To address this problem, we propose a method that shifts the workload to a cloud with low carbon emissions while tolerating a shortage of computational resources. Specifically, we define a task scheduler whose states and transitions migrate mini-batches dynamically. Next, we present a fault-tolerant control that optimizes the GPU frequency to adapt to workload variations of the training models while guaranteeing its power consumption. Finally, we conduct extensive experiments on real-world data, comparing against state-of-the-art methods in terms of carbon emissions, transfer time, and power consumption.
Pages: 494 - 501
Number of pages: 8
Related papers
50 in total
  • [1] Carbon-Aware Online Control of Geo-Distributed Cloud Services
    Zhou, Zhi
    Liu, Fangming
    Zou, Ruolan
    Liu, Jiangchuan
    Xu, Hong
    Jin, Hai
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2016, 27 (09) : 2506 - 2519
  • [2] Carbon-aware Load Balancing for Geo-distributed Cloud Services
    Zhou, Zhi
    Liu, Fangming
    Xu, Yong
    Zou, Ruolan
    Xu, Hong
    Lui, John C. S.
    Jin, Hai
    2013 IEEE 21ST INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS & SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2013), 2013, : 232 - +
  • [3] Distributed Cost-Aware Fault-Tolerant Load Balancing in Geo-Distributed Data Centers
    Tripathi, Rakesh
    Sivaraman, Vignesh
    Tamarapalli, Venkatesh
    IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, 2022, 6 (01) : 472 - 483
  • [4] Cost-aware Capacity Provisioning for Fault-tolerant Geo-distributed Data Centers
    Tripathi, Rakesh
    Vignesh, S.
    Tamarapalli, Venkatesh
    2016 8TH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORKS (COMSNETS), 2016,
  • [5] Electricity and Carbon-aware Task Scheduling in Geo-distributed Internet Data Centers
    Wang, Peng
    Liu, Wenyu
    Cheng, Ming
    Ding, Zhaohao
    Wang, Yi
    2022 IEEE/IAS INDUSTRIAL AND COMMERCIAL POWER SYSTEM ASIA (I&CPS ASIA 2022), 2022, : 1416 - 1421
  • [6] Cost-aware & Fault-tolerant Geo-distributed Edge Computing for Low-latency Stream Processing
    Xu, Jinlai
    Palanisamy, Balaji
    2021 IEEE 7TH INTERNATIONAL CONFERENCE ON COLLABORATION AND INTERNET COMPUTING (CIC 2021), 2021, : 117 - 124
  • [7] Fault-tolerant scheduling and data placement for scientific workflow processing in geo-distributed clouds
    Li, Chunlin
    Liu, Jun
    Wang, Min
    Luo, Youlong
    JOURNAL OF SYSTEMS AND SOFTWARE, 2022, 187
  • [8] A Novel Fault-Tolerant Aware Task Scheduler Using Deep Reinforcement Learning in Cloud Computing
    Krishna, Mallu Shiva Rama
    Mangalampalli, Sudheer
    APPLIED SCIENCES-BASEL, 2023, 13 (21):
  • [9] Cost and green aware workload migration on geo-distributed datacentres
    Jiang J.
    Wu Y.
    Xiang D.
    Yu K.
    Wang T.
    International Journal of Information Technology and Management, 2019, 18 (2-3) : 213 - 226
  • [10] Cost Efficient Design of Fault Tolerant Geo-Distributed Data Centers
    Tripathi, Rakesh
    Vignesh, S.
    Tamarapalli, Venkatesh
    Medhi, Deep
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2017, 14 (02) : 289 - 301