Carbon-Aware and Fault-Tolerant Migration of Deep Learning Workloads in the Geo-distributed Cloud

Cited by: 0
Authors
Park, Jeonghyeon [1]
Kim, Daero [1]
Kim, Jiseon [1]
Han, Jungkyu [1]
Chun, Sejin [1]
Affiliations
[1] Dong A Univ, Dept Comp Engn & Artificial Intelligence, Busan, South Korea
Funding
National Research Foundation of Singapore;
Keywords
carbon-aware; fault-tolerant; geo-distributed cloud; deep learning; task migration; OPTIMIZATION; FOOTPRINT; POWER;
DOI
10.1109/CLOUD62652.2024.00062
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Recently, many deep learning models have been trained in geographically distributed data centers. The carbon emissions produced by training the models may pose a significant threat to climate change like increasing temperatures. Existing studies have a hardship in shifting the workload of training models to a data center with low carbon emissions. So, they fail to ensure low emissions of the workload during training, especially when long-term workloads like Large Language Models (LLMs) are trained. To cope with this problem, we propose a method that shifts the workload to a cloud with low carbon emissions while enduring a lack of computational resources. Specifically, we define a task scheduler that includes states and their transitions to migrate mini-batches dynamically. Next, we present a fault-tolerant control that optimizes a GPU frequency to adapt to workload variations of training models while guaranteeing its power consumption. Last, we conducted exhaustive experiments using real-world data in terms of carbon emissions, transfer time, and power consumption compared to state-of-the-art methods.
Pages: 494-501
Page count: 8