Carbon-Aware and Fault-Tolerant Migration of Deep Learning Workloads in the Geo-distributed Cloud

Authors
Park, Jeonghyeon [1]
Kim, Daero [1]
Kim, Jiseon [1]
Han, Jungkyu [1]
Chun, Sejin [1]
Affiliations
[1] Dong A Univ, Dept Comp Engn & Artificial Intelligence, Busan, South Korea
Source
2024 IEEE 17TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, CLOUD 2024 | 2024
Funding
National Research Foundation of Singapore;
Keywords
carbon-aware; fault-tolerant; geo-distributed cloud; deep learning; task migration; OPTIMIZATION; FOOTPRINT; POWER;
DOI
10.1109/CLOUD62652.2024.00062
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Recently, many deep learning models have been trained in geographically distributed data centers. The carbon emissions produced by training these models may contribute significantly to climate change, for example through rising temperatures. Existing studies have difficulty shifting the training workload to a data center with low carbon emissions. As a result, they fail to keep the workload's emissions low during training, especially for long-running workloads such as Large Language Models (LLMs). To cope with this problem, we propose a method that shifts the workload to a cloud with low carbon emissions while tolerating a lack of computational resources. Specifically, we define a task scheduler that includes states and their transitions to migrate mini-batches dynamically. Next, we present a fault-tolerant control that optimizes the GPU frequency to adapt to workload variations of the training models while guaranteeing its power consumption. Finally, we conduct extensive experiments on real-world data, comparing against state-of-the-art methods in terms of carbon emissions, transfer time, and power consumption.
Pages: 494 - 501
Page count: 8
Related papers
50 in total
  • [21] Carbon-aware distributed cloud: multi-level grouping genetic algorithm
    Fereydoun Farrahi Moghaddam
    Reza Farrahi Moghaddam
    Mohamed Cheriet
    Cluster Computing, 2015, 18 : 477 - 491
  • [22] Carbon-aware distributed cloud: multi-level grouping genetic algorithm
    Moghaddam, Fereydoun Farrahi
    Moghaddam, Reza Farrahi
    Cheriet, Mohamed
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2015, 18 (01): : 477 - 491
  • [23] DFARM: a deadline-aware fault-tolerant scheduler for cloud computing
    Awan, Ahmad
    Aleem, Muhammad
    Hussain, Altaf
    Prodan, Radu
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2024, 27 (07): : 9323 - 9344
  • [24] A fault-tolerant aware scheduling method for fog-cloud environments
    Alarifi, Abdulaziz
    Abdelsamie, Fathi
    Amoon, Mohammed
    PLOS ONE, 2019, 14 (10):
  • [25] Customer Satisfaction-aware Scheduling for Utility Maximization on Geo-distributed Cloud Data Centers
    Jing, Chao
    Zhu, Yanmin
    Li, Minglu
    2013 IEEE 15TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS & 2013 IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND UBIQUITOUS COMPUTING (HPCC_EUC), 2013, : 218 - 225
  • [26] Energy and carbon-aware initial VM placement in geographically distributed cloud data centers
    Khodayarseresht, Ehsan
    Shameli-Sendi, Alireza
    Fournier, Quentin
    Dagenais, Michel
    SUSTAINABLE COMPUTING-INFORMATICS & SYSTEMS, 2023, 39
  • [27] Quantum-secure fault-tolerant distributed cloud storage system
    Ma, Chun-Li
    Li, Dong-Dong
    Li, Yalin
    Wu, Yinghao
    Ding, Song-Yan
    Wang, Jun
    Li, Pei-Yuan
    Zhang, Song
    Chen, Junjie
    Zhang, Xiaoxing
    Wang, Jia-Yong
    Li, Jin
    Li, Qiang
    Chen, Zhi-Tong
    Zhou, Lei
    Zhao, Mei-Sheng
    Zhao, Yong
    AIP ADVANCES, 2023, 13 (11)
  • [28] Hybrid Deep Learning Framework for Privacy Preservation in Geo-Distributed Data Centre
    Nithyanantham, S.
    Singaravel, G.
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 32 (03): : 1905 - 1919
  • [29] Deep Reinforcement Learning based VNF Management in Geo-distributed Edge Computing
    Gu, Lin
    Zeng, Deze
    Li, Wei
    Guo, Song
    Zomaya, Albert Y.
    Jin, Hai
    2019 39TH IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS (ICDCS 2019), 2019, : 934 - 943
  • [30] Data locality optimization based on data migration and hotspots prediction in geo-distributed cloud environment
    Li, Chunlin
    Zhang, Jing
    Ma, Tao
    Tang, Hengliang
    Zhang, Lei
    Luo, Youlong
    KNOWLEDGE-BASED SYSTEMS, 2019, 165 : 321 - 334