Robust Scheduling for Large-Scale Distributed Systems

Authors
Lee, Young Choon [1]
King, Jayden [1]
Kim, Young Ki [2]
Hong, Seok-Hee [2]
Affiliations
[1] Macquarie Univ, Dept Comp, Sydney, NSW, Australia
[2] Univ Sydney, Sch Comp Sci, Sydney, NSW, Australia
Funding
Australian Research Council
Keywords
Robust scheduling; clouds; server failures; reliability
DOI
10.1109/TrustCom50675.2020.00019
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In large-scale distributed systems, such as clouds, failures are the norm rather than the exception. These failures include job failures, server failures, network outages and power failures. Among them, server failures are the most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than in traditional computer clusters because jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize this impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, that consider the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked, while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether reliability is measured from the job's perspective or the server's perspective. Each of the four algorithms combines one availability check with one reliability check. We evaluate our scheduling algorithms with failures generated from our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.
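The abstract names the algorithms but does not define them, so the following is only an illustrative sketch of how a first-fit or best-fit availability check might be combined with a failure-count style reliability filter. It is not the paper's method: the `Server` fields, the slot-based capacity model, the `max_failures` threshold, and the `pick_server` function are all hypothetical, and the WJ (Waiting Job) variant is left out because the abstract gives no detail on how job-side reliability is measured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    free_slots: int      # hypothetical capacity measure (assumption)
    failure_count: int   # hypothetical count of past failures (assumption)


def pick_server(servers: List[Server], demand: int,
                max_failures: int, best_fit: bool) -> Optional[Server]:
    """Return a server for a job needing `demand` slots, or None if none qualifies."""
    # Reliability check from the server's perspective: skip servers whose
    # failure count exceeds the (assumed) threshold.
    # Availability check: the server must have enough free slots.
    candidates = [s for s in servers
                  if s.failure_count <= max_failures and s.free_slots >= demand]
    if not candidates:
        return None
    if best_fit:
        # Best-fit: choose the candidate that leaves the least spare capacity.
        return min(candidates, key=lambda s: s.free_slots - demand)
    # First-fit: choose the first candidate in list order.
    return candidates[0]


servers = [
    Server("s1", free_slots=8, failure_count=0),
    Server("s2", free_slots=4, failure_count=5),
    Server("s3", free_slots=3, failure_count=1),
]
first = pick_server(servers, demand=2, max_failures=2, best_fit=False)
best = pick_server(servers, demand=2, max_failures=2, best_fit=True)
print(first.name, best.name)  # prints "s1 s3"
```

In this toy setup, `s2` is excluded by the reliability filter, first-fit takes `s1` because it comes first, and best-fit takes `s3` because it leaves the fewest slots unused.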
Pages: 38-45
Page count: 8