Robust Scheduling for Large-Scale Distributed Systems

Cited by: 0
Authors
Lee, Young Choon [1]
King, Jayden [1]
Kim, Young Ki [2]
Hong, Seok-Hee [2]
Affiliations
[1] Macquarie Univ, Dept Comp, Sydney, NSW, Australia
[2] Univ Sydney, Sch Comp Sci, Sydney, NSW, Australia
Funding
Australian Research Council
Keywords
Robust scheduling; clouds; server failures; reliability
DOI
10.1109/TrustCom50675.2020.00019
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In large-scale distributed systems, such as clouds, failures are the norm rather than the exception. These failures include job failures, server failures, network outages and power failures. Among them, server failures are the most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than in traditional computer clusters, as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize this impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, that consider the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked, while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether reliability is measured from the job's perspective or the server's perspective. All four algorithms are designed essentially by combining these availability and reliability check methods. We evaluate our scheduling algorithms with failures generated from our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.
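The abstract names first-fit and best-fit availability checks combined with a reliability check, but gives no algorithmic detail. The following is only a minimal sketch of how such a combination could look, not the paper's FAFF or FABF algorithms. The `Server` fields, the `max_failures` threshold, and the interpretation of "failure count" as a simple per-server counter are all assumptions made for illustration; the job-side (WJ) check is not modeled at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Server:
    name: str
    free_slots: int     # capacity currently available (hypothetical unit)
    failure_count: int  # failures observed so far (hypothetical reliability signal)


# A reliability check takes a server and says whether it may be used.
ReliabilityCheck = Callable[[Server], bool]


def max_failures(limit: int) -> ReliabilityCheck:
    """Hypothetical server-side check: accept servers with at most `limit` failures."""
    return lambda s: s.failure_count <= limit


def first_fit(servers: List[Server], demand: int,
              reliable: ReliabilityCheck) -> Optional[Server]:
    """Return the first server, in list order, that fits the job and passes the check."""
    for s in servers:
        if s.free_slots >= demand and reliable(s):
            return s
    return None


def best_fit(servers: List[Server], demand: int,
             reliable: ReliabilityCheck) -> Optional[Server]:
    """Return the fitting, reliable server that leaves the least spare capacity."""
    candidates = [s for s in servers if s.free_slots >= demand and reliable(s)]
    return min(candidates, key=lambda s: s.free_slots - demand, default=None)


# Illustrative data only; not taken from the paper's traces.
servers = [
    Server("s1", free_slots=8, failure_count=5),
    Server("s2", free_slots=6, failure_count=1),
    Server("s3", free_slots=4, failure_count=0),
]

ff_choice = first_fit(servers, demand=3, reliable=max_failures(2))
bf_choice = best_fit(servers, demand=3, reliable=max_failures(2))
```

With this illustrative data, both strategies skip `s1` because it exceeds the assumed failure threshold; first-fit then settles on `s2`, while best-fit prefers `s3` because it leaves less unused capacity.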
Pages: 38-45
Page count: 8