Robust Scheduling for Large-Scale Distributed Systems

Times cited: 0

Authors:

Lee, Young Choon [1]

King, Jayden [1]

Kim, Young Ki [2]

Hong, Seok-Hee [2]

Affiliations:

[1] Macquarie Univ, Dept Comp, Sydney, NSW, Australia

[2] Univ Sydney, Sch Comp Sci, Sydney, NSW, Australia

Source:

2020 IEEE 19TH INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (TRUSTCOM 2020) | 2020

Funding:

Australian Research Council

Keywords:

Robust scheduling; clouds; server failures; reliability

DOI:

10.1109/TrustCom50675.2020.00019

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Subject classification code:

0812

Abstract:

In large-scale distributed systems, such as clouds, failures are the norm rather than the exception. These failures include job failures, server failures, network outages and power failures. Among them, server failures are the most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than in traditional computer clusters, as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize this impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, that consider the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked, while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether reliability is measured from the job's perspective or the server's perspective. Each of the four algorithms combines one availability check with one reliability check. We evaluate our scheduling algorithms with failures generated from our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.
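To make the First-Fit / Best-Fit distinction in the abstract concrete, the following is a minimal Python sketch of the two generic placement rules combined with a simple reliability filter. It is not the paper's FAFF or FABF algorithms, whose details the abstract does not give. The `Server` fields, the `failure_aware` helper and the `max_failures` threshold are all hypothetical names and assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Server:
    name: str
    free_slots: int     # capacity currently available (hypothetical unit)
    failure_count: int  # failures observed so far (hypothetical metric)


def first_fit(servers: List[Server], demand: int) -> Optional[Server]:
    """Return the first server, in list order, with enough free slots."""
    for s in servers:
        if s.free_slots >= demand:
            return s
    return None


def best_fit(servers: List[Server], demand: int) -> Optional[Server]:
    """Return the feasible server that leaves the least spare capacity."""
    feasible = [s for s in servers if s.free_slots >= demand]
    return min(feasible, key=lambda s: s.free_slots - demand, default=None)


def failure_aware(
    select: Callable[[List[Server], int], Optional[Server]],
    servers: List[Server],
    demand: int,
    max_failures: int,
) -> Optional[Server]:
    """Run an availability check only on servers that pass a reliability filter.

    The threshold-based filter is an assumption for this sketch, not the
    reliability measure used in the paper.
    """
    reliable = [s for s in servers if s.failure_count <= max_failures]
    return select(reliable, demand)


servers = [
    Server("a", free_slots=8, failure_count=5),
    Server("b", free_slots=6, failure_count=1),
    Server("c", free_slots=4, failure_count=0),
]
ff = failure_aware(first_fit, servers, demand=3, max_failures=2)
bf = failure_aware(best_fit, servers, demand=3, max_failures=2)
```

In this toy setup, server `a` is filtered out for exceeding the failure threshold. First-Fit then takes `b`, the first remaining server with room, while Best-Fit takes `c`, the one that leaves the least unused capacity.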

Pages: 38-45

Page count: 8
