Robust Scheduling for Large-Scale Distributed Systems

Cited by: 0

Authors:

Lee, Young Choon [1]

King, Jayden [1]

Kim, Young Ki [2]

Hong, Seok-Hee [2]

Affiliations:

[1] Macquarie Univ, Dept Comp, Sydney, NSW, Australia

[2] Univ Sydney, Sch Comp Sci, Sydney, NSW, Australia

Source:

2020 IEEE 19TH INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS (TRUSTCOM 2020) | 2020

Funding:

Australian Research Council

Keywords:

Robust scheduling; clouds; server failures; reliability

DOI:

10.1109/TrustCom50675.2020.00019

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

In large-scale distributed systems, such as clouds, failures are the norm rather than the exception. These failures include job failures, server failures, network outages and power failures. Among them, server failures are the most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than in traditional computer clusters, because jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize this impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, which consider the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked, while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether reliability is measured from the job's perspective or the server's perspective. Each of the four algorithms combines one availability check with one reliability check. We evaluate our scheduling algorithms with failures generated from our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.
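The way an availability check (FF or BF) can be combined with a reliability check may be easier to see in a small sketch. The Python below is only an illustration under assumptions and is not the paper's algorithm: the `Server` fields, the `max_failures` threshold rule, and the single-resource capacity model are all hypothetical, since the abstract does not specify them. The WJ check is not shown because the abstract gives no detail on how it is computed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Server:
    name: str
    free_cores: int      # assumed single-resource capacity model
    failure_count: int   # hypothetical per-server failure history


# A reliability check is modeled here as a predicate on a server (assumption).
ReliabilityCheck = Callable[[Server], bool]


def max_failures(limit: int) -> ReliabilityCheck:
    """Hypothetical server-side rule: accept servers with at most `limit` failures."""
    return lambda s: s.failure_count <= limit


def first_fit(servers: List[Server], demand: int,
              reliable: ReliabilityCheck) -> Optional[Server]:
    """Return the first server, in list order, that fits and passes the check."""
    for s in servers:
        if s.free_cores >= demand and reliable(s):
            return s
    return None


def best_fit(servers: List[Server], demand: int,
             reliable: ReliabilityCheck) -> Optional[Server]:
    """Return the fitting, reliable server that leaves the least spare capacity."""
    candidates = [s for s in servers if s.free_cores >= demand and reliable(s)]
    return min(candidates, key=lambda s: s.free_cores - demand, default=None)


if __name__ == "__main__":
    servers = [Server("a", 8, 5), Server("b", 16, 1), Server("c", 4, 0)]
    check = max_failures(2)
    print(first_fit(servers, 4, check))
    print(best_fit(servers, 4, check))
```

In this toy setup, first-fit stops at the first acceptable server while best-fit scans all of them to pack more tightly; swapping the predicate passed as `reliable` is what would distinguish one reliability policy from another.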


Pages: 38-45

Page count: 8

Total: 50 items

[21] A dependability layer for large-scale distributed systems
Cristea, Valentin
Dobre, C.
Pop, F.
Stratan, C.
Costan, A.
Leordeanu, C.
Tirsa, E.
INTERNATIONAL JOURNAL OF GRID AND UTILITY COMPUTING, 2011, 2 (02) : 109 - 118
[22] Failure detectors for large-scale distributed systems
Hayashibara, N
Cherif, A
Katayama, T
21ST IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 2002, : 404 - 409
[23] Energy efficiency in large-scale distributed systems
Tuan Anh Trinh
Hlavacs, Helmut
Talia, Domenico
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING AND ESCIENCE, 2012, 28 (05) : 743 - 744
[24] Stability of large-scale distributed parameter systems
Ladde, GS
Li, TT
DYNAMIC SYSTEMS AND APPLICATIONS, 2002, 11 (03) : 311 - 323
[25] Monitoring and control of large-scale distributed systems
Legrand, C.
GRID AND CLOUD COMPUTING: CONCEPTS AND PRACTICAL APPLICATIONS, 2016, 192 : 101 - 151
[26] Distributed Orchestration in Large-scale IoT Systems
Yigitoglu, Emre
Liu, Ling
Looper, Margaret
Pu, Calton
2017 IEEE 2ND INTERNATIONAL CONGRESS ON INTERNET OF THINGS (IEEE ICIOT), 2017, : 58 - 65
[27] Robustness of large-scale distributed computer systems
Khoroshevsky, VG
EUROSIM '96 - HPCN CHALLENGES IN TELECOMP AND TELECOM: PARALLEL SIMULATION OF COMPLEX SYSTEMS AND LARGE-SCALE APPLICATIONS, 1996, : 141 - 150
[28] Analysis of large-scale distributed information systems
Hellerstein, JL
Jayram, TS
Squillante, MS
8TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS, PROCEEDINGS, 2000, : 164 - 171
[29] Adaptation Engine for Large-Scale Distributed Systems
Nemes, Tania
COMPUTER AIDED SYSTEMS THEORY - EUROCAST 2015, 2015, 9520 : 244 - 251
[30] Legal reliability in large-scale distributed systems
Sommer, P
SEVENTEENTH IEEE SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 1998, : 416 - 421
