Reliability-aware resource management for computational grid/cluster environments

Cited by: 0

Authors:

Limaye, K [1]

Leangsuksun, B [1]

Liu, YD [1]

Greenwood, Z [1]

Scott, SL [1]

Libby, R [1]

Chanchio, K [1]

Affiliation:

[1] Louisiana Tech Univ, Ruston, LA 71270 USA

Source:

2005 6TH INTERNATIONAL WORKSHOP ON GRID COMPUTING (GRID) | 2005

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in an existing environment where job sites are Beowulf cluster systems, a service node failure may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions pursued in some recent studies. We discuss various service availability issues related to the grid, and preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanism and the automated grid installation package. Our report shows that, after implementing our proof-of-concept framework, the benefits outweigh the overhead, which remains acceptable.


Pages: 211-218

Page count: 8

