Reliability-aware resource management for computational grid/cluster environments

Cited by: 0
Authors
Limaye, K [1 ]
Leangsuksun, B [1 ]
Liu, YD [1 ]
Greenwood, Z [1 ]
Scott, SL [1 ]
Libby, R [1 ]
Chanchio, K [1 ]
Affiliations
[1] Louisiana Tech Univ, Ruston, LA 71270 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in an existing environment where job sites are Beowulf cluster systems, a service node failure may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions proposed in some recent studies. We discuss various service availability issues related to the grid, and preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanism and the automated grid installation package. Our results indicate that the benefits outweigh the acceptable overhead of our proof-of-concept framework.
Pages: 211-218
Number of pages: 8
Related papers
50 in total
  • [31] The case for lifetime reliability-aware microprocessors
    Srinivasan, J
    Adve, SV
    Bose, P
    Rivers, JA
    31ST ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS, 2004, : 276 - 287
  • [32] Reliability-aware probabilistic reserve procurement
    Herre, Lars
    Pinson, Pierre
    Chatzivasileiadis, Spyros
    ELECTRIC POWER SYSTEMS RESEARCH, 2022, 212
  • [33] Reliability-Aware Energy Management for Periodic Real-Time Tasks
    Zhu, Dakai
    Aydin, Hakan
    IEEE TRANSACTIONS ON COMPUTERS, 2009, 58 (10) : 1382 - 1397
  • [34] Lifetime Reliability-Aware Digital Synthesis
    Duan, Shengyu
    Zwolinski, Mark
    Halak, Basel
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2018, 26 (11) : 2205 - 2216
  • [35] Instruction Scheduling for Reliability-Aware Compilation
    Rehman, Semeen
    Shafique, Muhammad
    Henkel, Joerg
    2012 49TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2012, : 1288 - 1296
  • [36] Reliability-Aware Optimization of a Wideband Antenna
    Kouassi, Attibaud
    Nghia Nguyen-Trong
    Kaufmann, Thomas
    Lallechere, Sebastien
    Bonnet, Pierre
    Fumeaux, Christophe
    IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, 2016, 64 (02) : 450 - 460
  • [37] Resource-aware distributed scheduling strategies for large-scale computational Cluster/Grid systems
    Viswanathan, Sivakumar
    Veeravalli, Bharadwaj
    Robertazzi, Thomas G.
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2007, 18 (10) : 1450 - 1461
  • [38] Global Reliability-Aware Power Management for Multiprocessor Real-Time Systems
    Qi, Xuan
    Zhu, Dakai
    Aydin, Hakan
    16TH IEEE INTERNATIONAL CONFERENCE ON EMBEDDED AND REAL-TIME COMPUTING SYSTEMS AND APPLICATIONS (RTCSA 2010), 2010, : 183 - 192
  • [39] PowerPlanningDL: Reliability-Aware Framework for On-Chip Power Grid Design using Deep Learning
    Dey, Sukanta
    Nandi, Sukumar
    Trivedi, Gaurav
    PROCEEDINGS OF THE 2020 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2020), 2020, : 1520 - 1525
  • [40] Reliability-Aware Requirements Development for Autonomy Software
    Meshkat, Leila
    Magnusson, Gudjon
    Diep, Madeline
    Lindvall, Mikael
    2022 68TH ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM (RAMS 2022), 2022,