Reliability-aware resource management for computational grid/cluster environments

Cited by: 0
Authors
Limaye, K [1 ]
Leangsuksun, B [1 ]
Liu, YD [1 ]
Greenwood, Z [1 ]
Scott, SL [1 ]
Libby, R [1 ]
Chanchio, K [1 ]
Affiliations
[1] Louisiana Tech Univ, Ruston, LA 71270 USA
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in existing environments where job sites are Beowulf cluster systems, the failure of a service node may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. There is therefore a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid, and preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanisms and the automated grid installation package. Our results indicate that the benefits of the proof-of-concept framework outweigh its overhead, which remains acceptable.
Pages: 211-218
Number of pages: 8