Reliability-aware resource management for computational grid/cluster environments

Cited by: 0
Authors
Limaye, K [1]
Leangsuksun, B [1]
Liu, YD [1]
Greenwood, Z [1]
Scott, SL [1]
Libby, R [1]
Chanchio, K [1]
Affiliations
[1] Louisiana Tech Univ, Ruston, LA 71270 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in an existing environment where job sites are Beowulf cluster systems, a single service node failure may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that clusters have become popular job sites in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions of some recent studies. We discuss various service availability issues related to the grid, and preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanism and the automated grid installation package. Our results indicate that, after implementing our proof-of-concept framework, the benefits outweigh the acceptable overhead.
Pages: 211 - 218
Number of pages: 8
Related papers
50 in total
  • [21] RATE: Reliability-Aware Task Service in Fog-Enabled IoV Environments
    Tiwari, Minu
    Maity, Ilora
    Misra, Sudip
    IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, 2024, 10 (04) : 1525 - 1534
  • [22] Reliability-aware multi-objective approach for predictive asset management: A Danish distribution grid case study
    Mirshekali, Hamid
    Mortensen, Lasse Kappel
    Shaker, Hamid Reza
    APPLIED ENERGY, 2024, 358
  • [23] Reliability-aware performance model for optimal GPU-enabled cluster environment
    Supada Laosooksathit
    Raja Nassar
    Chokchai Leangsuksun
    Mihaela Paun
    The Journal of Supercomputing, 2014, 68 : 1630 - 1651
  • [24] Reliability-Aware Ratioed Logic Operations for Energy-Efficient Computational ReRAM
    Fernandez, Carlos
    Vourkas, Ioannis
    PROCEEDINGS OF THE 2022 IFIP/IEEE 30TH INTERNATIONAL CONFERENCE ON VERY LARGE SCALE INTEGRATION (VLSI-SOC), 2022,
  • [25] Reliability-aware Fog Resource Provisioning for Deadline-driven IoT Services
    Yao, Jingjing
    Ansari, Nirwan
    2018 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2018,
  • [26] Reliability-aware performance model for optimal GPU-enabled cluster environment
    Laosooksathit, Supada
    Nassar, Raja
    Leangsuksun, Chokchai
    Paun, Mihaela
    JOURNAL OF SUPERCOMPUTING, 2014, 68 (03) : 1630 - 1651
  • [27] Latency and Reliability-Aware Task Offloading and Resource Allocation for Mobile Edge Computing
    Liu, Chen-Feng
    Bennis, Mehdi
    Poor, H. Vincent
    2017 IEEE GLOBECOM WORKSHOPS (GC WKSHPS), 2017,
  • [28] Reliability-Aware Design to Suppress Aging
    Amrouch, Hussam
    Khaleghi, Behnam
    Gerstlauer, Andreas
    Henkel, Joerg
    2016 ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC), 2016,
  • [29] Robust Reliability-aware Buffer Management for DTN Multicast in Disaster Scenarios
    Begerow, Peggy
    Krug, Silvia
    Schellenberg, Sebastian
    Seitz, Jochen
    2015 7TH INTERNATIONAL WORKSHOP ON RELIABLE NETWORKS DESIGN AND MODELING (RNDM) PROCEEDINGS, 2015, : 274 - 280
  • [30] Reliability-aware energy management for periodic real-time tasks
    Zhu, Dakai
    Aydin, Hakan
    RTAS 2007: 13TH REAL-TIME AND EMBEDDED TECHNOLOGY AND APPLICATIONS SYMPOSIUM, PROCEEDINGS, 2007, : 225 - +