Predicting Job Failures in AuverGrid Based on Workload Log Analysis

Cited by: 9
Authors
Saadatfar, Hamid [1]
Fadishei, Hamid [1]
Deldari, Hossein [1]
Affiliations
[1] Ferdowsi Univ Mashhad, Parallel & Distributed Proc Lab, Dept Comp Engn, Mashhad, Iran
Keywords
Job Failure Prediction; Grid Workload Archive; Trace Analysis; Bayesian Networks
DOI
10.1007/s00354-012-0105-z
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Grid systems are popular today because of their ability to solve large problems in business and science. Job failures, which are inherent in any computational environment, are more common in grids because of their dynamic and complex nature. Furthermore, traditional methods of job failure recovery have proven costly, so such systems need to shift toward proactive, predictive management strategies. This paper makes an innovative effort to predict the outcome of jobs in a production grid environment. First, we investigated the relationship between workload characteristics and job failures by analyzing workload traces of AuverGrid, which is part of the EGEE (Enabling Grids for E-science) project. After failure patterns were identified, the success or failure status of jobs during 6 months of AuverGrid activity was predicted with approximately 96% accuracy. The quality of service on the grid can be improved by integrating the results of this work into management services such as scheduling and monitoring.
Pages: 73-94
Number of pages: 22
Related papers
  • [1] Predicting Job Failures in AuverGrid Based on Workload Log Analysis
    Hamid Saadatfar
    Hamid Fadishei
    Hossein Deldari
    New Generation Computing, 2012, 30 : 73 - 94
  • [2] Analyzing and predicting job failures from HPC system log
    Ju-Won Park
    Xin Huang
    Chul-Ho Lee
    The Journal of Supercomputing, 2024, 80 : 435 - 462
  • [4] Job failure prediction in Hadoop based on log file analysis
    Shirzad E.
    Saadatfar H.
    International Journal of Computers and Applications, 2022, 44 (03) : 260 - 269
  • [5] PREDICTING FAILURES WITH VIBRATION ANALYSIS
    HINES, GE
    INSTRUMENTS & CONTROL SYSTEMS, 1978, 51 (07): : 31 - 35
  • [6] PREDICTING FROM EARLY FAILURES THE LAST FAILURE TIME OF A (LOG) NORMAL SAMPLE
    SCHMEE, J
    NELSON, W
    IEEE TRANSACTIONS ON RELIABILITY, 1979, 28 (01) : 23 - 26
  • [7] Machine Learning for Predicting Infrastructure Faults and Job Failures in Clouds: A Survey
    Shayesteh, Behshid
    Ebrahimzadeh, Amin
    Glitho, Roch
    IEEE COMMUNICATIONS MAGAZINE, 2025, 63 (01) : 148 - 154
  • [8] Workload based order acceptance in job shop environments
    Ebben, MJR
    Hans, EW
    Olde Weghuis, FM
    OR SPECTRUM, 2005, 27 (01) : 107 - 122
  • [10] PRODUCTION LOG DATA ANALYSIS FOR REJECT RATE PREDICTION AND WORKLOAD ESTIMATION
    Pfeiffer, Andras
    Gyulai, David
    Szaller, Adam
    Monostori, Laszlo
    2018 WINTER SIMULATION CONFERENCE (WSC), 2018, : 3364 - 3374