A survey on failure prediction of large-scale server clusters

Cited by: 17
Authors
Xue, Zhenghua [1 ]
Dong, Xiaoshe [1 ]
Ma, Siyuan [1 ]
Dong, Weiqing [1 ]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Peoples R China
DOI
10.1109/SNPD.2007.284
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As the size and complexity of cluster systems grow, failure rates rise dramatically. To reduce the damage caused by failures, it is desirable to identify potential failures before they occur. In this paper, we survey the state of the art in failure prediction for cluster systems. The characteristics of failures in cluster systems are addressed, and some statistical results are presented. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. Focusing on their main ideas, we analyze five prediction methods: statistics-based thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models. In addition, with regard to accuracy and practicality, we present five metrics for evaluating failure prediction techniques and compare the five techniques against these metrics.
Pages: 733 / +
Page count: 2