A survey on failure prediction of large-scale server clusters

Cited by: 17
Authors
Xue, Zhenghua [1]
Dong, Xiaoshe [1]
Ma, Siyuan [1]
Dong, Weiqing [1]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Peoples R China
Keywords
DOI
10.1109/SNPD.2007.284
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As the size and complexity of cluster systems grow, failure rates rise dramatically. To reduce the damage caused by failures, it is desirable to identify potential failures before they occur. In this paper, we survey the state of the art in failure prediction for cluster systems. We describe the characteristics of failures in cluster systems and present some statistical results. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. We then analyze the main ideas of five prediction methods: statistics-based thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models. In addition, with regard to accuracy and practicality, we present five metrics for evaluating failure prediction techniques and compare the five methods against these metrics.
Pages: 733 / +
Page count: 2
Related papers
50 in total
  • [21] A Review of Resource Scheduling in Large-Scale Server Cluster
    He, Libo
    Qiang, Zhenping
    Zhou, Wei
    Yao, Shaowen
    KNOWLEDGE MANAGEMENT IN ORGANIZATIONS (KMO 2017), 2017, 731 : 494 - 505
  • [22] A novel management architecture for large-scale server cluster
    Xue, Zhenghua
    Dong, Xiaoshe
    Fan, Shengqun
    2008 IEEE 8TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY, VOLS 1 AND 2, 2008, : 273 - 278
  • [23] Flying Communication Server in case of a Large-scale Disaster
    Kobayashi, Toru
    Matsuoka, Hiroaki
    Betsumiya, Shouta
    PROCEEDINGS 2016 IEEE 40TH ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE WORKSHOPS (COMPSAC), VOL 2, 2016, : 571 - 576
  • [24] ESIR: A Deployment System for Large-scale Server Cluster
    Xue, Zhenghua
    Dong, Xiaoshe
    Li, Junyang
    Tian, Hongbo
    GCC 2008: SEVENTH INTERNATIONAL CONFERENCE ON GRID AND COOPERATIVE COMPUTING, PROCEEDINGS, 2008, : 563 - 569
  • [25] Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters
    Wu, LF
    Hughes, TR
    Davierwala, AP
    Robinson, MD
    Stoughton, R
    Altschuler, SJ
    NATURE GENETICS, 2002, 31 (03) : 255 - 265
  • [26] Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters
    Lani F. Wu
    Timothy R. Hughes
    Armaity P. Davierwala
    Mark D. Robinson
    Roland Stoughton
    Steven J. Altschuler
    Nature Genetics, 2002, 31 : 255 - 265
  • [27] Failure Prediction for Large-scale Water Pipe Networks Using GNN and Temporal Failure Series
    Liang, Shuming
    Li, Zhidong
    Liang, Bin
    Ding, Yu
    Wang, Yang
    Chen, Fang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3955 - 3964
  • [28] Network Control for Large-Scale Container Clusters
    Zhang, Weiqi
    Wang, Baosheng
    Deng, Wenping
    Zeng, Hao
    WIRELESS ALGORITHMS, SYSTEMS, AND APPLICATIONS (WASA 2018), 2018, 10874 : 827 - 833