A survey on failure prediction of large-scale server clusters

Cited by: 17
Authors
Xue, Zhenghua [1]
Dong, Xiaoshe [1]
Ma, Siyuan [1]
Dong, Weiqing [1]
Affiliations
[1] Xi An Jiao Tong Univ, Dept Comp Sci & Technol, Xian 710049, Peoples R China
DOI
10.1109/SNPD.2007.284
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As the size and complexity of cluster systems grow, failure rates accelerate dramatically. To reduce the damage caused by failures, it is desirable to identify potential failures before they occur. In this paper, we survey the state of the art in failure prediction for cluster systems. We describe the characteristics of failures in cluster systems and present some statistical results. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. We then analyze the main ideas of five prediction methods: statistics-based thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models. In addition, with regard to accuracy and practicality, we present five metrics for evaluating failure prediction techniques and compare the five techniques against these metrics.
Pages: 733-+
Page count: 2