Model-based Fault Localization: Finding Behavioral Outliers in Large-scale Computing Systems

Cited by: 2
Authors
Maruyama, Naoya [1]
Matsuoka, Satoshi [1, 2]
Affiliations
[1] Tokyo Inst Technol, Global Sci Informat & Comp Ctr GSIC, Meguro Ku, Tokyo 1528550, Japan
[2] Res Org Informat & Syst, Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
Distributed Systems; Fault Localization; Performance
DOI
10.1007/s00354-009-0088-6
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present a model-based approach to fault localization that aims to help the human analyst narrow down the manual localization into a small fraction of the overall system. Our method consists of two parts: pre-failure model derivation and post-failure model-based anomaly detection. The first part collects function-call traces from all processes and derives an execution model that reflects the function-calling behaviors of the target system. When a failure occurs, we identify the most deviant behaviors in the failed run by comparing the failure traces with the derived model. We claim that the analyst can substantially reduce the burden of fault localization by prioritizing such behaviors. Our preliminary experiment with a distributed job manager supports this claim: Our method narrows down localization of a 70-second faulty run on a 78-node distributed platform into just sub-second behaviors involving only two nodes.
Pages: 237-255
Page count: 19
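
The two-part workflow described in the abstract (derive a model from pre-failure function-call traces, then rank behaviors in the failed run by how far they deviate from it) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the model, so the sketch assumes a deliberately simple one, namely the set of consecutive call pairs seen in normal runs. The function names `derive_model`, `deviance`, and `rank_outliers`, the node ids, and the sample traces are all hypothetical.

```python
from collections import Counter


def derive_model(normal_traces):
    """Count consecutive function-call pairs across all pre-failure traces.

    Assumption: a bigram count stands in for the paper's execution model.
    """
    model = Counter()
    for trace in normal_traces:
        model.update(zip(trace, trace[1:]))
    return model


def deviance(trace, model):
    """Return the fraction of call pairs in `trace` never seen in the model."""
    pairs = list(zip(trace, trace[1:]))
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if pair not in model)
    return unseen / len(pairs)


def rank_outliers(failure_traces, model):
    """Sort (process id, score) pairs from most to least deviant."""
    scores = {pid: deviance(trace, model) for pid, trace in failure_traces.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical traces: each is the ordered list of functions one process called.
normal = [
    ["recv", "parse", "dispatch", "send"],
    ["recv", "parse", "dispatch", "send"],
]
failed = {
    "node-a": ["recv", "parse", "dispatch", "send"],
    "node-b": ["recv", "parse", "retry", "retry", "send"],
}

model = derive_model(normal)
ranking = rank_outliers(failed, model)
for pid, score in ranking:
    print(f"{pid}: {score:.2f}")
```

With these sample traces, `node-b` ranks first because three of its four call pairs never occur in the normal runs, which mirrors the idea of pointing the analyst at a small set of processes first. A realistic model would also need to account for timing and call frequency, which this sketch ignores.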