Model-based Fault Localization: Finding Behavioral Outliers in Large-scale Computing Systems

Cited by: 2
Authors
Maruyama, Naoya [1]
Matsuoka, Satoshi [1, 2]
Affiliations
[1] Tokyo Inst Technol, Global Sci Informat & Comp Ctr GSIC, Meguro Ku, Tokyo 1528550, Japan
[2] Res Org Informat & Syst, Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
Distributed Systems; Fault Localization; Performance
DOI
10.1007/s00354-009-0088-6
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present a model-based approach to fault localization that aims to help the human analyst narrow down the manual localization into a small fraction of the overall system. Our method consists of two parts: pre-failure model derivation and post-failure model-based anomaly detection. The first part collects function-call traces from all processes and derives an execution model that reflects the function-calling behaviors of the target system. When a failure occurs, we identify the most deviant behaviors in the failed run by comparing the failure traces with the derived model. We claim that the analyst can substantially reduce the burden of fault localization by prioritizing such behaviors. Our preliminary experiment with a distributed job manager supports this claim: Our method narrows down localization of a 70-second faulty run on a 78-node distributed platform into just sub-second behaviors involving only two nodes.
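The two-part workflow in the abstract (derive a model from pre-failure function-call traces, then rank behaviors in the failed run by how far they deviate from it) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say what form the execution model takes, so this sketch substitutes a simple call-transition frequency model with additive smoothing, and the names `CallModel`, `derive_model`, `transition_surprise`, `rank_nodes`, the node labels, and the function names in the traces are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class CallModel:
    """Hypothetical stand-in for the execution model: call-transition counts."""
    counts: dict      # counts[prev][nxt] = times nxt followed prev in normal runs
    vocab_size: int   # distinct functions seen in normal runs, plus one for "unseen"


def derive_model(normal_traces):
    """Pre-failure step: count consecutive-call transitions over all traces."""
    counts = defaultdict(Counter)
    functions = set()
    for trace in normal_traces:
        functions.update(trace)
        for prev, nxt in zip(trace, trace[1:]):
            counts[prev][nxt] += 1
    return CallModel(counts=dict(counts), vocab_size=len(functions) + 1)


def transition_surprise(model, prev, nxt, alpha=1.0):
    """Negative log-probability of one transition under add-alpha smoothing."""
    seen = model.counts.get(prev, Counter())
    total = sum(seen.values())
    prob = (seen.get(nxt, 0) + alpha) / (total + alpha * model.vocab_size)
    return -math.log(prob)


def rank_nodes(model, failure_traces):
    """Post-failure step: sort nodes by mean transition surprise, highest first."""
    scores = {}
    for node, trace in failure_traces.items():
        pairs = list(zip(trace, trace[1:]))
        if pairs:
            total = sum(transition_surprise(model, p, n) for p, n in pairs)
            scores[node] = total / len(pairs)
        else:
            scores[node] = 0.0  # a trace with fewer than two calls has no transitions
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative data only: twenty identical normal traces, then a failed run
# in which one node follows an unfamiliar call sequence.
normal_traces = [["recv", "parse", "dispatch", "send"]] * 20
failure_traces = {
    "node01": ["recv", "parse", "dispatch", "send"],
    "node02": ["recv", "parse", "retry", "retry", "send"],
}

model = derive_model(normal_traces)
ranked = rank_nodes(model, failure_traces)
for node, score in ranked:
    print(f"{node}: mean surprise {score:.3f}")
```

In this toy run the node whose trace contains transitions never observed before the failure is listed first, which mirrors the idea of pointing the analyst at the most deviant behaviors. A per-node mean is only one possible granularity; scoring short time windows instead would be closer to the sub-second localization the abstract reports, but that design is likewise an assumption here.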
Pages: 237-255
Page count: 19