Model-based Fault Localization: Finding Behavioral Outliers in Large-scale Computing Systems

Cited by: 2
Authors
Maruyama, Naoya [1]
Matsuoka, Satoshi [1, 2]
Affiliations
[1] Tokyo Inst Technol, Global Sci Informat & Comp Ctr GSIC, Meguro Ku, Tokyo 1528550, Japan
[2] Res Org Informat & Syst, Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Keywords
Distributed Systems; Fault Localization; Performance
DOI
10.1007/s00354-009-0088-6
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We present a model-based approach to fault localization that aims to help the human analyst narrow manual localization down to a small fraction of the overall system. Our method consists of two parts: pre-failure model derivation and post-failure model-based anomaly detection. The first part collects function-call traces from all processes and derives an execution model that reflects the function-calling behavior of the target system. When a failure occurs, the second part identifies the most deviant behaviors in the failed run by comparing the failure traces with the derived model. We claim that the analyst can substantially reduce the burden of fault localization by prioritizing such behaviors. Our preliminary experiment with a distributed job manager supports this claim: our method narrows localization of a 70-second faulty run on a 78-node distributed platform down to sub-second behaviors involving only two nodes.
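The two-part workflow in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the form of the execution model, so the sketch assumes a very simple one (counts of consecutive function-call pairs) and a simple deviation score (the fraction of call pairs never seen before the failure). The function names `derive_model`, `deviation`, and `rank_outliers`, along with the node names and call names, are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Trace = List[str]  # one process's ordered function-call names


def transitions(trace: Trace) -> List[Tuple[str, str]]:
    """Return the consecutive call pairs in a trace."""
    return list(zip(trace, trace[1:]))


def derive_model(normal_traces: List[Trace]) -> Counter:
    """Pre-failure step: count each call pair across normal traces.

    Assumption: a call-pair frequency table stands in for the
    execution model, whose real form the abstract does not give.
    """
    model: Counter = Counter()
    for trace in normal_traces:
        model.update(transitions(trace))
    return model


def deviation(trace: Trace, model: Counter) -> float:
    """Fraction of call pairs in the trace that the model never saw."""
    pairs = transitions(trace)
    if not pairs:
        return 0.0
    unseen = sum(1 for pair in pairs if model[pair] == 0)
    return unseen / len(pairs)


def rank_outliers(failed: Dict[str, Trace], model: Counter) -> List[Tuple[str, float]]:
    """Post-failure step: order nodes from most to least deviant."""
    scores = [(node, deviation(trace, model)) for node, trace in failed.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical traces, invented purely for illustration.
normal_runs = [
    ["recv", "dispatch", "run", "send"],
    ["recv", "dispatch", "run", "send"],
]
failed_run = {
    "node01": ["recv", "dispatch", "run", "send"],
    "node02": ["recv", "dispatch", "retry", "retry", "abort"],
}

model = derive_model(normal_runs)
ranking = rank_outliers(failed_run, model)
print(ranking)
```

In this toy input, `node02` ranks first because three of its four call pairs never occur in the normal runs, which is the kind of prioritization the abstract describes handing to the analyst.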
Pages: 237-255
Page count: 19