Root Cause Analysis of Failures in Microservices through Causal Discovery

Times cited: 0
Authors
Ikram, Azam [1 ]
Chakraborty, Sarthak [2 ]
Mitra, Subrata [2 ]
Saini, Shiv Kumar [2 ]
Bagchi, Saurabh [1 ]
Kocaoglu, Murat [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Adobe Res, Mountain View, CA USA
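Source
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022 | 2022
Funding
National Science Foundation (USA);
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract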
Most cloud applications use a large number of smaller sub-components (called microservices) that interact with each other in the form of a complex graph to provide the overall functionality to the user. While the modularity of the microservice architecture is beneficial for rapid software development, maintaining and debugging such a system quickly in cases of failure is challenging. We propose a scalable algorithm for rapidly detecting the root cause of failures in complex microservice architectures. The key ideas behind our novel hierarchical and localized learning approach are: (1) to treat the failure as an intervention on the root cause in order to detect it quickly, (2) to learn only the portion of the causal graph related to the root cause, thus avoiding a large number of costly conditional independence tests, and (3) to explore the graph hierarchically. The proposed technique is highly scalable and produces useful insights about the root cause, whereas traditional techniques become infeasible due to high computation time. Our solution is application agnostic and relies only on the data collected for diagnosis. For the evaluation, we compare the proposed solution with a modified version of the PC algorithm and with the state of the art for root cause analysis. The results show a considerable improvement in top-k recall while significantly reducing the execution time.
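
Ideas (1) and (2) in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a synthetic linear chain of three metrics A -> B -> C, a failure that shifts B, and a crude threshold on (partial) correlation standing in for a proper conditional independence test. The names `simulate`, `root_cause_candidates`, and `threshold` are hypothetical. The failure is modeled as a binary indicator F (0 for normal data, 1 for failure data), and only dependencies involving F are tested rather than the full causal graph. The hierarchical exploration of idea (3) is not shown.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly controlling for a single z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def root_cause_candidates(normal, failure, threshold=0.1):
    """Return metrics whose dependence on the failure indicator F cannot be
    removed by conditioning on any other single metric (illustrative rule)."""
    names = list(normal)
    data = {k: normal[k] + failure[k] for k in names}
    n0, n1 = len(normal[names[0]]), len(failure[names[0]])
    f = [0.0] * n0 + [1.0] * n1  # F = 1 marks samples collected during the failure
    candidates = []
    for x in names:
        if abs(pearson(data[x], f)) < threshold:
            continue  # metric looks unaffected by the failure
        separated = any(
            abs(partial_corr(data[x], f, data[z])) < threshold
            for z in names if z != x
        )
        if not separated:
            candidates.append(x)
    return candidates


def simulate(n, shift, rng):
    """Hypothetical chain A -> B -> C; `shift` is the intervention on B."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [ai + rng.gauss(0, 1) + shift for ai in a]
    c = [bi + rng.gauss(0, 1) for bi in b]
    return {"A": a, "B": b, "C": c}


rng = random.Random(0)
normal = simulate(2000, 0.0, rng)   # data from normal operation
failure = simulate(2000, 3.0, rng)  # data after the failure shifts B
result = root_cause_candidates(normal, failure)
print(result)
```

In this toy setup, C changes during the failure only through B, so conditioning on B removes its dependence on F, while A does not change at all; B is the only metric left as a candidate. With real telemetry, a statistical conditional independence test and larger conditioning sets would be needed in place of the fixed threshold.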
Pages: 13