Understanding and Handling Alert Storm for Online Service Systems

Cited by: 40
Authors
Zhao, Nengwen [1 ,2 ,6 ]
Chen, Junjie [3 ]
Peng, Xiao [4 ]
Wang, Honglin [5 ]
Wu, Xinya [5 ]
Zhang, Yuanzong [5 ]
Chen, Zikai [1 ,2 ]
Zheng, Xiangzhong [5 ]
Nie, Xiaohui [1 ,2 ]
Wang, Gang [4 ]
Wu, Yong [4 ]
Zhou, Fang [4 ]
Zhang, Wenchi [5 ]
Sui, Kaixin [5 ]
Pei, Dan [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] BNRist, Beijing, Peoples R China
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[4] China EverBright Bank, Beijing, Peoples R China
[5] BizSeer, Beijing, Peoples R China
[6] BNRist Beijing Natl Res Ctr Informat Sci & Techno, Beijing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Alert Storm; Alert Summary; Problem Identification; Failure Diagnosis;
DOI
10.1145/3377813.3381363
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Alerts are a key data source in the monitoring of online service systems: they record anomalies in service components and report them to engineers. A service failure is typically accompanied by a large number of alerts, a phenomenon known as an alert storm. Alert storms make failure diagnosis challenging, because manually investigating such an overwhelming number of alerts is time-consuming and tedious for engineers. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings of this study, we propose a novel approach to handling alert storms. Specifically, the approach comprises alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). In addition, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China EverBright Bank), and we also share our success stories and lessons learned in industry.
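The two stages named in the abstract (detection, then summary) and its two reported metrics (F1-score and reduction in alerts to examine) can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not describe: the fixed time window, the count threshold, the grouping key, and the alert fields `time`, `service`, and `type` are all hypothetical choices made only for illustration.

```python
from collections import Counter


def detect_storm_windows(timestamps, window_s=60, threshold=100):
    """Return start times of fixed windows whose alert count exceeds threshold.

    Assumption: a plain count threshold stands in for whatever detector
    the paper actually uses.
    """
    counts = Counter(int(t // window_s) * window_s for t in timestamps)
    return sorted(start for start, n in counts.items() if n > threshold)


def summarize(alerts, key=lambda a: (a["service"], a["type"])):
    """Keep the earliest alert of each group as its representative.

    Assumption: grouping by (service, type) is a hypothetical stand-in
    for the paper's notion of representative alerts.
    """
    representatives = {}
    for alert in sorted(alerts, key=lambda a: a["time"]):
        representatives.setdefault(key(alert), alert)
    return list(representatives.values())


def reduction_rate(n_total, n_kept):
    """Fraction of alerts an engineer no longer needs to examine."""
    return 1 - n_kept / n_total if n_total else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a storm of 200 alerts that collapses into 2 groups gives `reduction_rate(200, 2)` of 0.99, which is the kind of quantity behind the "more than 98%" figure quoted above.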
Pages: 162 / 171
Page count: 10