Understanding and Handling Alert Storm for Online Service Systems

Cited by: 40
Authors
Zhao, Nengwen [1 ,2 ,6 ]
Chen, Junjie [3 ]
Peng, Xiao [4 ]
Wang, Honglin [5 ]
Wu, Xinya [5 ]
Zhang, Yuanzong [5 ]
Chen, Zikai [1 ,2 ]
Zheng, Xiangzhong [5 ]
Nie, Xiaohui [1 ,2 ]
Wang, Gang [4 ]
Wu, Yong [4 ]
Zhou, Fang [4 ]
Zhang, Wenchi [5 ]
Sui, Kaixin [5 ]
Pei, Dan [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] BNRist, Beijing, Peoples R China
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[4] China EverBright Bank, Beijing, Peoples R China
[5] BizSeer, Beijing, Peoples R China
[6] BNRist Beijing Natl Res Ctr Informat Sci & Techno, Beijing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Alert Storm; Alert Summary; Problem Identification; Failure Diagnosis;
DOI
10.1145/3377813.3381363
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Alerts are a key data source in monitoring systems for online service systems; they record anomalies in service components and report them to engineers. In general, a service failure tends to be accompanied by a large number of alerts, which is called an alert storm. However, an alert storm makes failure diagnosis very challenging, because it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings of the study, we propose a novel approach to handling alert storms. Specifically, this approach includes alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection can achieve a high F1-score (larger than 0.9). In addition, our alert storm summary can reduce the number of alerts that need to be examined by more than 98% and discover representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China EverBright Bank), and we also share our success stories and lessons learned in industry.
Pages: 162 - 171
Page count: 10
Related papers
50 items in total
  • [31] On the Alert for Cytokine Storm: Immunopathology in COVID-19
    Henderson, Lauren A.
    Canna, Scott W.
    Schulert, Grant S.
    Volpi, Stefano
    Lee, Pui Y.
    Kernan, Kate F.
    Caricchio, Roberto
    Mahmud, Shawn
    Hazen, Melissa M.
    Halyabar, Olha
    Hoyt, Kacie J.
    Han, Joseph
    Grom, Alexei A.
    Gattorno, Marco
    Ravelli, Angelo
    De Benedetti, Fabrizio
    Behrens, Edward M.
    Cron, Randy Q.
    Nigrovic, Peter A.
    ARTHRITIS & RHEUMATOLOGY, 2020, 72 (07) : 1059 - 1063
  • [32] Understanding Persuasion Cascades in Online Product Rating Systems
    Xie, Hong
    Li, Yongkun
    Lui, John C. S.
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019 : 5490 - 5497
  • [33] Understanding the online therapeutic alliance through the eyes of adolescent service users
    Hanley, Terry
    COUNSELLING & PSYCHOTHERAPY RESEARCH, 2012, 12 (01) : 35 - 43
  • [34] Understanding Physicians' Motivation to Provide Healthcare Service Online in the Digital Age
    Zhang, Tingting
    Chen, Qin
    Wang, William Yu Chung
    Wei, Yuhan
    INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, 2022, 19 (22)
  • [35] An Online Adaptive Approach to Alert Correlation
    Ren, Hanli
    Stakhanova, Natalia
    Ghorbani, Ali A.
    DETECTION OF INTRUSIONS AND MALWARE, AND VULNERABILITY ASSESSMENT, 2010, 6201 : 153 - 172
  • [36] Understanding the Impact of Service Reputation on the Online Group-buying Behaviors
    Yi, Xinhui
    Wang, Ying
    Zhang, Yutian
    Chang, Qing
    THIRTEENTH WUHAN INTERNATIONAL CONFERENCE ON E-BUSINESS, 2014, 2014, : 161 - 168
  • [37] PKE: A Model for Recommender Systems in Online Service Platform
    Tseng, Yun-Chien
    WWW'20: COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2020, 2020 : 289 - 293
  • [38] A survey of trust and reputation systems for online service provision
    Josang, Audun
    Ismail, Roslan
    Boyd, Colin
    DECISION SUPPORT SYSTEMS, 2007, 43 (02) : 618 - 644
  • [39] Online Learning and Pricing for Service Systems with Reusable Resources
    Jia, Huiwen
    Shi, Cong
    Shen, Siqian
    OPERATIONS RESEARCH, 2024, 72 (03) : 1203 - 1241
  • [40] Online Prediction and Improvement of Reliability for Service Oriented Systems
    Ding, Zuohua
    Xu, Ting
    Ye, Tiantian
    Zhou, Yuan
    IEEE TRANSACTIONS ON RELIABILITY, 2016, 65 (03) : 1133 - 1148