Understanding and Handling Alert Storm for Online Service Systems

Cited by: 40
Authors
Zhao, Nengwen [1, 2, 6]
Chen, Junjie [3 ]
Peng, Xiao [4 ]
Wang, Honglin [5 ]
Wu, Xinya [5 ]
Zhang, Yuanzong [5 ]
Chen, Zikai [1, 2]
Zheng, Xiangzhong [5 ]
Nie, Xiaohui [1, 2]
Wang, Gang [4 ]
Wu, Yong [4 ]
Zhou, Fang [4 ]
Zhang, Wenchi [5 ]
Sui, Kaixin [5 ]
Pei, Dan [1, 2]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] BNRist, Beijing, Peoples R China
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[4] China EverBright Bank, Beijing, Peoples R China
[5] BizSeer, Beijing, Peoples R China
[6] BNRist Beijing Natl Res Ctr Informat Sci & Techno, Beijing, Peoples R China
Funding
National Key Research and Development Program;
Keywords
Alert Storm; Alert Summary; Problem Identification; Failure Diagnosis;
DOI
10.1145/3377813.3381363
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Alerts are a key data source in monitoring systems for online service systems: they record anomalies in service components and report them to engineers. In general, a service failure tends to be accompanied by a large number of alerts, a phenomenon called an alert storm. However, an alert storm makes failure diagnosis very challenging, because it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings of the study, we propose a novel approach to handling alert storms. Specifically, the approach includes alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). In addition, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China EverBright Bank), and we also share our success stories and lessons learned in industry.
Pages: 162 - 171
Number of pages: 10