Understanding and Handling Alert Storm for Online Service Systems

Cited by: 40
Authors
Zhao, Nengwen [1 ,2 ,6 ]
Chen, Junjie [3 ]
Peng, Xiao [4 ]
Wang, Honglin [5 ]
Wu, Xinya [5 ]
Zhang, Yuanzong [5 ]
Chen, Zikai [1 ,2 ]
Zheng, Xiangzhong [5 ]
Nie, Xiaohui [1 ,2 ]
Wang, Gang [4 ]
Wu, Yong [4 ]
Zhou, Fang [4 ]
Zhang, Wenchi [5 ]
Sui, Kaixin [5 ]
Pei, Dan [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing, Peoples R China
[2] BNRist, Beijing, Peoples R China
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[4] China EverBright Bank, Beijing, Peoples R China
[5] BizSeer, Beijing, Peoples R China
[6] BNRist Beijing Natl Res Ctr Informat Sci & Techno, Beijing, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Alert Storm; Alert Summary; Problem Identification; Failure Diagnosis;
DOI
10.1145/3377813.3381363
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Alerts are a key data source in monitoring systems for online service systems: they record anomalies in service components and report them to engineers. A service failure is typically accompanied by a large number of alerts, a phenomenon called an alert storm. Alert storms make failure diagnosis very challenging, because manually investigating such an overwhelming number of alerts is time-consuming and tedious for engineers. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings of this study, we propose a novel approach to handling alert storms. Specifically, the approach comprises alert storm detection, which aims to identify alert storms accurately, and alert storm summary, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). In addition, our alert storm summary reduces the number of alerts that need to be examined by more than 98% and discovers representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China EverBright Bank), and we also share our success stories and lessons learned in industry.
Pages: 162-171
Page count: 10
Related papers
50 in total
  • [21] Performance Issue Diagnosis for Online Service Systems
    Fu, Qiang
    Lou, Jian-Guang
    Lin, Qing-Wei
    Ding, Rui
    Zhang, Dongmei
    Ye, Zihao
    Xie, Tao
    2012 31ST INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2012), 2012, : 273 - 278
  • [22] IMPACT OF ONLINE SYSTEMS ON A LITERATURE SEARCHING SERVICE
    HAWKINS, DT
    SPECIAL LIBRARIES, 1976, 67 (12) : 559 - 567
  • [23] HANDLING THE ALERT IN HEAVY WEATHER - BURDON,GEORGE
    MAY, WE
    MARINERS MIRROR, 1979, 65 (01): : 84 - 84
  • [24] A Resilient Framework for Fault Handling in Web Service Oriented Systems
    Wang, Weidong
    Wang, Liqiang
    Lu, Wei
    2015 IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES (ICWS), 2015, : 663 - 670
  • [25] Understanding Vulnerable Families in Multiple Service Systems
    Goerge, Robert M.
    Wiegand, Emily R.
    RSF-THE RUSSELL SAGE JOURNAL OF THE SOCIAL SCIENCES, 2019, 5 (02): : 86 - 104
  • [26] Understanding failure response in service discovery systems
    Dabrowski, C.
    Mills, K.
    Quirolgico, S.
    JOURNAL OF SYSTEMS AND SOFTWARE, 2007, 80 (06) : 896 - 917
  • [27] A SPECIAL SERVICE TEAM ON THE ALERT
    BREZEL, B
    DANIELLO, D
    ACADEMIC THERAPY, 1983, 19 (02): : 241 - 244
  • [28] MEDIC ALERT - A LIFESAVING SERVICE
    TODD, MC
    JOURNAL OF THE AMERICAN MEDICAL TECHNOLOGISTS, 1981, 43 (02): : 76 - 77
  • [29] Analytics-Based Solutions for Improving Alert Management Service for Enterprise Systems
    Kelkar, Anuja
    Naiknaware, Utkarsh
    Sukhlecha, Sachin
    Sanadhya, Ashish
    Natu, Maitreya
    Sadaphal, Vaishali
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2013, : 219 - 227
  • [30] The SIMBA user alert service architecture for dependable alert delivery
    Wang, YM
    Bahl, P
    Russell, W
    INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2001, : 463 - 472