StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance

Cited by: 2
Authors
Mao, Yancan [1]
Chen, Zhanghao [2]
Zhang, Yifan [2]
Wang, Meng [2]
Fang, Yong [2]
Zhang, Guanghui [2]
Shi, Rui [2]
Ma, Richard T. B. [1]
Affiliations
[1] National University of Singapore, Singapore, Singapore
[2] ByteDance Inc., Beijing, People's Republic of China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023, Vol. 16, No. 12
Keywords
LATENCY; MODEL;
DOI
10.14778/3611540.3611543
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving such issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to manage cluster-wide streaming jobs concurrently in a scalable and extensible manner, and it must also manage diverse streaming jobs effectively. To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address these challenges. 1) For scalability, StreamOps runs as a standalone lightweight control plane that manages cluster-wide streaming jobs. 2) For extensible runtime management, StreamOps abstracts control policies to identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm, and each control policy is configurable per streaming job according to its performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, i.e., auto-scaler, straggler detector, and job doctor, which are inspired by state-of-the-art research and production experience at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results validate our system design.
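The detect-diagnose-resolve paradigm named in the abstract might be sketched as follows. This is an illustrative sketch only, not StreamOps code: the abstract does not give the actual API, so every name here (`JobMetrics`, `ControlPolicy`, `LagAutoScaler`, the metric fields, the thresholds, and the doubling rule) is a hypothetical assumption chosen to show how a per-job configurable policy could be structured.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical metrics snapshot for one streaming job (fields are assumptions)."""
    job_id: str
    lag_seconds: float
    parallelism: int


class ControlPolicy(ABC):
    """Hypothetical base class: one policy = detect, then diagnose, then resolve."""

    @abstractmethod
    def detect(self, metrics: JobMetrics) -> bool:
        """Return True if the job shows a runtime issue."""

    @abstractmethod
    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        """Return a cause label, or None if no actionable cause is found."""

    @abstractmethod
    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        """Return an action description for the diagnosed cause."""

    def run(self, metrics: JobMetrics) -> Optional[dict]:
        # Chain the three phases; stop early when nothing is detected or diagnosed.
        if not self.detect(metrics):
            return None
        cause = self.diagnose(metrics)
        if cause is None:
            return None
        return self.resolve(metrics, cause)


class LagAutoScaler(ControlPolicy):
    """Toy auto-scaler; the thresholds are per-job configuration (assumed values)."""

    def __init__(self, max_lag_seconds: float, max_parallelism: int):
        self.max_lag_seconds = max_lag_seconds
        self.max_parallelism = max_parallelism

    def detect(self, metrics: JobMetrics) -> bool:
        return metrics.lag_seconds > self.max_lag_seconds

    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        # This toy policy gives up when the job is already at its parallelism cap.
        if metrics.parallelism >= self.max_parallelism:
            return None
        return "under_provisioned"

    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        # Assumed rule: double the parallelism, bounded by the configured cap.
        target = min(metrics.parallelism * 2, self.max_parallelism)
        return {"job_id": metrics.job_id, "action": "rescale", "parallelism": target}


policy = LagAutoScaler(max_lag_seconds=60.0, max_parallelism=16)
print(policy.run(JobMetrics("job-a", lag_seconds=120.0, parallelism=4)))
```

Constructing a separate policy instance per job, each with its own thresholds, is one simple way to model the per-job configurability the abstract mentions; a production control plane would likely load such settings from job-level configuration instead.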
Pages: 3501-3514
Page count: 14