StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance

Cited by: 2
Authors
Mao, Yancan [1 ]
Chen, Zhanghao [2 ]
Zhang, Yifan [2 ]
Wang, Meng [2 ]
Fang, Yong [2 ]
Zhang, Guanghui [2 ]
Shi, Rui [2 ]
Ma, Richard T. B. [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] ByteDance Inc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023, Vol. 16, No. 12
Keywords
LATENCY; MODEL;
DOI
10.14778/3611540.3611543
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving such issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to manage cluster-wide streaming jobs concurrently in a scalable and extensible manner, and it must manage diverse streaming jobs effectively. To this end, we propose StreamOps, which enables cloud-native runtime management for streaming jobs at ByteDance. StreamOps has three main design features that address these challenges. 1) For scalability, StreamOps runs as a standalone, lightweight control plane that manages cluster-wide streaming jobs. 2) For extensible runtime management, StreamOps abstracts control policies that identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm, and each control policy is configurable for different streaming jobs according to their performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, namely an auto-scaler, a straggler detector, and a job doctor, inspired by state-of-the-art research and production experience at ByteDance. In this paper, we present the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results validate our system design.
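The detect-diagnose-resolve paradigm named in the abstract could be sketched roughly as follows. This is an illustration only, not the StreamOps API: the abstract does not specify any interfaces, so `JobMetrics`, `ControlPolicy`, `LagScaler`, and every field, threshold, and action name below are hypothetical. The toy policy simply doubles parallelism when lag exceeds a configurable threshold; it does not reproduce the paper's auto-scaler.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical per-job metrics snapshot (field names are assumptions)."""
    job_id: str
    lag_seconds: float
    parallelism: int


class ControlPolicy(ABC):
    """Hypothetical base class: one policy = detect + diagnose + resolve."""

    @abstractmethod
    def detect(self, metrics: JobMetrics) -> bool:
        """Return True if the job shows a runtime issue."""

    @abstractmethod
    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        """Return a cause label, or None if no cause can be identified."""

    @abstractmethod
    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        """Return an action description for the diagnosed cause."""

    def run(self, metrics: JobMetrics) -> Optional[dict]:
        # Run the three phases in order; stop early when nothing is wrong
        # or when no cause is found.
        if not self.detect(metrics):
            return None
        cause = self.diagnose(metrics)
        if cause is None:
            return None
        return self.resolve(metrics, cause)


class LagScaler(ControlPolicy):
    """Toy policy: doubles parallelism when lag exceeds a per-job threshold."""

    def __init__(self, max_lag_seconds: float = 60.0, scale_factor: int = 2):
        # Per-job configuration, standing in for "performance requirements".
        self.max_lag_seconds = max_lag_seconds
        self.scale_factor = scale_factor

    def detect(self, metrics: JobMetrics) -> bool:
        return metrics.lag_seconds > self.max_lag_seconds

    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        # A real diagnosis would inspect more signals; this is a placeholder.
        return "insufficient_parallelism"

    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        return {
            "job_id": metrics.job_id,
            "action": "rescale",
            "parallelism": metrics.parallelism * self.scale_factor,
        }
```

Under these assumptions, a control plane would call `policy.run(metrics)` for each managed job and apply the returned action, if any. Configuring a policy per job amounts to constructing it with different parameters, for example `LagScaler(max_lag_seconds=30.0)` for a latency-sensitive job.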
Pages: 3501-3514
Page count: 14