StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance

Cited by: 2
Authors
Mao, Yancan [1]
Chen, Zhanghao [2]
Zhang, Yifan [2]
Wang, Meng [2]
Fang, Yong [2]
Zhang, Guanghui [2]
Shi, Rui [2]
Ma, Richard T. B. [1]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] ByteDance Inc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2023 / Vol. 16 / No. 12
Keywords
LATENCY; MODEL;
DOI
10.14778/3611540.3611543
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving these issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to concurrently manage cluster-wide streaming jobs in a scalable and extensible manner, and it should manage diverse streaming jobs effectively. To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address these challenges. 1) For scalability, StreamOps runs as a standalone lightweight control plane that manages cluster-wide streaming jobs. 2) To enable extensible runtime management, StreamOps abstracts control policies to identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm, and each control policy is configurable for different streaming jobs according to their performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, i.e., auto-scaler, straggler detector, and job doctor, that are inspired by state-of-the-art research and production experience at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results further validate our system design.
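The detect-diagnose-resolve paradigm named in the abstract can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the StreamOps API: the class names (`ControlPolicy`, `LagAutoScaler`, `JobMetrics`), the metric fields, the thresholds, and the shape of the returned action are all hypothetical, since the abstract does not describe the actual interfaces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical metrics snapshot for one streaming job (assumed fields)."""
    job_id: str
    lag_seconds: float
    cpu_utilization: float  # fraction in [0.0, 1.0]
    parallelism: int


class ControlPolicy(ABC):
    """Hypothetical base class: a policy is three steps run in order."""

    @abstractmethod
    def detect(self, m: JobMetrics) -> bool:
        """Return True if the job shows a symptom this policy cares about."""

    @abstractmethod
    def diagnose(self, m: JobMetrics) -> Optional[str]:
        """Return a root-cause label, or None if the cause is not ours."""

    @abstractmethod
    def resolve(self, m: JobMetrics, cause: str) -> dict:
        """Return an action description for the diagnosed cause."""

    def run(self, m: JobMetrics) -> Optional[dict]:
        # Stop early when nothing is detected or no cause is identified.
        if not self.detect(m):
            return None
        cause = self.diagnose(m)
        if cause is None:
            return None
        return self.resolve(m, cause)


class LagAutoScaler(ControlPolicy):
    """Toy auto-scaler; thresholds are per-job configuration (assumed)."""

    def __init__(self, max_lag_seconds: float = 60.0,
                 cpu_threshold: float = 0.8, scale_factor: int = 2):
        self.max_lag_seconds = max_lag_seconds
        self.cpu_threshold = cpu_threshold
        self.scale_factor = scale_factor

    def detect(self, m: JobMetrics) -> bool:
        return m.lag_seconds > self.max_lag_seconds

    def diagnose(self, m: JobMetrics) -> Optional[str]:
        # Only treat the lag as a scaling problem when CPU is saturated.
        if m.cpu_utilization >= self.cpu_threshold:
            return "cpu_bottleneck"
        return None

    def resolve(self, m: JobMetrics, cause: str) -> dict:
        return {
            "job_id": m.job_id,
            "action": "rescale",
            "parallelism": m.parallelism * self.scale_factor,
        }


if __name__ == "__main__":
    # A latency-sensitive job gets a tighter lag budget than the default.
    policy = LagAutoScaler(max_lag_seconds=30.0)
    print(policy.run(JobMetrics("job-1", 120.0, 0.95, 4)))
```

Passing thresholds through the constructor is one simple way to express the per-job configurability the abstract mentions; a real control plane would presumably load such settings from job-level configuration rather than hard-code them.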
Pages: 3501 - 3514
Page count: 14