Exploring Plan-Based Scheduling for Large-Scale Computing Systems

Cited by: 9
Authors
Zheng, Xingwu [1]
Zhou, Zhou [2]
Yang, Xu [2]
Lan, Zhiling [2]
Wang, Jia [1]
Affiliations
[1] IIT, Dept Elect & Comp Engn, Chicago, IL 60616 USA
[2] IIT, Dept Comp Sci, Chicago, IL 60616 USA
Keywords
Plan-based scheduling; Simulated Annealing algorithm; Optimization;
DOI
10.1109/CLUSTER.2016.43
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
As HPC systems scale toward exascale, it becomes critical to manage the underlying resources more effectively. Almost all existing resource management systems schedule jobs in a queuing fashion, making isolated scheduling decisions that can compromise system performance even with backfilling. Plan-based schedulers can potentially generate better job schedules by producing an execution plan for all waiting jobs, but they have not received enough attention. In this paper, we present a novel plan-based scheduling system that uses simulated annealing as its optimization engine to support effective resource management on HPC systems. Extensive trace-based simulations with workload traces collected from a wide range of production supercomputers show that, compared with a queue-based scheduling system using FCFS with EASY backfilling, our plan-based scheduling system can reduce job wait time by 40% and job response time by 30%, while slightly improving system utilization. Moreover, our plan-based system can run online, solving the scheduling problem at each scheduling iteration within one second, which makes it practical for production HPC systems.
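To make the idea in the abstract concrete, here is a minimal Python sketch of simulated annealing over an execution plan. It is not the authors' implementation: the function names (`average_wait`, `anneal`), the swap-based neighbor move, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration. It also simplifies the problem by assuming every job is already waiting at time 0, runtimes are known exactly, jobs start strictly in plan order with no backfilling, and no job requests more nodes than the machine has.

```python
import heapq
import math
import random


def average_wait(plan, jobs, total_nodes):
    """Average wait time when jobs start strictly in plan order.

    jobs[i] is a (nodes, runtime) pair; all jobs are assumed submitted at t=0.
    """
    running = []  # min-heap of (finish_time, nodes) for started jobs
    free = total_nodes
    now = 0.0
    total_wait = 0.0
    for j in plan:
        nodes, runtime = jobs[j]
        # Advance time until enough nodes are free for this job.
        while free < nodes:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        total_wait += now
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
    return total_wait / len(plan)


def anneal(jobs, total_nodes, steps=2000, t0=10.0, cooling=0.995, seed=0):
    """Search for a plan (job ordering) with low average wait time."""
    rng = random.Random(seed)
    plan = list(range(len(jobs)))  # start from submission (FCFS) order
    cost = average_wait(plan, jobs, total_nodes)
    best_plan, best_cost = plan[:], cost
    temp = t0
    for _ in range(steps):
        # Neighbor move: swap two positions in the plan.
        a, b = rng.sample(range(len(plan)), 2)
        plan[a], plan[b] = plan[b], plan[a]
        new_cost = average_wait(plan, jobs, total_nodes)
        delta = new_cost - cost
        # Always accept improvements; accept worse plans with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_plan, best_cost = plan[:], cost
        else:
            plan[a], plan[b] = plan[b], plan[a]  # undo the swap
        temp *= cooling
    return best_plan, best_cost


# Hypothetical workload: (nodes requested, runtime) on a 4-node machine.
jobs = [(4, 100), (1, 10), (2, 5), (1, 1)]
best_plan, best_cost = anneal(jobs, total_nodes=4)
```

In this toy workload the submission order puts a large, long job first, so every other job waits behind it. A plan that considers all waiting jobs together can move the short jobs ahead, which is the kind of gain the abstract attributes to plan-based scheduling.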
Pages: 259-268
Number of pages: 10