Task-based parallel programming model supporting fault tolerance

Authors
Wang Y.-Z. [1]
Chen X. [1]
Ji W.-X. [1]
Su Y. [1]
Wang X.-J. [1]
Shi F. [1]
Affiliation
[1] School of Computer Science, Beijing Institute of Technology, Beijing
Source
Ruan Jian Xue Bao/Journal of Software | 2016 / Vol. 27 / No. 07
Funding
National Natural Science Foundation of China;
Keywords
Fault tolerance; Load balancing; Parallel programming; Task parallelism; Work-stealing scheduling;
DOI
10.13328/j.cnki.jos.004842
Abstract
The task-based parallel programming model has become a mainstream parallel programming model, improving the performance of parallel computer systems by exploiting task parallelism. This paper presents a novel task-based parallel programming model that supports hardware fault tolerance. The model incorporates fault tolerance mechanisms into task-based parallel programming and aims to improve both system performance and reliability. It uses the task as the basic unit of scheduling, execution, fault detection, and recovery, and supports fault tolerance at the application level. A buffer-commit computation model is used for transient fault tolerance, and an application-level diskless checkpointing technique is employed for permanent fault tolerance. A fault-tolerant work-stealing scheduling scheme is adopted to achieve dynamic load balancing. Experimental results show that the proposed model provides hardware fault tolerance with low performance overhead. © Copyright 2016, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
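The buffer-commit idea named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function run_with_buffer_commit, the fault_detected callback, and the retry limit are all hypothetical names and assumptions introduced here, and the abstract does not say how faults are actually detected. The sketch only shows the general pattern: a task writes its outputs to a private buffer, and the buffer is committed to shared state only if no fault is reported; otherwise the buffer is discarded and the task is re-executed.

```python
from types import MappingProxyType


def run_with_buffer_commit(task, shared, fault_detected, max_retries=3):
    """Run `task` with buffered writes; commit only when no fault is reported.

    Assumptions (not from the paper): `task(state, out)` reads from a
    read-only view of the shared state and writes results into `out`;
    `fault_detected(attempt)` is a stand-in for whatever detector is used.
    Returns the index of the attempt that was committed.
    """
    for attempt in range(max_retries + 1):
        buffer = {}  # private write buffer, fresh for every attempt
        task(MappingProxyType(shared), buffer)
        if fault_detected(attempt):
            continue  # discard the buffer and re-execute the task
        shared.update(buffer)  # commit: make the task's writes visible
        return attempt
    raise RuntimeError("task did not complete without a detected fault")


def demo():
    shared = {"x": 1, "y": 2}

    def add_task(state, out):
        out["sum"] = state["x"] + state["y"]

    # Hypothetical detector: reports a transient fault on the first attempt only.
    attempt_committed = run_with_buffer_commit(
        add_task, shared, lambda attempt: attempt == 0
    )
    return shared, attempt_committed
```

Because the task never writes to shared state directly, a faulty attempt leaves that state untouched, so recovery in this sketch is simply re-execution of the same task.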
Pages: 1789 / 1804
Page count: 15