The Bounded Data Reuse Problem in Scientific Workflows

Cited by: 6
Authors
Zohrevandi, Mohsen [1]
Bazzi, Rida A. [1]
Affiliations
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
Keywords
Scientific Workflows; Intermediate Data; Data Reuse; Series-Parallel
DOI
10.1109/IPDPS.2013.71
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial and error with workflow components, which can take a long time because the workflow tasks themselves are time-consuming. These facts suggest that development time can be reduced by reusing intermediate data whenever possible. However, storage space is always limited. This raises a problem: given a limited amount of storage, which intermediate datasets from one workflow should be kept for reuse in another workflow? For the general class of series-parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-hard. We provide an optimal branch-and-bound algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly generated workflows as well as a smaller set of synthetic workflows based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics differs from the optimal value by less than 1% on average.
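The trade-off described in the abstract, choosing which intermediate datasets to keep under a storage limit, can be illustrated with a deliberately simplified sketch. It is not the paper's formulation: it assumes each dataset has an independent size and an independent time saving, which reduces the problem to a plain 0/1 knapsack and ignores the series-parallel dependency structure that makes the actual model non-linear. The function names, the tuple layout (name, size, time_saved), and the example numbers are all hypothetical. The exhaustive search stands in for an exact method and the ratio-based greedy stands in for a heuristic.

```python
from itertools import combinations

# Each dataset is a hypothetical (name, size, time_saved) tuple.
# Assumption: savings are independent and additive, unlike the paper's model.

def best_subset(datasets, capacity):
    """Exhaustively find the subset with maximum total time saved
    whose total size fits within the storage capacity."""
    best, best_saved = (), 0
    for r in range(len(datasets) + 1):
        for combo in combinations(datasets, r):
            size = sum(d[1] for d in combo)
            saved = sum(d[2] for d in combo)
            if size <= capacity and saved > best_saved:
                best, best_saved = combo, saved
    return [d[0] for d in best], best_saved

def greedy_subset(datasets, capacity):
    """Heuristic: take datasets in decreasing order of time saved per
    unit of storage, skipping any that no longer fit. Sizes must be > 0."""
    chosen, used, saved = [], 0, 0
    ranked = sorted(datasets, key=lambda d: d[2] / d[1], reverse=True)
    for name, size, time_saved in ranked:
        if used + size <= capacity:
            chosen.append(name)
            used += size
            saved += time_saved
    return chosen, saved

if __name__ == "__main__":
    example = [("a", 4, 40), ("b", 3, 24), ("c", 5, 45)]
    print(best_subset(example, 8))
    print(greedy_subset(example, 8))
```

With these example numbers and a capacity of 8, the exhaustive search keeps b and c (69 time units saved), while the greedy heuristic keeps a and b (64), which shows how a heuristic can fall short of the optimum even in this reduced setting.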
Pages: 1051-1062
Page count: 12
Related papers
50 in total
  • [31] Performance analysis and data reduction for exascale scientific workflows
    Kelly, Christopher
    Xu, Wei
    Pouchard, Line C.
    Van Dam, Hubertus
    Islam, Tanzima Z.
    Yoo, Shinjae
    Van Dam, Kerstin Kleese
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2025,
  • [32] Accelerating Scientific Workflows with Tiered Data Management System
    Cheng, Peng
    Lu, Yutong
    Du, Yunfei
    Chen, Zhiguang
    IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 75 - 82
  • [33] Securing the Intermediate Data of Scientific Workflows in Clouds With ACISO
    Wang, Yawen
    Guo, Yunfei
    Guo, Zehua
    Liu, Wenyan
    Yang, Chao
    IEEE ACCESS, 2019, 7 : 126603 - 126617
  • [34] Scientific workflows
    Anna-Lena Lamprecht
    Kenneth J. Turner
    International Journal on Software Tools for Technology Transfer, 2016, 18 : 575 - 580
  • [35] A Data Placement Strategy for Data-Intensive Scientific Workflows in Cloud
    Zhao, Qing
    Xiong, Congcong
    Zhao, Xi
    Yu, Ce
    Xiao, Jian
    2015 15TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING, 2015, : 928 - 934
  • [36] Production workflows: A model for reuse
    Carchiolo, Vincenza
    Longheu, Alessandro
    Malgeri, Michele
    ETFA 2005: 10th IEEE International Conference on Emerging Technologies and Factory Automation, Vol 1, Pts 1 and 2, Proceedings, 2005, : 285 - 288
  • [37] GDPR and the Reuse of Personal Data in Scientific Research
    Katulic, Tihomir
    Katulic, Anita
    2018 41ST INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2018, : 1311 - 1316
  • [38] Data-Aware Scheduling of Scientific Workflows in Hybrid Clouds
    Pasdar, Amirmohammad
    Almi'ani, Khaled
    Lee, Young Choon
    COMPUTATIONAL SCIENCE - ICCS 2018, PT III, 2018, 10862 : 708 - 714
  • [39] A framework for collecting provenance in data-centric scientific workflows
    Simmhan, Yogesh L.
    Plale, Beth
    Gannon, Dennis
    ICWS 2006: IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES, PROCEEDINGS, 2006, : 427 - +
  • [40] An ontology-driven framework for data transformation in scientific workflows
    Bowers, S
    Ludäscher, B
    DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS, 2004, 2994 : 1 - 16