The Bounded Data Reuse Problem in Scientific Workflows

Cited by: 6
Authors
Zohrevandi, Mohsen [1]
Bazzi, Rida A. [1]
Affiliation
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
Keywords
Scientific Workflows; Intermediate Data; Data Reuse; Series-Parallel
DOI
10.1109/IPDPS.2013.71
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial and error with workflow components, which can take a long time because the workflow tasks are themselves time-consuming. This suggests that development time can be reduced by reusing intermediate data whenever possible. However, storage space is always limited, which raises a problem: given a limited amount of storage, which intermediate datasets from one workflow should be kept for reuse in another workflow? For the general class of series-parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-hard. We provide an optimal branch-and-bound algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly generated workflows as well as a smaller set of synthetic workflows based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics differs from the optimal value by less than 1% on average.
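The selection problem stated in the abstract can be illustrated with a toy sketch. This is not the paper's non-linear integer program, its branch-and-bound algorithm, or its heuristics; it is a brute-force search over a hypothetical four-task diamond workflow (A feeds B and C, which both feed D). The task names, runtimes, sizes, and the functions `rerun_cost` and `best_selection` are all assumptions made for illustration. The cost model assumed here is that a task must run unless its output is stored, and a running task needs every predecessor's output.

```python
from itertools import combinations

# Hypothetical workflow: task -> (runtime, output size). All values are made up.
TASKS = {"A": (10, 8), "B": (5, 3), "C": (7, 4), "D": (2, 1)}
# task -> tasks whose outputs it consumes (a diamond, which is series-parallel).
DEPS = {"B": ("A",), "C": ("A",), "D": ("B", "C")}
# Assumption: D changes in the next workflow, so only A, B, C can be reused.
REUSABLE = {"A", "B", "C"}


def rerun_cost(tasks, deps, stored, targets):
    """Total runtime of the tasks that must execute to produce `targets`
    when the outputs in `stored` are read back instead of recomputed."""
    needed = set()
    stack = list(targets)
    while stack:
        t = stack.pop()
        if t in stored or t in needed:
            continue  # output already available, or task already counted
        needed.add(t)
        stack.extend(deps.get(t, ()))
    return sum(tasks[t][0] for t in needed)


def best_selection(tasks, deps, targets, reusable, capacity):
    """Exhaustively try every subset of reusable outputs that fits in
    `capacity` and return (minimum rerun cost, stored set)."""
    names = sorted(reusable)
    best = (rerun_cost(tasks, deps, frozenset(), targets), frozenset())
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(tasks[t][1] for t in subset) > capacity:
                continue  # violates the storage bound
            cost = rerun_cost(tasks, deps, frozenset(subset), targets)
            if cost < best[0]:
                best = (cost, frozenset(subset))
    return best


if __name__ == "__main__":
    for cap in (0, 4, 7):
        print(cap, best_selection(TASKS, DEPS, ["D"], REUSABLE, cap))
```

In this toy instance, storing B alone does not avoid rerunning A, because C still needs A's output. The value of keeping one dataset therefore depends on which others are kept, which is one way to see why a simple additive (linear) objective is not enough. The exhaustive search is exponential in the number of datasets and is only meant for very small examples.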
Pages: 1051-1062
Number of pages: 12
Related papers
50 in total
  • [41] Scheduling Data-Intensive Scientific Workflows with Reduced Communication
    Pietri, Ilia
    Sakellariou, Rizos
    30TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT (SSDBM 2018), 2018,
  • [42] End-to-End Scientific Data Management using Workflows
    Simmhan, Yogesh
    IEEE CONGRESS ON SERVICES 2008, PT I, PROCEEDINGS, 2008, : 472 - 473
  • [43] New Execution Paradigm for Data-Intensive Scientific Workflows
    El-Gayyar, Mahmoud
    Leng, Yan
    Shumilov, Serge
    Cremers, Armin
    2009 IEEE CONGRESS ON SERVICES (SERVICES-1 2009), VOLS 1 AND 2, 2009, : 334 - 339
  • [44] Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud
    Heidsieck, Gaetan
    de Oliveira, Daniel
    Pacitti, Esther
    Pradal, Christophe
    Tardieu, Francois
    Valduriez, Patrick
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT II, 2019, 11707 : 452 - 466
  • [45] A Data Placement Strategy Based on Genetic Algorithm for Scientific Workflows
    Zhao Er-Dun
    Qi Yong-Qiang
    Xiang Xing-Xing
    Chen Yi
    PROCEEDINGS OF THE 2012 EIGHTH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS 2012), 2012, : 146 - 149
  • [46] From data to knowledge to discoveries: Artificial intelligence and scientific workflows
    Gil, Yolanda
    SCIENTIFIC PROGRAMMING, 2009, 17 (03) : 231 - 246
  • [47] Addressing Scientific Rigor in Data Analytics Using Semantic Workflows
    Erickson, John S.
    Sheehan, John
    Bennett, Kristin P.
    McGuinness, Deborah L.
    Provenance and Annotation of Data and Processes, IPAW 2016, 2016, 9672 : 187 - 190
  • [48] An Ontology-Driven Framework for Data Transformation in Scientific Workflows
    Bowers, Shawn
    Ludäscher, Bertram
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2004, 2994 : 1 - 16
  • [49] Lifecycle Support for Scientific Investigations: Integrating Data, Computing, and Workflows
    Catlin, Ann Christine
    HewaNadungodage, Chandima
    Bejarano, Andres
    COMPUTING IN SCIENCE & ENGINEERING, 2019, 21 (04) : 49 - 61
  • [50] Persistent Data Staging Services for Data Intensive In-situ Scientific Workflows
    Romanus, Melissa
    Zhang, Fan
    Jin, Tong
    Sun, Qian
    Bui, Hoang
    Parashar, Manish
    Choi, Jong
    Janhunen, Saloman
    Hager, Robert
    Klasky, Scott
    Chang, Choong-Seock
    Rodero, Ivan
    DIDC'16: PROCEEDINGS OF THE ACM INTERNATIONAL WORKSHOP ON DATA-INTENSIVE DISTRIBUTED COMPUTING, 2016, : 37 - 44