The Bounded Data Reuse Problem in Scientific Workflows

Cited by: 6
Authors
Zohrevandi, Mohsen [1]
Bazzi, Rida A. [1]
Affiliations
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
Keywords
Scientific Workflows; Intermediate Data; Data Reuse; Series-Parallel
DOI
10.1109/IPDPS.2013.71
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial and error with workflow components, which can take a long time given how time-consuming the workflow tasks are. These facts suggest the possibility of reducing development time by reusing intermediate data whenever possible. However, storage space is always limited. This raises a problem: which intermediate datasets from one workflow should be kept for reuse in another workflow, given a limited amount of storage? For the general class of series-parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-hard. We provide an optimal branch-and-bound algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly generated workflows as well as a smaller set of synthetic workflows based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics differs from the optimal value by less than 1% on average.
Pages: 1051 - 1062
Number of pages: 12
Related papers
50 in total
  • [1] Addressing the Shimming Problem in Big Data Scientific Workflows
    Mohan, Aravind
    Lu, Shiyong
    Kotov, Alexander
    2014 IEEE INTERNATIONAL CONFERENCE ON SERVICES COMPUTING (SCC 2014), 2014, : 347 - 354
  • [2] Search, Adapt, and Reuse: The Future of Scientific Workflows
    Cohen-Boulakia, Sarah
    Leser, Ulf
    SIGMOD RECORD, 2011, 40 (02) : 6 - 16
  • [3] A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems
    Casas, Israel
    Taheri, Javid
    Ranjan, Rajiv
    Wang, Lizhe
    Zomaya, Albert Y.
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2017, 74 : 168 - 178
  • [4] Network Analysis of Scientific Workflows: A Gateway to Reuse
    Tan, Wei
    Zhang, Jia
    Foster, Ian
    COMPUTER, 2010, 43 (09) : 54 - 61
  • [5] Improving the reuse of scientific workflows and their by-products
    Xiang, Xiaorong
    Madey, Gregory
    2007 IEEE INTERNATIONAL CONFERENCE ON WEB SERVICES, PROCEEDINGS, 2007, : 792 - +
  • [6] Experiment Line: Software Reuse in Scientific Workflows
    Ogasawara, Eduardo
    Paulino, Carlos
    Murta, Leonardo
    Werner, Claudia
    Mattoso, Marta
    SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, PROCEEDINGS, 2009, 5566 : 264 - +
  • [7] On the reuse of scientific data
    Pasquetto I.V.
    Randles B.M.
    Borgman C.L.
    Pasquetto, Irene V. (irenepasquetto@ucla.edu), 1600, Committee on Data for Science and Technology (16):
  • [8] RESTful Open Workflows for Data Provenance and Reuse
    Eckert, Kai
    Ritze, Dominique
    Baierer, Konstantin
    Bizer, Christian
    WWW'14 COMPANION: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 2014, : 259 - 260
  • [9] When History Matters - Assessing Reliability for the Reuse of Scientific Workflows
    Gomez-Perez, Jose Manuel
    Garcia-Cuesta, Esteban
    Garrido, Aleix
    Ruiz, Jose Enrique
    Zhao, Jun
    Klyne, Graham
    SEMANTIC WEB - ISWC 2013, PART II, 2013, 8219 : 81 - 97
  • [10] Scheduling of Scientific Workflows on Data Grids
    Pandey, Suraj
    Buyya, Rajkumar
    CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS, 2008, : 548 - 553