The Bounded Data Reuse Problem in Scientific Workflows

Cited by: 6

Authors:

Zohrevandi, Mohsen [1]

Bazzi, Rida A. [1]

Affiliations:

[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA

Source:

IEEE 27TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2013) | 2013

Keywords:

Scientific Workflows; Intermediate Data; Data Reuse; Series-Parallel

DOI:

10.1109/IPDPS.2013.71

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Large datasets and time-consuming processes have become the norm in scientific computing applications. The exploration phase in the development of scientific workflows involves trial and error with workflow components, which can take a long time because the workflow tasks themselves are time-consuming. These facts suggest that development time can be reduced by reusing intermediate data whenever possible. However, storage space is always limited. This raises a problem: given a limited amount of storage, which intermediate datasets from one workflow should be kept for reuse in another workflow? For the general class of series-parallel graphs, we model this problem using a non-linear integer programming formulation and show that it is NP-hard. We provide an optimal branch-and-bound algorithm as well as efficient heuristics. We conducted experiments over a large set of randomly generated workflows as well as a smaller set of synthetic workflows based on real-world workflows used by scientists in different disciplines. Our experiments show that the best solution produced by the heuristics differs from the optimal value by less than 1% on average.
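
The selection problem described in the abstract can be illustrated with a toy sketch. This is not the paper's formulation or its branch-and-bound algorithm: it assumes a simplified model in which each task has a runtime and an output size, the tasks in `targets` must be re-run in the next workflow (so their outputs are not candidates for storage), and an exhaustive search picks the stored set that minimizes re-run time within a storage capacity. All names (`rerun_time`, `best_storage`, the example tasks and numbers) are hypothetical.

```python
from itertools import combinations

def rerun_time(parents, runtime, targets, stored):
    """Total runtime of the tasks that must execute to produce `targets`
    when the outputs of the tasks in `stored` are already available."""
    must_run = set()
    stack = [t for t in targets if t not in stored]
    while stack:
        task = stack.pop()
        if task in must_run:
            continue
        must_run.add(task)
        # A parent whose output is stored does not need to be recomputed.
        stack.extend(p for p in parents[task] if p not in stored)
    return sum(runtime[t] for t in must_run)

def best_storage(parents, runtime, size, targets, capacity):
    """Exhaustive search (exponential; toy sizes only) for the set of
    non-target outputs that fits in `capacity` and minimizes re-run time."""
    candidates = [t for t in parents if t not in targets]
    best = (rerun_time(parents, runtime, targets, set()), frozenset())
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if sum(size[t] for t in subset) > capacity:
                continue
            cost = rerun_time(parents, runtime, targets, set(subset))
            if cost < best[0]:
                best = (cost, frozenset(subset))
    return best

# Hypothetical series-parallel workflow: a feeds b and c, which both feed d.
parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
runtime = {"a": 10, "b": 5, "c": 20, "d": 1}
size = {"a": 4, "b": 3, "c": 3, "d": 1}
targets = {"d"}  # assumed to be the task that changes in the next workflow

print(best_storage(parents, runtime, size, targets, capacity=5))
print(best_storage(parents, runtime, size, targets, capacity=6))
```

In this toy instance, storing b alone saves 5 time units and storing c alone saves 20, but storing both saves 35 because task a no longer has to run either. Savings that do not add up across datasets are one intuition for why a linear model is not enough, although the paper's actual non-linear formulation may differ from this sketch.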

Pages: 1051-1062

Number of pages: 12

50 items in total

[41] Scheduling Data-Intensive Scientific Workflows with Reduced Communication
Pietri, Ilia
Sakellariou, Rizos
30TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT (SSDBM 2018), 2018
[42] End-to-End Scientific Data Management using Workflows
Simmhan, Yogesh
IEEE CONGRESS ON SERVICES 2008, PT I, PROCEEDINGS, 2008, : 472 - 473
[43] New Execution Paradigm for Data-Intensive Scientific Workflows
El-Gayyar, Mahmoud
Leng, Yan
Shumilov, Serge
Cremers, Armin
2009 IEEE CONGRESS ON SERVICES (SERVICES-1 2009), VOLS 1 AND 2, 2009, : 334 - 339
[44] Adaptive Caching for Data-Intensive Scientific Workflows in the Cloud
Heidsieck, Gaetan
de Oliveira, Daniel
Pacitti, Esther
Pradal, Christophe
Tardieu, Francois
Valduriez, Patrick
DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT II, 2019, 11707 : 452 - 466
[45] A Data Placement Strategy Based on Genetic Algorithm for Scientific Workflows
Zhao Er-Dun
Qi Yong-Qiang
Xiang Xing-Xing
Chen Yi
PROCEEDINGS OF THE 2012 EIGHTH INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY (CIS 2012), 2012, : 146 - 149
[46] From data to knowledge to discoveries: Artificial intelligence and scientific workflows
Gil, Yolanda
SCIENTIFIC PROGRAMMING, 2009, 17 (03) : 231 - 246
[47] Addressing Scientific Rigor in Data Analytics Using Semantic Workflows
Erickson, John S.
Sheehan, John
Bennett, Kristin P.
McGuinness, Deborah L.
Provenance and Annotation of Data and Processes, IPAW 2016, 2016, 9672 : 187 - 190
[48] An Ontology-Driven Framework for Data Transformation in Scientific Workflows
Bowers, Shawn
Ludäscher, Bertram
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2004, 2994 : 1 - 16
[49] Lifecycle Support for Scientific Investigations: Integrating Data, Computing, and Workflows
Catlin, Ann Christine
HewaNadungodage, Chandima
Bejarano, Andres
COMPUTING IN SCIENCE & ENGINEERING, 2019, 21 (04) : 49 - 61
[50] Persistent Data Staging Services for Data Intensive In-situ Scientific Workflows
Romanus, Melissa
Zhang, Fan
Jin, Tong
Sun, Qian
Bui, Hoang
Parashar, Manish
Choi, Jong
Janhunen, Saloman
Hager, Robert
Klasky, Scott
Chang, Choong-Seock
Rodero, Ivan
DIDC'16: PROCEEDINGS OF THE ACM INTERNATIONAL WORKSHOP ON DATA-INTENSIVE DISTRIBUTED COMPUTING, 2016, : 37 - 44