PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

Cited by: 1
Authors
Han, Runzhou [1]
Zheng, Mai [1]
Byna, Suren [2]
Tang, Houjun [3]
Dong, Bin [3]
Dai, Dong [4]
Chen, Yong [5]
Kim, Dongkyun [5]
Hassoun, Joseph [5]
Thorsley, David [5]
Affiliations
[1] Iowa State Univ, Dept Elect & Comp Engn, Ames, IA 50014 USA
[2] Ohio State Univ, Columbus, OH 43210 USA
[3] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[4] Univ North Carolina Charlotte, Charlotte, NC 28223 USA
[5] Samsung Res Labs, Mountain View, CA 94043 USA
Keywords
Data provenance; HPC I/O libraries; high performance computing (HPC); scientific data management; workflows; DETECTING DATA RACES;
DOI
10.1109/TPDS.2024.3374555
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of data products, usage patterns of datasets). Unfortunately, existing provenance solutions cannot address the challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative scientific workflows in collaboration with the domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a provenance framework called PROV-IO+, which includes an I/O-centric provenance model for describing scientific data and the associated I/O operations and environments precisely. Moreover, we build a prototype of PROV-IO+ to enable end-to-end provenance support on real HPC systems with little manual effort. The PROV-IO+ framework can support both containerized and non-containerized workflows on different HPC platforms with flexibility in selecting various classes of provenance. Our experiments with realistic workflows show that PROV-IO+ can address the provenance needs of the domain scientists effectively with reasonable performance (e.g., less than 3.5% tracking overhead for most experiments). Moreover, PROV-IO+ outperforms a state-of-the-art system (i.e., ProvLake) in our experiments.
Pages: 844-861
Page count: 18