Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

Cited by: 6
Authors
Ahmad, Tanveer [1]
Ahmed, Nauman [1]
Al-Ars, Zaid [1]
Hofstee, H. Peter [1, 2]
Affiliations
[1] Delft Univ Technol, Quantum & Comp Engn Dept, Accelerated Big Data Syst Grp, Delft, Netherlands
[2] IBM Res Austin, Austin, TX USA
Keywords
Genomics; Whole Genome; Exome Sequencing; Big Data; Apache Arrow; In-Memory Data; GATK Best Practices; READ ALIGNMENT; MAPREDUCE; CANCER
DOI
10.1186/s12864-020-07013-y
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Background: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for downstream analyses. Computing systems need these large amounts of data close to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. Recent developments in storage-class memory and non-volatile memory technologies, however, have enabled computing systems to place huge data sets in memory and process them directly from memory, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, data must be placed in memory in a suitable format and accessed at high throughput, avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without accessing disk storage and without (de)serialization and copy overheads.

Implementation: We integrate an Apache Arrow based in-memory Sequence Alignment/Map (SAM) format and its shared-memory object store library into widely used high-throughput genomics data processing applications such as BWA-MEM, Picard and GATK, to allow in-memory communication between these applications. This also allows us to exploit the cache locality of tabular data and the parallel processing capabilities offered by shared memory objects.

Results: Our implementation shows that adopting an in-memory SAM representation in high-throughput genomics data processing applications results in better system resource utilization, fewer memory accesses due to better exploitation of cache locality, and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placement and sharing techniques, such as ramDisk and Unix pipes, to show how columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in the overall execution time of variant calling workflows. Compared to the second fastest workflow, the speedups for these data sets are 1.45x and 1.27x, respectively. In some individual tools, particularly sorting, duplicate removal and base quality score recalibration, the speedup is even more promising.

Availability: The code and scripts used in our experiments are available in both container and repository form at: .
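
The columnar idea described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation and does not use Apache Arrow itself: the sample records, the `to_columns` and `coordinate_sort_order` helpers, and the choice of Python's `array` module are all assumptions made for illustration. The point it tries to show is that a coordinate sort only needs to read the reference-name and position columns, rather than every field of every record.

```python
from array import array

# Hypothetical SAM records (11 mandatory tab-separated fields), for illustration only.
SAM_LINES = [
    "r1\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t100\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t0\tchr1\t200\t30\t4M\t*\t0\t0\tGGCA\tIIII",
]

FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
          "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def to_columns(lines):
    """Convert row-oriented SAM text into one column per field.

    Integer fields are stored in contiguous typed arrays; string fields
    are kept in plain lists (a simplification compared to Arrow buffers).
    """
    columns = {name: (array("q") if name in INT_FIELDS else []) for name in FIELDS}
    for line in lines:
        for name, value in zip(FIELDS, line.split("\t")):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns


def coordinate_sort_order(columns):
    """Return row indices ordered by (rname, pos), reading only two columns."""
    rname, pos = columns["rname"], columns["pos"]
    return sorted(range(len(pos)), key=lambda i: (rname[i], pos[i]))


columns = to_columns(SAM_LINES)
order = coordinate_sort_order(columns)
sorted_qnames = [columns["qname"][i] for i in order]
print(sorted_qnames)
```

In a real Arrow-based pipeline the columns would live in Arrow buffers inside a shared-memory object store so that other processes could read them without copying; the sketch above keeps everything in one process for simplicity.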
Pages: 14
Related papers
50 in total
  • [1] Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework
    Tanveer Ahmad
    Nauman Ahmed
    Zaid Al-Ars
    H. Peter Hofstee
    BMC Genomics, 21
  • [2] In-Memory Performance for Big Data
    Graefe, Goetz
    Volos, Haris
    Kimura, Hideaki
    Kuno, Harumi
    Tucek, Joseph
    Lillibridge, Mark
    Veitch, Alistair
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 8 (01): : 37 - 48
  • [3] Weather data analysis using Spark - An In-memory Computing framework
    Jayanthi, D.
    Sumathi, G.
    2017 INNOVATIONS IN POWER AND ADVANCED COMPUTING TECHNOLOGIES (I-PACT), 2017,
  • [4] Optimizing Performance and Computing Resource Management of in-memory Big Data Analytics with Disaggregated Persistent Memory
    Chen, Shouwei
    Wang, Wensheng
    Wu, Xueyang
    Fan, Zhen
    Huang, Kunwu
    Zhuang, Peiyu
    Li, Yue
    Rodero, Ivan
    Parashar, Manish
    Weng, Dennis
    2019 19TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2019, : 21 - 30
  • [5] Apache Nemo: A Framework for Optimizing Distributed Data Processing
    Song, Won Wook
    Yang, Youngseok
    Eo, Jeongyoon
    Seo, Jangho
    Kim, Joo Yeon
    Lee, Sanha
    Lee, Gyewon
    Um, Taegeon
    Cho, Haeyoon
    Chun, Byung-Gon
    ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2021, 38 (3-4):
  • [6] Performance Prediction for Data-driven Workflows on Apache Spark
    Gulino, Andrea
    Canakoglu, Arif
    Ceri, Stefano
    Ardagna, Danilo
    2020 IEEE 28TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2020), 2020, : 167 - +
  • [7] CHOPPER: Optimizing Data Partitioning for In-Memory Data Analytics Frameworks
    Paul, Arnab Kumar
    Zhuang, Wenjie
    Xu, Luna
    Li, Min
    Rafique, M. Mustafa
    Butt, Ali R.
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 110 - 119
  • [8] CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
    Duan, Shaohua
    Subedi, Pradeep
    Davis, Philip
    Teranishi, Keita
    Kolla, Hemanth
    Gamell, Marc
    Parashar, Manish
    ACM TRANSACTIONS ON PARALLEL COMPUTING, 2020, 7 (02)
  • [9] An In-Memory based Framework for Scientific Data Analytics
    Elia, Donatello
    Fiore, Sandro
    D'Anca, Alessandro
    Palazzo, Cosimo
    Foster, Ian
    Williams, Dean N.
    PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS (CF'16), 2016, : 424 - 429
  • [10] In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model
    Huang, Wei
    Meng, Lingkui
    Zhang, Dongying
    Zhang, Wen
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2017, 10 (01) : 3 - 19