Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

Authors
Wolfrath, Joel [1]
Chandra, Abhishek [1]
Affiliations
[1] Univ Minnesota, Minneapolis, MN 55417 USA
Keywords
Join Algorithms; Distributed Systems; Query Optimization; Wide Area Network; SAMPLES; BIG
DOI
10.1145/3620678.3624643
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, because of their large output size and the amount of data that must be shuffled over the network. Join sampling (computing a uniform sample from the join results) is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
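The idea of join sampling mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the Plexus algorithm, and the abstract does not say which existing algorithm the authors generalize; the code below assumes a classic weighted approach in which each left row is picked with probability proportional to the number of right rows it joins with, and a matching right row is then picked uniformly. The function name `weighted_join_sample`, the `(key, payload)` row layout, and the toy relations are all hypothetical.

```python
import random
from collections import defaultdict


def weighted_join_sample(left, right, n, rng):
    """Draw n tuples, with replacement, uniformly from left JOIN right.

    Rows are (key, payload) tuples and the join is an equi-join on key.
    Illustrative sketch only; not the algorithm used by Plexus.
    """
    # Index the right relation by join key.
    index = defaultdict(list)
    for row in right:
        index[row[0]].append(row)

    # Weight of a left row = number of right rows it joins with.
    weights = [len(index[row[0]]) if row[0] in index else 0 for row in left]
    if sum(weights) == 0:
        return []  # the join is empty, so there is nothing to sample

    # Step 1: pick left rows proportionally to their weights.
    picks = rng.choices(left, weights=weights, k=n)
    # Step 2: for each picked left row, pick one matching right row uniformly.
    return [(l, rng.choice(index[l[0]])) for l in picks]


# Toy relations: "a" has 1 match, "b" has 3 matches, "c" has none.
left = [("a", "l1"), ("b", "l2"), ("c", "l3")]
right = [("a", "r1"), ("b", "r2"), ("b", "r3"), ("b", "r4")]

sample = weighted_join_sample(left, right, 5, random.Random(0))
for pair in sample:
    print(pair)
```

Each of the four join results is returned with probability (weight / total weight) x (1 / weight) = 1/4, which is what makes the sample uniform. Sampling `left` and `right` independently and then joining the two samples gives no such guarantee: result tuples that share an input row are kept or dropped together, so the output tuples are not independent, which is the difficulty the abstract points to.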
Pages: 1-16
Number of pages: 16