Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

Cited by: 0
Authors
Wolfrath, Joel [1]
Chandra, Abhishek [1]
Affiliations
[1] Univ Minnesota, Minneapolis, MN 55417 USA
Keywords
Join Algorithms; Distributed Systems; Query Optimization; Wide Area Network; SAMPLES; BIG
DOI
10.1145/3620678.3624643
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, because of their large output size and the amount of data that must be shuffled over the network. Join sampling, which computes a uniform sample from the join results, is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
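As a rough illustration of what "a uniform sample from the join results" means, the following is a minimal single-machine sketch of weighted join sampling for two tables. It is not the algorithm from the paper or the Plexus implementation; the function name `sample_join` and its parameters are hypothetical. Each tuple of R is drawn with probability proportional to the number of S tuples it joins with, and then one matching S tuple is drawn uniformly, so every join result is equally likely.

```python
import random
from collections import defaultdict


def sample_join(r_rows, s_rows, key_r, key_s, k, rng=None):
    """Draw k join results uniformly (with replacement) from R join S
    without materializing the full join. Illustrative sketch only."""
    rng = rng or random.Random()

    # Index S by join key so matches can be looked up directly.
    s_index = defaultdict(list)
    for s in s_rows:
        s_index[key_s(s)].append(s)

    # Weight of an R tuple = number of S tuples it joins with.
    weights = [len(s_index.get(key_r(r), [])) for r in r_rows]
    if sum(weights) == 0:
        return []  # empty join, nothing to sample

    # Pick R tuples proportionally to their weights, then pick one
    # matching S tuple uniformly. A pair (r, s) is chosen with
    # probability (w_r / W) * (1 / w_r) = 1 / W, where W = |R join S|.
    picked = rng.choices(r_rows, weights=weights, k=k)
    return [(r, rng.choice(s_index[key_r(r)])) for r in picked]


if __name__ == "__main__":
    R = [("a", 1), ("b", 2), ("c", 3)]
    S = [(1, "x"), (1, "y"), (2, "z")]
    print(sample_join(R, S, lambda r: r[1], lambda s: s[0], k=5))
```

By contrast, sampling R and S independently and joining the two samples would give a join result a chance of appearing that depends on how many tuples share its key, which is the non-uniformity the abstract refers to.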
Pages: 1-16
Number of pages: 16
Related papers
50 in total
  • [21] Towards WAN-Aware Join Sampling over Geo-Distributed Data
    Kumar, Dhruv
    Wolfrath, Joel
    Chandra, Abhishek
    Sitaraman, Ramesh K.
    PROCEEDINGS OF THE 5TH INTERNATIONAL WORKSHOP ON EDGE SYSTEMS, ANALYTICS AND NETWORKING (EDGESYS'22), 2022, : 13 - 18
  • [22] Optimizing Geo-Distributed Data Processing with Resource Heterogeneity over the Internet
    Marzuni, Saeed mirpour
    Toosi, Adel
    Savadi, Abdorreza
    Naghibzadeh, Mahmud
    Taniar, David
    ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2025, 25 (01)
  • [23] Geo-Distributed IoT Data Analytics With Deadline Constraints Across Network Edge
    Chen, Yiting
    Luo, Lailong
    Ren, Bangbang
    Guo, Deke
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (22) : 22914 - 22929
  • [24] A TTL-based Approach for Data Aggregation in Geo-distributed Streaming Analytics
    Kumar, Dhruv
    Li, Jian
    Chandra, Abhishek
    Sitaraman, Ramesh K.
    PROCEEDINGS OF THE ACM ON MEASUREMENT AND ANALYSIS OF COMPUTING SYSTEMS, 2019, 3 (02)
  • [25] Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics
    Heintz, Benjamin
    Chandra, Abhishek
    Sitaraman, Ramesh K.
    PROCEEDINGS OF THE SEVENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC 2016), 2016, : 361 - 373
  • [26] Optimizing Network Transfers for Data Analytic Jobs Across Geo-Distributed Datacenters
    Chen, Li
    Liu, Shuhao
    Li, Baochun
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (02) : 403 - 414
  • [27] A survey on bandwidth-aware geo-distributed frameworks for big-data analytics
    Bergui, Mohammed
    Najah, Said
    Nikolov, Nikola S.
    Journal of Big Data, 8
  • [28] Run Data Run! Re-distributing Data via Piggybacking for Geo-distributed Data Analytics
    Li, Yefei
    Jin, Yibo
    Chen, Haiyang
    Xi, Wenchao
    Ji, Mingtao
    Zhang, Sheng
    Qian, Zhuzhong
    Lu, Sanglu
    2019 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2019), 2019, : 356 - 363
  • [29] Unicorn: Unified resource orchestration for multi-domain, geo-distributed data analytics
    Xiang, Qiao
    Wang, X. Tony
    Zhang, J. Jensen
    Newman, Harvey
    Yang, Y. Richard
    Liu, Y. Jace
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 93 : 188 - 197
  • [30] Adaptive Partitioning for Large-Scale Graph Analytics in Geo-Distributed Data Centers
    Zhou, Amelie Chi
    Luo, Juanyun
    Qiu, Ruibo
    Tan, Haobin
    He, Bingsheng
    Mao, Rui
    2022 IEEE 38TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2022), 2022, : 2818 - 2830