Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Cited by: 72
Authors
Gufler, Benjamin [1]
Augsten, Nikolaus [2]
Reiser, Angelika [1]
Kemper, Alfons [1]
Affiliations
[1] Tech Univ Munich, Boltzmannstr 3, D-85748 Garching, Germany
[2] Free Univ Bozen Bolzano, I-39100 Bolzano, Italy
DOI
10.1109/ICDE.2012.58
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
MapReduce has emerged as a popular tool for distributed and scalable processing of massive data sets and is increasingly being used in e-science applications. Unfortunately, the performance of MapReduce systems strongly depends on an even data distribution, while scientific data sets are often highly skewed. The resulting load imbalance, which increases the processing time, is further amplified by the high runtime complexities of the reducer tasks. An adaptive load balancing strategy is required for appropriate skew handling. In this paper, we address the problem of estimating the cost of the tasks that are distributed to the reducers based on a given cost model. A realistic cost estimation is the basis for adaptive load balancing algorithms and requires gathering statistics from the mappers. This is challenging: (a) Since the statistics from all mappers must be integrated, the mapper statistics must be small. (b) Although each mapper sees only a small fraction of the data, the integrated statistics must capture the global data distribution. (c) The mappers terminate after sending the statistics to the controller, and no second round is possible. Our solution to these challenges consists of two components. First, a monitoring component executed on every mapper captures the local data distribution and identifies its most relevant subset for cost estimation. Second, an integration component aggregates these subsets and approximates the global data distribution.
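The two components described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the top-k selection rule, the uniform treatment of unreported keys, and the quadratic cost model are all assumptions made for illustration. Each mapper reports only its k largest key groups plus a two-number summary of the rest, and the controller merges the reports in a single round.

```python
from collections import Counter


def mapper_summary(keys, k):
    """Hypothetical monitoring step: count keys locally, report only the
    k largest groups plus the size and key count of the remainder."""
    counts = Counter(keys)
    head = dict(counts.most_common(k))
    tail_tuples = sum(counts.values()) - sum(head.values())
    tail_keys = len(counts) - len(head)
    return head, tail_tuples, tail_keys


def integrate(summaries):
    """Hypothetical integration step: sum the reported groups across mappers.
    tail_keys is summed as well, so it is only an upper bound on the number
    of distinct unreported keys (a key may occur on several mappers)."""
    head = Counter()
    tail_tuples = 0
    tail_keys = 0
    for local_head, local_tuples, local_keys in summaries:
        head.update(local_head)  # adds counts key by key
        tail_tuples += local_tuples
        tail_keys += local_keys
    return dict(head), tail_tuples, tail_keys


def estimate_cost(head, tail_tuples, tail_keys, cost=lambda n: n ** 2):
    """Apply an assumed cost model (quadratic by default) to each reported
    group, and treat unreported keys as equally sized groups."""
    total = sum(cost(size) for size in head.values())
    if tail_keys > 0:
        total += tail_keys * cost(tail_tuples / tail_keys)
    return total


# Two mappers, each reporting only its single largest key group (k = 1).
summaries = [
    mapper_summary(["a"] * 5 + ["b"] * 2 + ["c"], k=1),
    mapper_summary(["a"] * 3 + ["d"], k=1),
]
head, tail_tuples, tail_keys = integrate(summaries)
estimated = estimate_cost(head, tail_tuples, tail_keys)
```

Each summary holds at most k entries plus two numbers, which keeps the mapper statistics small as in challenge (a), and the controller needs only one report per mapper as in challenge (c). How well the merged head approximates the global distribution, challenge (b), depends on the selection rule, which this sketch keeps deliberately naive.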
Pages: 522-533
Page count: 12