Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

Authors
Yu, Yuan [1]
Gunda, Pradeep Kumar [1]
Isard, Michael [1]
Affiliations
[1] Microsoft Res, Mountain View, CA 94043 USA
Keywords
Distributed programming; cloud computing; concurrency
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations and the efficiency of their implementation are of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state-of-the-art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that the degree of language integration between user-defined functions and the high-level query language affects code legibility and simplicity; that the choice of programming interface has a material effect on the performance of computations; that some execution plans perform better than others on average; and that, to get good performance on a variety of workloads, a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
Pages: 247-260
Page count: 14