Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations

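Authors
Yu, Yuan [1]
Gunda, Pradeep Kumar [1]
Isard, Michael [1]
Affiliations
[1] Microsoft Research, Mountain View, CA 94043, USA
Keywords
Distributed programming; cloud computing; concurrency
DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract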
Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations, and the efficiency of their implementation, are of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state-of-the-art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that: the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; the choice of programming interface has a material effect on the performance of computations; some execution plans perform better than others on average; and, to get good performance on a variety of workloads, a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
Pages: 247-260
Page count: 14