A Comparison of Approaches to Large-Scale Data Analysis

Cited by: 0
Authors
Pavlo, Andrew [1]
Paulson, Erik
Rasin, Alexander [1]
Abadi, Daniel J.
DeWitt, David J.
Madden, Samuel
Stonebraker, Michael
Affiliations
[1] Brown Univ, Providence, RI 02912 USA
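Funding
National Science Foundation (USA)
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract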
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Pages: 165-178
Page count: 14