Join processing with threshold-based filtering in MapReduce

Authors
Taewhi Lee
Hye-Chan Bae
Hyoung-Joo Kim
Affiliations
[1] Electronics and Telecommunications Research Institute, BigData Software Platform Research Department
[2] Samsung Electronics Co., Ltd., Media Solution Center
[3] Seoul National University, Department of Computer Science and Engineering
Keywords
Join processing; Threshold-based filtering; MapReduce; Hadoop
Abstract
Data analytics, particularly when it involves heterogeneous data, often requires join operations on datasets collected from different sources. MapReduce, one of the most popular frameworks for large-scale data processing, is not well suited to joining multiple datasets, because it often produces a large number of redundant intermediate results, irrespective of the size of the joined records. Although several existing approaches attempt to reduce the number of such redundant results using Bloom filters, they may be inefficient if a large portion of the records is joined or the number of distinct keys is large. To alleviate this problem, we propose a join processing method with threshold-based filtering in MapReduce, called TMFR-Join, which is an abbreviation for “Threshold-based Map-Filter-Reduce Join”. TMFR-Join applies filters according to their performance, which is estimated in terms of false-positive rates. It also provides a general framework for exploiting various filtering techniques that support certain desired operations. The experimental results indicate that the performance of TMFR-Join is close to that of the better of the existing join processing techniques, with or without filters.
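The idea of applying a filter only when its estimated false-positive rate is acceptable can be illustrated with a small single-process sketch. This is not the TMFR-Join algorithm itself, whose details the abstract does not give. The class `BloomFilter`, the function `filtered_join`, the parameter `fpr_threshold`, and the fill-ratio estimate of the false-positive rate are all assumptions made for illustration, and the "map" and "reduce" steps are simulated with plain Python lists rather than Hadoop.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over string join keys (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str) -> list:
        # Derive k bit positions by double hashing from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

    def estimated_fpr(self) -> float:
        # Standard estimate: (fraction of set bits) ** (number of hashes).
        fill = sum(self.bits) / self.num_bits
        return fill ** self.num_hashes


def filtered_join(left, right, num_bits=1024, num_hashes=3, fpr_threshold=0.1):
    """Equi-join two lists of (key, value) pairs.

    Hypothetical policy: the filter built on `left` is applied to `right`
    only if its estimated false-positive rate is at most `fpr_threshold`.
    Returns (joined rows, whether the filter was used, records "shuffled").
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for key, _ in left:
        bloom.add(key)
    use_filter = bloom.estimated_fpr() <= fpr_threshold

    # Simulated map side: drop right-hand records that cannot join.
    shuffled = [(k, v) for k, v in right
                if not use_filter or bloom.might_contain(k)]

    # Simulated reduce side: a plain hash join on the surviving records.
    by_key = {}
    for k, v in left:
        by_key.setdefault(k, []).append(v)
    joined = [(k, lv, rv) for k, rv in shuffled for lv in by_key.get(k, [])]
    return joined, use_filter, len(shuffled)
```

With a generously sized filter the estimated rate is low and the filter is applied; with a very small filter the estimate exceeds the threshold and every record is passed on unfiltered. The join result is the same in both cases, and only the number of intermediate records differs.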
Pages: 793-813
Page count: 20