MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

Citations: 0
Authors
Deng, Dong [1]
Li, Guoliang [1]
Hao, Shuang [1]
Wang, Jiannan [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing 100084, Peoples R China
Keywords
EFFICIENT ALGORITHM; TRIE-JOIN;
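DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract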
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs to significantly reduce the number of key-value pairs, from cubic to linear complexity, while not sacrificing the pruning power. To improve the performance, we incorporate "light-weight" filter units into the key-value pairs, which can be used to prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
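To make the idea of signature-based key-value pairs more concrete, the following is a minimal single-machine sketch of a map/reduce-style self-join under an edit-distance threshold. It is not the MASSJOIN implementation: it omits the key-value merging and the light-weight filter units described in the abstract, and it enumerates probe substrings naively. It relies only on the pigeonhole argument behind partition-based signatures: if a string is cut into `tau + 1` segments and another string is within edit distance `tau` of it, at least one segment must appear unchanged as a substring of the other string. All function names (`segments`, `map_phase`, `reduce_phase`, `similarity_join`) and the key layout `(length, segment_index, text)` are assumptions made for illustration.

```python
from collections import defaultdict


def segments(length, tau):
    """Evenly split positions [0, length) into tau + 1 (start, size) segments."""
    k = tau + 1
    base, extra = divmod(length, k)
    out, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)  # last `extra` segments get +1
        out.append((start, size))
        start += size
    return out


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_phase(records, tau):
    """Emit (key, value) pairs; key = (length, segment index, text)."""
    for sid, s in records:
        n = len(s)
        # Index side: the tau + 1 segments of s itself.
        for i, (start, size) in enumerate(segments(n, tau)):
            yield (n, i, s[start:start + size]), ("seg", sid)
        # Probe side: substrings of s that could equal a segment of some
        # string whose length l satisfies n - tau <= l <= n.
        for l in range(max(0, n - tau), n + 1):
            for i, (_, size) in enumerate(segments(l, tau)):
                for sub in {s[p:p + size] for p in range(n - size + 1)}:
                    yield (l, i, sub), ("sub", sid)


def reduce_phase(pairs):
    """Group by key (the 'shuffle') and pair segment owners with probers."""
    groups = defaultdict(lambda: ([], []))
    for key, (kind, sid) in pairs:
        groups[key][0 if kind == "seg" else 1].append(sid)
    candidates = set()
    for seg_ids, sub_ids in groups.values():
        for a in seg_ids:
            for b in sub_ids:
                if a != b:
                    candidates.add((min(a, b), max(a, b)))
    return candidates


def similarity_join(strings, tau):
    """Return sorted id pairs (a, b), a < b, with edit distance <= tau."""
    candidates = reduce_phase(map_phase(list(enumerate(strings)), tau))
    return sorted(
        (a, b) for a, b in candidates
        if edit_distance(strings[a], strings[b]) <= tau
    )


if __name__ == "__main__":
    data = ["kitten", "sitten", "sitting", "mapreduce", "mapreduces"]
    print(similarity_join(data, tau=1))
```

In this sketch the number of emitted pairs grows quickly with string length and threshold, which is the kind of transmission cost the abstract says the merged key-value pairs are meant to reduce; the final verification step plays the role of removing dissimilar candidates that share a signature by chance.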
Pages: 340-351
Number of pages: 12
Related papers
50 in total
  • [1] An Experimental Survey of MapReduce-Based Similarity Joins
    Silva, Yasin N.
    Reed, Jason
    Brown, Kyle
    Wadsworth, Adelbert
    Rong, Chuitian
    SIMILARITY SEARCH AND APPLICATIONS, SISAP 2016, 2016, 9939 : 181 - 195
  • [2] Efficient and Scalable Graph Similarity Joins in MapReduce
    Chen, Yifan
    Zhao, Xiang
    Xiao, Chuan
    Zhang, Weiming
    Tang, Jiuyang
    SCIENTIFIC WORLD JOURNAL, 2014,
  • [3] Fast and scalable vector similarity joins with MapReduce
    Yang, Byoungju
    Kim, Hyun Joon
    Shim, Junho
    Lee, Dongjoo
    Lee, Sang-goo
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2016, 46 (03) : 473 - 497
  • [4] Practising Scalable Graph Similarity Joins in MapReduce
    Chen, Yifan
    Zhao, Xiang
    Ge, Bin
    Xiao, Chuan
    Chi, Chi-Hung
    2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA (BIGDATA CONGRESS), 2014 : 112 - 119
  • [5] Fast and scalable vector similarity joins with MapReduce
    Byoungju Yang
    Hyun Joon Kim
    Junho Shim
    Dongjoo Lee
    Sang-goo Lee
    Journal of Intelligent Information Systems, 2016, 46 : 473 - 497
  • [6] ScaDiPaSi: An Effective Scalable and Distributable MapReduce-Based Method to Find Patient Similarity on Huge Healthcare Networks
    Barkhordari, Mohammadhossein
    Niamanesh, Mahdi
    BIG DATA RESEARCH, 2015, 2 (01) : 19 - 27
  • [7] MapReduce-based Similarity Measurement for Business Processes
    Gao, Juntao
    Wang, Xueshan
    Wang, Yongan
    INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2016, 16 (03) : 95 - 99
  • [8] Scalable Load Balancing for MapReduce-based Record Linkage
    Yan, Wei
    Xue, Yuan
    Malin, Bradley
    2013 IEEE 32ND INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2013,
  • [9] A scalable MapReduce-based design of an unsupervised entity resolution system
    Hagan, Nicholas Kofi Akortia
    Talburt, John R.
    Anderson, Kris E.
    Hagan, Deasia
    FRONTIERS IN BIG DATA, 2024, 7
  • [10] A MapReduce-Based Distributed SVM for Scalable Data Type Classification
    Jiang, Chong
    Wu, Ting
    Xu, Jian
    Zheng, Ning
    Xu, Ming
    Yang, Tao
    COLLABORATE COMPUTING: NETWORKING, APPLICATIONS AND WORKSHARING, COLLABORATECOM 2016, 2017, 201 : 115 - 126