MassJoin: A MapReduce-based Method for Scalable String Similarity Joins

Cited by: 0
Authors
Deng, Dong [1]
Li, Guoliang [1]
Hao, Shuang [1]
Wang, Jiannan [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing 100084, Peoples R China
Keywords
EFFICIENT ALGORITHM; TRIE-JOIN
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which reduces their number from cubic to linear complexity without sacrificing pruning power. To improve performance, we incorporate "light-weight" filter units into the key-value pairs; these units can prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art approaches.
Pages: 340-351
Page count: 12
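The abstract describes using partition-based signatures as MapReduce keys. The following is a minimal single-machine sketch of that general idea for edit distance only, not the MASSJOIN implementation. It assumes an even partition of each string into tau + 1 segments and relies on the pigeonhole argument that two strings within edit distance tau must share at least one segment. All function names are hypothetical, the map, shuffle, and reduce steps are simulated in memory, and the key-value merging and "light-weight" filter units mentioned in the abstract are not modeled.

```python
from collections import defaultdict
from itertools import product


def segments(length, k):
    """Split range(length) into k contiguous segments; return (start, size) pairs."""
    base, extra = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)  # last `extra` segments are longer
        bounds.append((start, size))
        start += size
    return bounds


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def map_phase(strings, tau):
    """Emit (signature key, (role, string id)) pairs for every string."""
    k = tau + 1
    for sid, s in enumerate(strings):
        # Index side: one key per segment of s.
        for i, (start, size) in enumerate(segments(len(s), k)):
            yield (s[start:start + size], i, len(s)), ("seg", sid)
        # Probe side: substrings of s that could equal segment i of a string
        # whose length differs from len(s) by at most tau.
        for length in range(max(0, len(s) - tau), len(s) + tau + 1):
            for i, (start, size) in enumerate(segments(length, k)):
                lo = max(0, start - tau)
                hi = min(len(s) - size, start + tau)
                for p in range(lo, hi + 1):
                    yield (s[p:p + size], i, length), ("sub", sid)


def similarity_self_join(strings, tau):
    """Return sorted id pairs (a, b), a < b, with edit distance <= tau."""
    # Simulated shuffle: group emitted values by key.
    groups = defaultdict(lambda: {"seg": set(), "sub": set()})
    for key, (role, sid) in map_phase(strings, tau):
        groups[key][role].add(sid)
    # Simulated reduce: pair segment owners with substring owners.
    candidates = set()
    for group in groups.values():
        for a, b in product(group["seg"], group["sub"]):
            if a != b:
                candidates.add((min(a, b), max(a, b)))
    # Verification: keep only candidates that truly satisfy the threshold.
    return sorted(
        pair for pair in candidates
        if edit_distance(strings[pair[0]], strings[pair[1]]) <= tau
    )
```

Calling `similarity_self_join(["kitten", "sitten", "sitting", "mitten", "banana"], 1)` returns the id pairs whose strings are within one edit of each other. In this unmerged form, each string emits one probe key per candidate length, segment, and start position, which is the kind of key-value volume the abstract says MASSJOIN reduces by merging.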