Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliation
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Source
IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011) | 2011
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has recently attracted significant attention in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including pairs that do not match exactly. In this paper, we propose a new similarity metric, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
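The idea of letting tokens match approximately inside a token-based measure can be illustrated with a small sketch. It is not the paper's signature-based join and is not taken from the paper's text. It assumes that two tokens may be paired when their normalized edit similarity reaches a threshold delta, that the "fuzzy overlap" is the best total weight of a one-to-one pairing, and that this overlap replaces the exact intersection size in the Jaccard formula. The function names (token_sim, fuzzy_overlap, fuzzy_jaccard) and the threshold value are hypothetical.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def token_sim(a, b):
    """Normalized edit similarity in [0, 1] (assumed token-level measure)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(t1, t2, delta):
    """Best total weight of a one-to-one token pairing.

    Pairs below the threshold delta contribute nothing. This brute-force
    search is exponential, so it only suits a handful of tokens.
    """
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = 0.0
        for a, b in zip(small, perm):
            s = token_sim(a, b)
            if s >= delta:
                total += s
        best = max(best, total)
    return best


def fuzzy_jaccard(s1, s2, delta=0.8):
    """Jaccard with the exact intersection replaced by the fuzzy overlap."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)


if __name__ == "__main__":
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=0.8))
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=1.0))
```

With delta set to 1.0 only identical tokens can pair, so the sketch reduces to plain Jaccard similarity. A lower delta lets a misspelled token such as "macgrady" still contribute to the overlap.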
Pages: 458-469
Page count: 12