Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Source
IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011) | 2011
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including pairs that do not match exactly. In this paper, we propose a new similarity metric, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
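The abstract does not give a formula for the metric, so the following is only a minimal sketch of how a fuzzy-token Jaccard similarity could be computed, not the paper's algorithm. It assumes that two tokens may match when their normalized edit similarity reaches a threshold `delta`, and that the "fuzzy overlap" of two token sets is the weight of a maximum-weight one-to-one matching between them. The function names, the whitespace tokenizer, and the default `delta` are hypothetical choices made for illustration; the signature schemes and pruning techniques mentioned in the abstract are not modeled here.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed token similarity: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(tokens_a, tokens_b, delta):
    """Weight of a maximum-weight one-to-one matching between the tokens.

    Pairs whose similarity is below delta are not allowed to match.
    Exact bitmask search, so only suitable for short token lists.
    """
    weights = []
    for x in tokens_a:
        row = []
        for y in tokens_b:
            sim = edit_similarity(x, y)
            row.append(sim if sim >= delta else 0.0)
        weights.append(row)

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == len(tokens_a):
            return 0.0
        result = best(i + 1, used)  # leave token i unmatched
        for j, w in enumerate(weights[i]):
            if w > 0.0 and not used & (1 << j):
                result = max(result, w + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)


def fuzzy_jaccard(s1: str, s2: str, delta: float = 0.8) -> float:
    """Jaccard similarity with the exact overlap replaced by the fuzzy one."""
    a, b = s1.split(), s2.split()
    if not a and not b:
        return 1.0
    overlap = fuzzy_overlap(a, b, delta)
    return overlap / (len(a) + len(b) - overlap)


if __name__ == "__main__":
    # Exact-token Jaccard would score this pair 1/3, because "mcgrady"
    # and "macgrady" are different tokens.
    print(fuzzy_jaccard("nba mcgrady", "nba macgrady", delta=0.8))
```

Under these assumptions the pair above scores roughly 0.88, whereas raising `delta` to 0.9 forbids the near-match and the score falls back to the exact-token value of 1/3. The exhaustive matching is chosen for brevity; a similarity join over large string sets would need filtering of the kind the abstract describes rather than comparing every pair.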
Pages: 458-469
Page count: 12
Related papers
  • [1] Fast-join: An efficient method for fuzzy token matching based string similarity join
    Wang, Jiannan
    Li, Guoliang
    Feng, Jianhua
    Proceedings - International Conference on Data Engineering, 2011, : 458 - 469
  • [2] Extending String Similarity Join to Tolerant Fuzzy Token Matching
    Wang, Jiannan
    Li, Guoliang
    Feng, Jianhua
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2014, 39 (01):
  • [3] MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering
    Wang, Jin
    Lin, Chunbin
    Zaniolo, Carlo
    2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 386 - 397
  • [4] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [5] Efficient and Scalable Processing of String Similarity Join
    Rong, Chuitian
    Lu, Wei
    Wang, Xiaoli
    Du, Xiaoyong
    Chen, Yueguo
    Tung, Anthony K. H.
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2013, 25 (10) : 2217 - 2230
  • [6] Trie-join: a trie-based method for efficient string similarity joins
    Jianhua Feng
    Jiannan Wang
    Guoliang Li
    The VLDB Journal, 2012, 21 : 437 - 461
  • [7] Trie-join: a trie-based method for efficient string similarity joins
    Feng, Jianhua
    Wang, Jiannan
    Li, Guoliang
    VLDB JOURNAL, 2012, 21 (04): : 437 - 461
  • [8] BipartiteJoin: Optimal Similarity Join for Fuzzy Bipartite Matching
    Rozinek, Ondrej
    Borkovcova, Monika
    Mares, Jan
    GOOD PRACTICES AND NEW PERSPECTIVES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 6, WORLDCIST 2024, 2024, 990 : 171 - 180
  • [9] Hashed-Join: Approximate String Similarity Join with Hashing
    Yuan, Peisen
    Sha, Chaofeng
    Sun, Yi
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, 2014, 8505 : 217 - 229
  • [10] LS-Join: Local Similarity Join on String Collections
    Wang, Jiaying
    Yang, Xiaochun
    Wang, Bin
    Liu, Chengfei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017, 29 (09) : 1928 - 1942