Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join

Authors
Wang, Jiannan [1]
Li, Guoliang [1]
Feng, Jianhua [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
Source
IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011) | 2011
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including pairs that do not match exactly. In this paper, we propose a new similarity metric, called "fuzzy token matching based similarity", which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address it. We propose new signature schemes and develop effective pruning techniques to improve performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
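To make the idea of token-based similarity with fuzzy token matches concrete, here is a minimal sketch of one plausible fuzzy variant of Jaccard similarity. The abstract gives no formulas, so everything specific here is an assumption for illustration: whitespace tokenization, normalized edit similarity as the token-level measure, the threshold `delta`, the exhaustive maximum-weight matching, and all function names. It does not implement the signature schemes or pruning techniques mentioned in the abstract.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed token-level measure: 1 - edit_distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(left: list, right: list, delta: float) -> float:
    """Weight of the best one-to-one token matching.

    Token pairs with edit similarity below delta (an assumed threshold)
    contribute nothing. The search is exhaustive over subsets of `right`,
    so it is only suitable for short token lists.
    """
    weights = [[edit_similarity(x, y) for y in right] for x in left]
    weights = [[w if w >= delta else 0.0 for w in row] for row in weights]

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        if i == len(left):
            return 0.0
        score = best(i + 1, used)  # leave left[i] unmatched
        for j in range(len(right)):
            if not used & (1 << j) and weights[i][j] > 0.0:
                score = max(score, weights[i][j] + best(i + 1, used | (1 << j)))
        return score

    return best(0, 0)


def fuzzy_jaccard(s1: str, s2: str, delta: float = 0.8) -> float:
    """Jaccard-style ratio with the exact overlap replaced by fuzzy overlap."""
    left, right = s1.lower().split(), s2.lower().split()
    if not left and not right:
        return 1.0
    overlap = fuzzy_overlap(left, right, delta)
    return overlap / (len(left) + len(right) - overlap)


if __name__ == "__main__":
    print(fuzzy_jaccard("nba mcgrady", "macgrady nba"))
```

Under these assumptions, "mcgrady" and "macgrady" differ by one edit, so the pair still contributes to the overlap, whereas exact-token Jaccard would count only "nba" as shared.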
Pages: 458 - 469
Page count: 12
Related papers
50 in total
  • [31] QSJoin: a new string similarity join method based on Q-sample and statistical features
    Wang, Xiaoxia
    Sun, Decai
    Wu, Bo
    Ji, Puzhao
    INTERNATIONAL JOURNAL OF ARTS AND TECHNOLOGY, 2019, 11 (03) : 285 - 308
  • [32] Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes
    Yu C.-Y.
    Wang W.-H.
    Wen X.-J.
    Zhao Y.-H.
    Dongbei Daxue Xuebao/Journal of Northeastern University, 2022, 43 (03) : 321 - 327
  • [33] State-of-the-art in String Similarity Search and Join
    Wandelt, Sebastian
    Deng, Dong
    Gerdjikov, Stefan
    Mishra, Shashwat
    Mitankin, Petar
    Patil, Manish
    Siragusa, Enrico
    Tiskin, Alexander
    Wang, Wei
    Wang, Jiaying
    Leser, Ulf
    SIGMOD RECORD, 2014, 43 (01) : 64 - 76
  • [34] TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching
    Zeakis, Alexandros
    Skoutas, Dimitrios
    Sacharidis, Dimitris
    Papapetrou, Odysseas
    Koubarakis, Manolis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2022, 16 (04) : 790 - 802
  • [35] An Efficient Similarity Join Algorithm with Cosine Similarity Predicate
    Lee, Dongjoo
    Park, Jaehui
    Shim, Junho
    Lee, Sang-goo
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT 2, 2010, 6262 : 422 - +
  • [36] Efficient Privacy Preserving Protocols for Similarity Join
    Hawashin, Bilal
    Fotouhi, Farshad
    Truta, Traian Marius
    Grosky, William
    TRANSACTIONS ON DATA PRIVACY, 2012, 5 (01) : 297 - 330
  • [37] Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints
    Wang, Jiannan
    Feng, Jianhua
    Li, Guoliang
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01) : 1219 - 1230
  • [38] I/O-Efficient Similarity Join
    Pagh, Rasmus
    Pham, Ninh
    Silvestri, Francesco
    Stockel, Morten
    ALGORITHMS - ESA 2015, 2015, 9294 : 941 - 952
  • [39] I/O-Efficient Similarity Join
    Pagh, Rasmus
    Pham, Ninh
    Silvestri, Francesco
    Stockel, Morten
    ALGORITHMICA, 2017, 78 (04) : 1263 - 1283