Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1, 2]
Yu, Haiyang [1]
Weng, Wei [3]
He, Xianmang [4]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Keywords
Similarity join; big data; MapReduce; data cleaning
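DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract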
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
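As a rough illustration of what a similarity join with an edit-distance constraint computes, the following Python sketch pairs a simple segment filter with exact verification. It is not the authors' algorithm: the function names (`edit_distance_within`, `segments`, `similarity_join`), the even-partition scheme, and the pairwise loop are assumptions chosen for brevity. The filter rests on the pigeonhole argument commonly used by partition-based methods: if a string is split into tau + 1 disjoint segments, at most tau edits can touch at most tau of them, so any string within distance tau must contain at least one segment unchanged.

```python
from itertools import combinations


def edit_distance_within(a, b, tau):
    """Return True if the Levenshtein distance between a and b is <= tau."""
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        if min(cur) > tau:  # every alignment already costs more than tau
            return False
        prev = cur
    return prev[-1] <= tau


def segments(s, tau):
    """Split s into tau + 1 disjoint contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)  # last `extra` segments are longer
        out.append(s[pos:pos + length])
        pos += length
    return out


def similarity_join(strings, tau):
    """Return all pairs (r, s) from strings whose edit distance is <= tau."""
    pairs = []
    for i, j in combinations(range(len(strings)), 2):
        r, s = strings[i], strings[j]
        if abs(len(r) - len(s)) > tau:  # length filter
            continue
        if not any(seg in r for seg in segments(s, tau)):  # segment filter
            continue
        if edit_distance_within(r, s, tau):  # exact verification
            pairs.append((r, s))
    return pairs
```

For example, `similarity_join(["vldb", "pvldb", "icde", "sigmod"], 1)` yields the single pair `("vldb", "pvldb")`. A practical implementation would replace the pairwise loop with an inverted index over segments, and a distributed version would presumably shuffle records by segment key; both are beyond this sketch.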
Pages: 328-342
Number of pages: 15
Related papers
50 in total
  • [1] Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints
    Wang, Jiannan
    Feng, Jianhua
    Li, Guoliang
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01): 1219 - 1230
  • [2] Top-k String Similarity Search with Edit-Distance Constraints
    Deng, Dong
    Li, Guoliang
    Feng, Jianhua
    Li, Wen-Syan
    2013 IEEE 29TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2013: 925 - 936
  • [3] Invariance of edit-distance to tempo in rhythm similarity
    Moritz, Matthew
    Heard, Matthew
    Kim, Hyun-Woong
    Lee, Yune S.
    PSYCHOLOGY OF MUSIC, 2021, 49 (06): 1671 - 1685
  • [4] A Partition-Based Method for String Similarity Joins with Edit-Distance Constraints
    Li, Guoliang
    Deng, Dong
    Feng, Jianhua
    ACM TRANSACTIONS ON DATABASE SYSTEMS, 2013, 38 (02)
  • [5] Ed-Join: An Efficient Algorithm for Similarity Joins With Edit Distance Constraints
    Xiao, Chuan
    Wang, Wei
    Lin, Xuemin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2008, 1 (01): 933 - 944
  • [6] A unified framework for string similarity search with edit-distance constraint
    Yu, Minghe
    Wang, Jin
    Li, Guoliang
    Zhang, Yong
    Deng, Dong
    Feng, Jianhua
    VLDB JOURNAL, 2017, 26 (02): 249 - 274
  • [7] A unified framework for string similarity search with edit-distance constraint
    Minghe Yu
    Jin Wang
    Guoliang Li
    Yong Zhang
    Dong Deng
    Jianhua Feng
    The VLDB Journal, 2017, 26: 249 - 274
  • [8] Summarization of Multidimensional Process Traces for Analysis under Edit-distance Constraints
    Nguyen, Phuong
    Isahagian, Vatche
    Muthusamy, Vinod
    Slominski, Aleksander
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON SERVICES COMPUTING (SCC 2020), 2020: 466 - 468
  • [9] Edit-distance of weighted automata
    Mohri, M
    IMPLEMENTATION AND APPLICATION OF AUTOMATA, 2003, 2608: 1 - 23
  • [10] Malleable Coding with Edit-Distance Cost
    Varshney, Lav R.
    Kusuma, Julius
    Goyal, Vivek K.
    2009 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, VOLS 1-4, 2009: 204 - +