Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1 ,2 ]
Yu, Haiyang [1 ]
Weng, Wei [3 ]
He, Xianmang [4 ]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014 / Vol. 8422
Keywords
Similarity join; big data; Map Reduce; data cleaning;
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to large-scale data using the Map-Reduce framework. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
Pages: 328 - 342
Page count: 15
Related papers
50 in total
  • [31] EFFICIENT STRING EDIT SIMILARITY JOIN ALGORITHM
    Gouda, Karam
    Rashad, Metwally
    COMPUTING AND INFORMATICS, 2017, 36 (03) : 683 - 704
  • [32] Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm
    Aguado-Puig, Quim
    Marco-Sola, Santiago
    Moure, Juan Carlos
    Castells-Rufas, David
    Alvarez, Lluc
    Espinosa, Antonio
    Moreto, Miquel
    IEEE ACCESS, 2022, 10 : 63782 - 63796
  • [33] Discovering Shape Classes using Tree Edit-Distance and Pairwise Clustering
    Andrea Torsello
    Antonio Robles-Kelly
    Edwin R. Hancock
    International Journal of Computer Vision, 2007, 72 : 259 - 285
  • [34] Top-down tree edit-distance of regular tree languages
    Sang-Ki Ko
    Yo-Sub Han
    Kai Salomaa
    International Journal of Advances in Engineering Sciences and Applied Mathematics, 2019, 11 : 2 - 10
  • [35] A Survey on Tree Edit Distance Lower Bound Estimation Techniques for Similarity Join on XML Data
    Li, Fei
    Wang, Hongzhi
    Li, Jianzhong
    Gao, Hong
    SIGMOD RECORD, 2013, 42 (04) : 29 - 39
  • [36] Forced-Alignment and Edit-Distance Scoring for Vocabulary Tutoring Applications
    Pakhomov, Serouei
    Richardson, Jayson
    Finholt-Daniel, Matt
    Sales, Gregory
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 443 - +
  • [37] Graph Similarity Search with Edit Distance Constraint in Large Graph Databases
    Zheng, Weiguo
    Zou, Lei
    Lian, Xiang
    Wang, Dong
    Zhao, Dongyan
    PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 1595 - 1600
  • [38] Discovering shape classes using tree edit-distance and pairwise clustering
    Torsello, Andrea
    Robles-Kelly, Antonio
    Hancock, Edwin R.
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2007, 72 (03) : 259 - 285
  • [39] Top-Down Tree Edit-Distance of Regular Tree Languages
    Ko, Sang-Ki
    Han, Yo-Sub
    Salomaa, Kai
    LANGUAGE AND AUTOMATA THEORY AND APPLICATIONS (LATA 2014), 2014, 8370 : 466 - 477
  • [40] Top-down tree edit-distance of regular tree languages
    Ko, Sang-Ki
    Han, Yo-Sub
    Salomaa, Kai
    INTERNATIONAL JOURNAL OF ADVANCES IN ENGINEERING SCIENCES AND APPLIED MATHEMATICS, 2019, 11 (01) : 2 - 10