Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1 ,2 ]
Yu, Haiyang [1 ]
Weng, Wei [3 ]
He, Xianmang [4 ]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014 / Vol. 8422
Keywords
Similarity join; big data; MapReduce; data cleaning
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
Pages: 328 - 342
Page count: 15
Related papers
50 in total
  • [21] Efficient large-scale distance-based join queries in SpatialHadoop
    Garcia-Garcia, Francisco
    Corral, Antonio
    Iribarne, Luis
    Vassilakopoulos, Michael
    Manolopoulos, Yannis
    GEOINFORMATICA, 2018, 22 (02) : 171 - 209
  • [22] Efficient large-scale distance-based join queries in SpatialHadoop
    Francisco García-García
    Antonio Corral
    Luis Iribarne
    Michael Vassilakopoulos
    Yannis Manolopoulos
    GeoInformatica, 2018, 22 : 171 - 209
  • [23] An Edit-Distance Model for the Approximate Matching of Timed Strings
    Dobrisek, Simon
    Zibert, Janez
    Pavesic, Nikola
    Mihelic, France
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2009, 31 (04) : 736 - 741
  • [24] Indexing based on edit-distance matching of shape graphs
    Tirthapura, S
    Sharvit, D
    Klein, P
    Kimia, BB
    MULTIMEDIA STORAGE AND ARCHIVING SYSTEMS III, 1998, 3527 : 25 - 36
  • [25] Distributed Entity Resolution Based on Similarity Join for Large-Scale Data Clustering
    Nie, Tiezheng
    Lee, Wang-chien
    Shen, Derong
    Yu, Ge
    Kou, Yue
    WEB-AGE INFORMATION MANAGEMENT, WAIM 2014, 2014, 8485 : 138 - 149
  • [26] Unified Compression-Based Acceleration of Edit-Distance Computation
    Hermelin, Danny
    Landau, Gad M.
    Landau, Shir
    Weimann, Oren
    ALGORITHMICA, 2013, 65 (02) : 339 - 353
  • [27] Computing the Shortest String and the Edit-Distance for Parsing Expression Languages
    Cheon, Hyunjoon
    Han, Yo-Sub
    DEVELOPMENTS IN LANGUAGE THEORY, DLT 2020, 2020, 12086 : 43 - 54
  • [28] Parameter-specific FPGA implementation of edit-distance calculation
    Kent, Kenneth B.
    Proudfoot, Ryan B.
    Zhao, Yong
    SEVENTEENTH IEEE INTERNATIONAL WORKSHOP ON RAPID SYSTEM PROTOTYPING, 2006, : 209 - +
  • [29] Unified Compression-Based Acceleration of Edit-Distance Computation
    Danny Hermelin
    Gad M. Landau
    Shir Landau
    Oren Weimann
    Algorithmica, 2013, 65 : 339 - 353
  • [30] THE EDIT-DISTANCE BETWEEN A REGULAR LANGUAGE AND A CONTEXT-FREE LANGUAGE
    Han, Yo-Sub
    Ko, Sang-Ki
    Salomaa, Kai
    INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2013, 24 (07) : 1067 - 1082