Large-Scale Similarity Join with Edit-Distance Constraints

Cited by: 0
Authors
Lin, Chen [1, 2]
Yu, Haiyang [1]
Weng, Wei [3]
He, Xianmang [4]
Affiliations
[1] Xiamen Univ, Sch Informat Sci & Technol, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[3] Xiamen Univ Technol, Sch Comp & Informat Engn, Xiamen 361024, Peoples R China
[4] Ningbo Univ, Sch Informat & Technol, Ningbo 315122, Peoples R China
Source
DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2014, PT II | 2014, Vol. 8422
Keywords
Similarity join; big data; MapReduce; data cleaning
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
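The abstract names a partition-based join but this record gives no algorithmic detail. As a rough illustration only, the sketch below shows the general pigeonhole idea that partition-based filters commonly rely on: if two strings are within edit distance tau, then after splitting one of them into tau + 1 segments, at least one segment must occur unchanged as a substring of the other. This is not the authors' improved algorithm or their MapReduce extension; the function names (edit_distance, segments, similarity_join) and the near-equal-length splitting rule are assumptions made for the example.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def segments(s: str, tau: int) -> list:
    """Split s into tau + 1 contiguous segments of near-equal length
    (hypothetical splitting rule chosen for this sketch)."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        out.append(s[pos:pos + size])
        pos += size
    return out


def similarity_join(strings, tau):
    """Return all pairs (r, s) with edit_distance(r, s) <= tau.

    Candidate pairs are pruned by a length filter and a pigeonhole
    segment filter before the exact distance is verified.
    """
    results = []
    for r, s in combinations(strings, 2):
        # Length filter: a length gap above tau already exceeds the threshold.
        if abs(len(r) - len(s)) > tau:
            continue
        # Segment filter: some segment of s must appear verbatim in r.
        if not any(seg in r for seg in segments(s, tau)):
            continue
        # Verification: compute the exact edit distance.
        if edit_distance(r, s) <= tau:
            results.append((r, s))
    return results


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "database"]
    print(similarity_join(words, tau=1))
```

The sketch still enumerates every pair, so it only shows why the filter is safe. A practical join would index the segments (for example in an inverted index keyed by segment) so that candidate pairs are found without a quadratic scan.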
Pages: 328-342
Page count: 15