A unified framework for string similarity search with edit-distance constraint

Authors
Minghe Yu
Jin Wang
Guoliang Li
Yong Zhang
Dong Deng
Jianhua Feng
Affiliation
[1] Department of Computer Science and Technology, Tsinghua University
Source
The VLDB Journal | 2017 / Vol. 26
Keywords
Similarity search; Edit distance; Top-k; Disk-based method; Partition
DOI
Not available
Abstract
String similarity search is a fundamental operation in data cleaning and integration. It has two variants: threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient for either the former or the latter; most of them cannot support both variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then, we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings that are likely to be similar to the query, use these strings to estimate an upper bound for pruning dissimilar strings, and propose an algorithm (HS-Topk). We develop effective pruning techniques to further improve the performance. To support large data sets, we extend our techniques to the disk-based setting. Experimental results on real-world data sets show that our method achieves high performance on the two problems and outperforms state-of-the-art algorithms by 5–10 times.
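To make the two problem variants concrete, the following is a minimal Python sketch of brute-force baselines for threshold-based and top-k search under edit distance, together with a recursive partition of a string into disjoint segments. It is not the paper's HS-Search or HS-Topk algorithm, and it builds no index. The function names and the halving rule in `partition` are assumptions made for illustration; the abstract does not specify how segments are cut.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def threshold_search(strings, query, tau):
    """Baseline: every string within edit distance tau of the query."""
    return [s for s in strings if edit_distance(s, query) <= tau]


def topk_search(strings, query, k):
    """Baseline: the k strings with the smallest edit distance to the query."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(s, query))


def partition(s, levels):
    """Hypothetical halving rule: level i holds 2**i disjoint segments of s."""
    tree = [[s]]
    for _ in range(levels):
        nxt = []
        for seg in tree[-1]:
            mid = len(seg) // 2
            nxt.extend([seg[:mid], seg[mid:]])
        tree.append(nxt)
    return tree
```

Both baselines compute the edit distance to every string in the collection, which is the cost an index is meant to avoid. Segment partitioning helps because of a pigeonhole argument: if a string is cut into tau + 1 disjoint segments and lies within edit distance tau of the query, at least one segment must appear unchanged as a substring of the query, so strings sharing no segment with the query can be skipped without computing the full distance.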
Pages: 249-274
Page count: 25