Extending the Bag Distance for String Similarity Search

Authors
Mergen S. [1]
Affiliation
[1] Departamento de Linguagens e Sistemas de Computação, Universidade Federal de Santa Maria, Avenida Roraima, Rio Grande do Sul, Santa Maria
Keywords
Bag Distance; Edit Distance; Metric spaces; String similarity
DOI
10.1007/s42979-022-01502-5
Abstract
In the context of string similarity search, the Edit Distance is the preferred choice for indexes based on a metric space. However, the high distances between strings lead to indexes with low pruning factors. In addition, computing the distances is time-consuming. An alternative is the Bag Distance, whose computational cost is lower. In this paper, we propose an extension of the Bag Distance (the Anagram Distance) that allows non-uniform costs. The extension is more compatible with the Edit Distance and its applications. We also transform the index space into one that uses an Anagram Distance as the metric function, leaving the Edit Distance computation to a validation phase. As we describe, the transformation increases the pruning factor of in-memory indexes, especially when the costs are non-uniform. Experiments report the improvements achieved during search, both in terms of execution time and the number of distance computations. © 2022, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
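The filter-then-validate idea in the abstract can be illustrated with a minimal Python sketch. It uses the classic uniform-cost Bag Distance (the larger of the two multiset differences), which never exceeds the unit-cost Edit Distance and can therefore discard candidates safely. This is not the paper's Anagram Distance: the non-uniform-cost extension and the metric index are not reproduced here. The function names and the linear scan in `range_search` are assumptions made purely for illustration.

```python
from collections import Counter


def bag_distance(x: str, y: str) -> int:
    """Uniform-cost Bag Distance: max of the two multiset differences."""
    cx, cy = Counter(x), Counter(y)
    # Counter subtraction keeps only positive counts, i.e. a multiset difference.
    return max(sum((cx - cy).values()), sum((cy - cx).values()))


def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def range_search(query: str, strings: list[str], radius: int):
    """Hypothetical linear scan: cheap bag filter, then edit-distance validation.

    Returns the matching strings and the number of edit-distance computations.
    """
    results, verified = [], 0
    for s in strings:
        # bag_distance <= edit_distance, so a bag distance above the radius
        # already rules the candidate out.
        if bag_distance(query, s) > radius:
            continue
        verified += 1
        if edit_distance(query, s) <= radius:
            results.append(s)
    return results, verified


if __name__ == "__main__":
    words = ["sitting", "mitten", "banana", "kitchen"]
    print(range_search("kitten", words, 2))
```

In this toy run, two of the four candidates are rejected by the bag filter alone, so only two Edit Distance computations are needed. A real metric index would prune whole subtrees rather than scanning every string.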