Extending the Bag Distance for String Similarity Search

Cited by: 0
Author
Mergen S. [1]
Affiliation
[1] Departamento de Linguagens e Sistemas de Computação, Universidade Federal de Santa Maria, Avenida Roraima, Rio Grande do Sul, Santa Maria
Keywords
Bag Distance; Edit Distance; Metric spaces; String similarity
DOI
10.1007/s42979-022-01502-5
Abstract
In the context of string similarity search, the Edit Distance is the preferred choice for indexes based on a metric space. However, the high distances between strings lead to indexes with low pruning factors. Moreover, computing the distances is time-consuming. An alternative is the Bag Distance, whose computational cost is lower. In this paper, we propose an extension of the Bag Distance (the Anagram Distance) that allows non-uniform costs. The extension is more compatible with the Edit Distance and its applications. We also transform the index space into one that uses an Anagram Distance as the metric function, leaving the Edit Distance computation to a validation phase. As we describe, the transformation increases the pruning factor of in-memory indexes, especially when the costs are non-uniform. Experiments report the improvements achieved during search, both in terms of execution time and the number of distance computations. © 2022, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
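
The filter-then-validate idea in the abstract can be illustrated with a minimal Python sketch. It uses the classic uniform-cost Bag Distance (the larger of the two character-multiset differences), which never exceeds the unit-cost Edit Distance, and not the paper's Anagram Distance, whose non-uniform cost model is not given in this record. It also uses a plain linear scan in place of a metric index. The function names bag_distance, edit_distance, and range_search are hypothetical and chosen only for this example.

```python
from collections import Counter


def bag_distance(x: str, y: str) -> int:
    """Classic uniform-cost Bag Distance: max(|x - y|, |y - x|) over character multisets."""
    cx, cy = Counter(x), Counter(y)
    only_in_x = sum((cx - cy).values())  # characters of x left unmatched by y
    only_in_y = sum((cy - cx).values())  # characters of y left unmatched by x
    return max(only_in_x, only_in_y)


def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def range_search(query: str, strings: list[str], radius: int):
    """Return strings within `radius` edits of `query`, plus the number of
    Edit Distance computations performed (linear scan, assumed for brevity)."""
    results, verified = [], 0
    for s in strings:
        # Filter: the Bag Distance is a lower bound, so exceeding the radius
        # here means the Edit Distance exceeds it too.
        if bag_distance(query, s) > radius:
            continue
        # Validation phase: pay for the Edit Distance only on survivors.
        verified += 1
        if edit_distance(query, s) <= radius:
            results.append(s)
    return results, verified


if __name__ == "__main__":
    words = ["sitting", "mitten", "anagram", "kitchen"]
    print(range_search("kitten", words, radius=2))
```

Because anagrams have a Bag Distance of zero, the filter alone cannot separate them; this is why a validation phase with the Edit Distance is still needed after pruning.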