Extending the Bag Distance for String Similarity Search

Author
Mergen S. [1]
Affiliation
[1] Departamento de Linguagens e Sistemas de Computação, Universidade Federal de Santa Maria, Avenida Roraima, Rio Grande do Sul, Santa Maria
Keywords
Bag Distance; Edit Distance; Metric spaces; String similarity
DOI
10.1007/s42979-022-01502-5
Abstract
In the context of string similarity search, the Edit Distance is the preferred choice for indexes based on a metric space. However, the high distances between strings lead to indexes with low pruning factors. In addition, computing the distances is time-consuming. An alternative is the Bag Distance, whose computational cost is lower. In this paper, we propose an extension of the Bag Distance (the Anagram Distance) that allows non-uniform costs. The extension is more compatible with the Edit Distance and its applications. We also transform the index space into one that uses an Anagram Distance as the metric function, leaving the Edit Distance computation to a validation phase. As we describe, the transformation increases the pruning factor of in-memory indexes, especially when the costs are non-uniform. Experiments report the improvements achieved during search, both in terms of execution time and the number of distance computations. © 2022, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
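
The filter-then-validate idea in the abstract can be illustrated with the classic uniform-cost Bag Distance, which is a lower bound on the unit-cost Edit Distance. The sketch below is only an illustration under stated assumptions: it does not implement the paper's Anagram Distance or its non-uniform costs, it uses a plain linear scan instead of a metric index, and the function names (`bag_distance`, `edit_distance`, `range_search`) are hypothetical.

```python
from collections import Counter


def bag_distance(x: str, y: str) -> int:
    """Uniform-cost Bag Distance: compares character multisets, ignoring order."""
    cx, cy = Counter(x), Counter(y)
    only_in_x = sum((cx - cy).values())  # characters left over in x
    only_in_y = sum((cy - cx).values())  # characters left over in y
    return max(only_in_x, only_in_y)


def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        curr = [i]
        for j, b in enumerate(y, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution or match
        prev = curr
    return prev[-1]


def range_search(query: str, words: list[str], radius: int) -> list[str]:
    """Linear-scan stand-in for an index (assumption): cheap filter, then validation."""
    results = []
    for w in words:
        # The Bag Distance never exceeds the Edit Distance, so this prune is safe.
        if bag_distance(query, w) > radius:
            continue
        # Validation phase: only surviving candidates pay for the Edit Distance.
        if edit_distance(query, w) <= radius:
            results.append(w)
    return results
```

For example, `range_search("kitten", ["sitting", "mitten", "banana"], 1)` discards "sitting" and "banana" using the cheap filter alone and computes the Edit Distance only for "mitten".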