A Trie Based Set Similarity Query Algorithm

Cited by: 1
Authors
Jia, Lianyin [1 ,2 ]
Tang, Junzhuo [1 ]
Li, Mengjuan [3 ]
Li, Runxin [1 ]
Ding, Jiaman [1 ]
Chen, Yinong [4 ]
Affiliations
[1] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming 650500, Peoples R China
[2] Kunming Univ Sci & Technol, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Peoples R China
[3] Yunnan Normal Univ, Lib, Kunming 650500, Peoples R China
[4] Arizona State Univ, Sch Comp & Augmented Intelligence, Tempe, AZ 85287 USA
Funding
National Natural Science Foundation of China;
Keywords
set similarity query; T-starTrie; FMNodes; TT-SSQ; EFFICIENT;
DOI
10.3390/math11010229
Chinese Library Classification
O1 [Mathematics];
Discipline classification code
0701 ; 070101 ;
Abstract
Set similarity query is a primitive for many applications, such as data integration, data cleaning, and gene sequence alignment. Most existing algorithms are based on inverted indexes; they usually filter unqualified sets one by one and offer little support for duplicated sets, which leads to low efficiency. To solve this problem, this paper designs T-starTrie, an efficient trie-based index for set similarity query. T-starTrie naturally groups sets with the same prefix into one node and can filter all sets corresponding to that node at once, thereby significantly improving the efficiency of candidate generation. The paper shows that the set similarity query problem can be transformed into the problem of detecting matching nodes of the first layer (FMNodes) on T-starTrie, and an efficient FMNode detection algorithm is designed accordingly. Based on this, an efficient set similarity query algorithm, TT-SSQ, is implemented by developing a variety of filtering techniques. Experimental results show that TT-SSQ can be up to 3.10x faster than existing algorithms.
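The idea of grouping sets that share a prefix into one trie node, and discarding every set below a node with a single test, can be illustrated with a minimal sketch. This is not T-starTrie, FMNode detection, or TT-SSQ, whose details the abstract does not give. It is a plain prefix trie with one Jaccard upper-bound filter, and it assumes Jaccard similarity, tokens sorted by their natural order, and the hypothetical names `TrieNode`, `build_trie`, and `similarity_query`.

```python
from bisect import bisect_right


class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.set_ids = []   # ids of all sets that end exactly at this node


def build_trie(sets):
    """Insert each set as a sorted token path; duplicate sets share one node."""
    root = TrieNode()
    for set_id, tokens in enumerate(sets):
        node = root
        for token in sorted(set(tokens)):
            node = node.children.setdefault(token, TrieNode())
        node.set_ids.append(set_id)
    return root


def similarity_query(root, query, threshold):
    """Return ids of sets whose Jaccard similarity to `query` is >= threshold."""
    q_sorted = sorted(set(query))
    q_set = set(q_sorted)
    q_len = len(q_sorted)
    results = []

    def visit(node, depth, matched):
        # The path from the root is exactly the set stored here, so its
        # Jaccard similarity is matched / (|q| + depth - matched).
        union = q_len + depth - matched
        if node.set_ids and union > 0 and matched / union >= threshold:
            results.extend(node.set_ids)
        for token, child in node.children.items():
            d = depth + 1
            m = matched + (token in q_set)
            # Paths are sorted, so only query tokens greater than `token`
            # can still be matched anywhere below `child`.
            remaining = q_len - bisect_right(q_sorted, token)
            upper_bound = (m + remaining) / (q_len + d - m)
            if upper_bound >= threshold:
                visit(child, d, m)
            # Otherwise every set under `child` is filtered in one step.

    visit(root, 0, 0)
    return sorted(results)


sets = [[1, 2, 3], [3, 2, 1], [1, 2, 4], [5, 6], [2, 3]]
root = build_trie(sets)
print(similarity_query(root, [1, 2, 3], 0.6))
```

In this sketch the two duplicate sets end at the same node and are reported together, and the subtree under token 5 is discarded with a single bound check rather than set by set.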
Pages: 13
Related papers
50 in total
  • [21] Similarity of Query Results in Similarity-Based Databases
    Belohlavek, Radim
    Urbanova, Lucie
    Vychodil, Vilem
    ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2011, 6954 : 258 - 267
  • [22] Almost optimal query algorithm for hitting set using a subset query
    Bishnu, Arijit
    Ghosh, Arijit
    Kolay, Sudeshna
    Mishra, Gopinath
    Saurabh, Saket
    JOURNAL OF COMPUTER AND SYSTEM SCIENCES, 2023, 137 : 50 - 65
  • [23] Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints
    Wang, Jiannan
    Feng, Jianhua
    Li, Guoliang
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01): : 1219 - 1230
  • [24] An algorithm for sequence similarity query with optimized multiple filtering
    Dai, Dongbo
    Tang, Chunlei
    Qiu, Boren
    Xiong, Yun
    Zhu, Yangyong
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2010, 47 (10): : 1785 - 1796
  • [25] Similarity-based query caching
    Stuckenschmidt, H
    FLEXIBLE QUERY ANSWERING SYSTEMS, PROCEEDINGS, 2004, 3055 : 295 - 306
  • [26] Similarity-based Link Prediction Algorithm with Fuzzy Set Approach
    Li, Yu-Zeng
    Yu, Xiao-Fei
    Wang, Bai-Xiang
    INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND COMMUNICATION ENGINEERING (CSCE 2015), 2015, : 6 - 9
  • [27] An Efficient Query Scheme for Hybrid Storage Blockchains Based on Merkle Semantic Trie
    Pei, Qingqi
    Zhou, Enyuan
    Xiao, Yang
    Zhang, Deyu
    Zhao, Dongxiao
    2020 INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS (SRDS 2020), 2020, : 51 - 60
  • [28] An Advanced Trie-Based HTTP Parsing Algorithm
    Li, Anqi
    He, Dazhong
    Wang, Huan
    2016 SIXTH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND TECHNOLOGY (ICIST), 2016, : 79 - 83
  • [29] Trie-based algorithm for IP lookup problem
    Yilmaz, PA
    Belenkiy, A
    Uzun, N
    GLOBECOM '00: IEEE GLOBAL TELECOMMUNICATIONS CONFERENCE, VOLS 1- 3, 2000, : 593 - 598
  • [30] A Fuzzy-set based Semantic Similarity Matching Algorithm for Web Service
    Bai, Li
    Liu, Min
    2008 IEEE INTERNATIONAL CONFERENCE ON SERVICES COMPUTING, PROCEEDINGS, VOL 2, 2008, : 529 - +