A Generic Method for Accelerating LSH-Based Similarity Join Processing

Cited by: 15
Authors
Yu, Chenyun [1]
Nutanong, Sarana [1]
Li, Hangyu [1]
Wang, Cong [1]
Yuan, Xingliang [1]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
Keywords
Similarity join; locality sensitive hashing; query processing; representative selection
DOI
10.1109/TKDE.2016.2638838
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as a hash join, making the cost of joining two large datasets linear. By judiciously analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way in which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), the Jaccard similarity measure (MinHash), and the Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to perform LSH lookups for at most 15 percent of the query points.
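To illustrate the abstract's remark that an LSH similarity join can be processed like a hash join, here is a minimal Python sketch using MinHash with banding for Jaccard similarity. It is not the paper's method: it performs a lookup for every query point, which is the baseline the paper aims to improve on, and it does not attempt representative selection. The function names (`make_hashes`, `minhash`, `lsh_join`), the `bands`/`rows` parameters, and the linear hash family are assumptions made for illustration only.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2^31 - 1, modulus for the illustrative hash family


def make_hashes(k, seed=0):
    """Return k (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash(s, hashes):
    """MinHash signature of a non-empty set of integers."""
    return tuple(min((a * x + b) % PRIME for x in s) for a, b in hashes)


def jaccard(s, t):
    return len(s & t) / len(s | t)


def lsh_join(R, S, threshold, bands=8, rows=2, seed=0):
    """Approximate Jaccard similarity join of two lists of non-empty int sets.

    Build phase: hash every set in S into one table per band.
    Probe phase: look up every set in R, then verify the candidates exactly.
    """
    hashes = make_hashes(bands * rows, seed)

    # Build: one hash table per band, keyed by that band's slice of the signature.
    tables = [defaultdict(list) for _ in range(bands)]
    for j, s in enumerate(S):
        sig = minhash(s, hashes)
        for b in range(bands):
            tables[b][sig[b * rows:(b + 1) * rows]].append(j)

    # Probe: one lookup per band for each query point (the per-point baseline).
    result = set()
    for i, r in enumerate(R):
        sig = minhash(r, hashes)
        candidates = set()
        for b in range(bands):
            candidates.update(tables[b].get(sig[b * rows:(b + 1) * rows], ()))
        # Verify candidates with the exact similarity to discard false positives.
        for j in candidates:
            if jaccard(r, S[j]) >= threshold:
                result.add((i, j))
    return result
```

Identical sets always share a signature, so they are always returned; pairs that are similar but not identical are found only with a probability that depends on `bands` and `rows`, which is why the join is approximate. The probe loop is where a representative-based scheme such as the one described in the abstract would reduce the number of lookups.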
Pages: 712-726
Page count: 15