A Generic Method for Accelerating LSH-Based Similarity Join Processing

Cited by: 15
Authors
Yu, Chenyun [1 ]
Nutanong, Sarana [1 ]
Li, Hangyu [1 ]
Wang, Cong [1 ]
Yuan, Xingliang [1 ]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
Keywords
Similarity join; locality sensitive hashing; query processing; representative selection;
DOI
10.1109/TKDE.2016.2638838
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as a hash join, making the cost of joining two large datasets linear. By judiciously analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way in which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), the Jaccard similarity measure (MinHash), and the Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to perform LSH lookups for at most 15 percent of the query points.
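The hash-join analogy in the abstract can be illustrated with a minimal sketch. It uses standard MinHash banding for Jaccard similarity and performs one LSH lookup per query point, i.e. the baseline the abstract compares against; it does not implement the paper's representative-point selection, whose details are not given here. All function names and parameter values (`make_minhash`, `lsh_join`, `bands`, `rows`, `threshold`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def make_minhash(num_hashes, seed=0):
    """Return a function mapping a non-empty set of ints to a MinHash signature."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large prime modulus for the universal hash family
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]

    def signature(items):
        return tuple(min((a * x + b) % p for x in items) for a, b in params)

    return signature


def lsh_join(left, right, bands=8, rows=2, threshold=0.5, seed=0):
    """Approximate similarity join: build buckets on `right`, probe with `left`."""
    sig = make_minhash(bands * rows, seed)

    # Build phase (like a hash join): one bucket per band of each signature.
    buckets = defaultdict(set)
    for j, s in enumerate(right):
        sg = sig(s)
        for b in range(bands):
            buckets[(b, sg[b * rows:(b + 1) * rows])].add(j)

    # Probe phase: one LSH lookup per query point, then exact verification.
    result = set()
    for i, q in enumerate(left):
        sg = sig(q)
        candidates = set()
        for b in range(bands):
            candidates |= buckets.get((b, sg[b * rows:(b + 1) * rows]), set())
        for j in candidates:
            if jaccard(q, right[j]) >= threshold:
                result.add((i, j))
    return result


left = [{1, 2, 3, 4}, {100, 101}]
right = [{1, 2, 3, 4}, {200, 201}]
pairs = lsh_join(left, right)
```

Identical sets always share every band, so they are always reported; candidate pairs below the threshold are discarded by the exact verification step. Pairs that are similar but not identical may be missed, which is the approximation inherent in LSH.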
Pages: 712 - 726
Page count: 15
Related papers
50 in total
  • [11] An LSH-Based Model-Words-Driven Product Duplicate Detection Method
    Hartveld, Aron
    van Keulen, Max
    Mathol, Diederik
    van Noort, Thomas
    Plaatsman, Thomas
    Frasincar, Flavius
    Schouten, Kim
    ADVANCED INFORMATION SYSTEMS ENGINEERING, CAISE 2018, 2018, 10816 : 409 - 423
  • [12] NetSHa: In-Network Acceleration of LSH-Based Distributed Search
    Zhang, Penghao
    Pan, Heng
    Li, Zhenyu
    Cui, Penglai
    Jia, Ru
    He, Peng
    Zhang, Zhibin
    Tyson, Gareth
    Xie, Gaogang
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2022, 33 (09) : 2213 - 2229
  • [13] Fusion feature for LSH-based image retrieval in a cloud datacenter
    Liao, Jianxin
    Yang, Di
    Li, Tonghong
    Qi, Qi
    Wang, Jingyu
    Sun, Haifeng
    MULTIMEDIA TOOLS AND APPLICATIONS, 2016, 75 (23) : 15405 - 15427
  • [14] An LSH-based Offloading Method for IoMT Services in Integrated Cloud-Edge Environment
    Xu, Xiaolong
    Huang, Qihe
    Zhang, Yiwen
    Li, Shancang
    Qi, Lianyong
    Dou, Wanchun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 16 (03)
  • [15] LSH-Based Large Scale Chinese Calligraphic Character Recognition
    Lin, Yuan
    Wu, Jiangqin
    Gao, Pengcheng
    Xia, Yang
    Mao, Tianjiao
    JCDL'13: PROCEEDINGS OF THE 13TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES, 2013, : 323 - 329
  • [16] Towards a Scalable Set Similarity Join Using MapReduce and LSH
    Rivault, Sebastien
    Bamha, Mostafa
    Limet, Sebastien
    Robert, Sophie
    COMPUTATIONAL SCIENCE - ICCS 2022, PT I, 2022, : 569 - 583
  • [17] Fusion feature for LSH-based image retrieval in a cloud datacenter
    Jianxin Liao
    Di Yang
    Tonghong Li
    Qi Qi
    Jingyu Wang
    Haifeng Sun
    Multimedia Tools and Applications, 2016, 75 : 15405 - 15427
  • [18] Towards Load Balancing for LSH-based Distributed Similarity Indexing in High-dimensional Space
    Shen, Lu
    Wu, Jiagao
    Wang, Yongrong
    Liu, Linfeng
    IEEE 20TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS / IEEE 16TH INTERNATIONAL CONFERENCE ON SMART CITY / IEEE 4TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS), 2018, : 384 - 391
  • [19] A fast and efficient Hamming LSH-based scheme for accurate linkage
    Karapiperis, Dimitrios
    Verykios, Vassilios S.
    KNOWLEDGE AND INFORMATION SYSTEMS, 2016, 49 (03) : 861 - 884
  • [20] A fast and efficient Hamming LSH-based scheme for accurate linkage
    Dimitrios Karapiperis
    Vassilios S. Verykios
    Knowledge and Information Systems, 2016, 49 : 861 - 884