A Generic Method for Accelerating LSH-Based Similarity Join Processing

Cited by: 15
Authors
Yu, Chenyun [1]
Nutanong, Sarana [1]
Li, Hangyu [1]
Wang, Cong [1]
Yuan, Xingliang [1]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
Keywords
Similarity join; locality sensitive hashing; query processing; representative selection
DOI
10.1109/TKDE.2016.2638838
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as a hash join, making the cost of joining two large datasets linear. By judiciously analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way in which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations while retaining the same result accuracy as executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), the Jaccard similarity measure (MinHash), and the Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to perform LSH lookups for at most 15 percent of the query points.
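The abstract's two central ideas, processing a similarity join like a hash join and issuing LSH lookups only for representative points, can be illustrated with a small Python sketch. This is not the paper's algorithm: the abstract does not describe how representatives are selected, so the sketch assumes the simplest possible rule (query points with identical MinHash signatures share one representative). All function names and parameters (make_minhash, band_keys, num_hashes, bands) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def make_minhash(num_hashes, seed=0):
    """Return a function mapping a non-empty set of non-negative ints
    to a MinHash signature (a tuple of num_hashes minima)."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # a Mersenne prime used as the hash modulus
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]

    def signature(items):
        return tuple(min((a * x + b) % prime for x in items)
                     for a, b in params)
    return signature

def band_keys(sig, bands):
    """Split a signature into `bands` equal slices; each slice is one bucket key."""
    rows = len(sig) // bands
    return [(i, sig[i * rows:(i + 1) * rows]) for i in range(bands)]

def build_index(dataset, signature, bands):
    """Build side of the hash join: bucket key -> set of record ids."""
    index = defaultdict(set)
    for rid, items in dataset.items():
        for key in band_keys(signature(items), bands):
            index[key].add(rid)
    return index

def lsh_join(queries, index, signature, bands):
    """Probe side: one set of LSH lookups for every query point."""
    pairs, lookups = set(), 0
    for qid, items in queries.items():
        for key in band_keys(signature(items), bands):
            lookups += 1
            pairs.update((qid, rid) for rid in index.get(key, ()))
    return pairs, lookups

def lsh_join_with_representatives(queries, index, signature, bands):
    """Probe side with lookups issued only for one representative per group.
    ASSUMPTION: groups are query points with identical signatures; the
    paper's actual representative selection is not given in the abstract."""
    groups = defaultdict(list)
    for qid, items in queries.items():
        groups[signature(items)].append(qid)
    pairs, lookups = set(), 0
    for sig, members in groups.items():
        candidates = set()
        for key in band_keys(sig, bands):
            lookups += 1
            candidates |= index.get(key, set())
        # Every member of the group reuses the representative's candidates.
        pairs.update((qid, rid) for qid in members for rid in candidates)
    return pairs, lookups

# Toy data: two indexed records and three query points, two of them identical.
signature = make_minhash(num_hashes=8, seed=42)
records = {0: {1, 2, 3}, 1: {10, 11, 12}}
queries = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {10, 11, 12}}
index = build_index(records, signature, bands=4)
full_pairs, full_lookups = lsh_join(queries, index, signature, bands=4)
rep_pairs, rep_lookups = lsh_join_with_representatives(
    queries, index, signature, bands=4)
```

Under this grouping rule both joins return the same candidate pairs, because members of a group would have probed exactly the same buckets as their representative. The saving in lookups comes only from duplicated signatures here; a less trivial selection rule, such as the one the paper proposes, would be needed to reach the reductions reported in the abstract.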
Pages: 712-726
Number of pages: 15
Related papers
50 in total
  • [1] A Generic Method for Accelerating LSH-based Similarity Join Processing (Extended abstract)
    Yu, Chenyun
    Nutanong, Sarana
    Li, Hangyu
    Wang, Cong
    Yuan, Xingliang
    2017 IEEE 33RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2017), 2017, : 29 - 30
  • [2] A fast LSH-based similarity search method for multivariate time series
    Yu, Chenyun
    Luo, Lintong
    Chan, Leanne Lai-Hang
    Rakthanmanon, Thanawin
    Nutanong, Sarana
    INFORMATION SCIENCES, 2019, 476 : 337 - 356
  • [3] Accelerating LSH-based Distributed Search with In-network Computation
    Zhang, Penghao
    Pan, Heng
    Li, Zhenyu
    He, Peng
    Zhang, Zhibin
    Tyson, Gareth
    Xie, Gaogang
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS (IEEE INFOCOM 2021), 2021,
  • [4] A Scalable Similarity Join Algorithm Based on MapReduce and LSH
    Rivault, Sebastien
    Bamha, Mostafa
    Limet, Sebastien
    Robert, Sophie
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2022, 50 (3-4) : 360 - 380
  • [5] A Scalable Similarity Join Algorithm Based on MapReduce and LSH
    Sébastien Rivault
    Mostafa Bamha
    Sébastien Limet
    Sophie Robert
    International Journal of Parallel Programming, 2022, 50 : 360 - 380
  • [6] LSH-Based Graph Partitioning Algorithm
    Zhang, Weidong
    Zhang, Mingyue
    ARTIFICIAL INTELLIGENCE (ICAI 2018), 2018, 888 : 55 - 68
  • [7] LSH-based Collaborative Recommendation Method with Privacy-Preservation
    Xu, Jiangmin
    Li, Xuansong
    Wang, Hao
    Dai, Hong-Ning
    Meng, Shunmei
    2020 IEEE 13TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING (CLOUD 2020), 2020, : 566 - 573
  • [8] LSH-based distributed similarity indexing with load balancing in high-dimensional space
    Jiagao Wu
    Lu Shen
    Linfeng Liu
    The Journal of Supercomputing, 2020, 76 : 636 - 665
  • [9] LSH-based distributed similarity indexing with load balancing in high-dimensional space
    Wu, Jiagao
    Shen, Lu
    Liu, Linfeng
    JOURNAL OF SUPERCOMPUTING, 2020, 76 (01): : 636 - 665
  • [10] An LSH-based k-representatives clustering method for large categorical data
    Mau, Toan Nguyen
    Huynh, Van-Nam
    NEUROCOMPUTING, 2021, 463 : 29 - 44