Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

Authors
Khalil, Rana [1]
Shawish, Ahmed [2, 3]
Elzanfaly, Doaa [1, 4]
Affiliations
[1] British Univ Egypt, Fac Informat, Cairo, Egypt
[2] Arab Open Univ, Fac Comp Studies, Kuwait, Kuwait
[3] Ain Shams Univ, Cairo, Egypt
[4] Helwan Univ, Helwan, Egypt
Source
Keywords
Entity resolution; Blocking techniques; Hashing; Canopy clustering; Scalability; Efficiency; Effectiveness; Big-data
DOI
10.1007/978-3-030-01174-1_32
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Entity resolution is a critical process for enabling big data integration. It aims to identify records that refer to the same real-world entity across one or several data sources. Over time, entity resolution has become more problematic and challenging because of the continuous increase in data volume and variety. Blocking techniques have therefore been developed to address these limitations by partitioning datasets into "blocks" of records. This partitioning step allows the blocks to be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient and effective. The effective category includes techniques that target the accuracy and quality of results. The efficient category, on the other hand, includes techniques that are fast but report low accuracy. No technique has yet succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique to fill this gap, achieving high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. Extensive simulation studies conducted on a benchmark dataset demonstrated that both concepts can be combined in one technique while avoiding their drawbacks. The results show outstanding performance in terms of scalability, efficiency, and effectiveness, and promise a new step forward in the entity resolution field.
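The abstract does not spell out the overlapped hashing algorithm, so the following is only a generic illustration of the two ideas it names: hash-based blocking, and overlapping blocks in the spirit of canopy clustering, where one record may belong to several blocks. The function names, the token-prefix key, and the sample records are all assumptions made for this sketch and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(name, prefix_len=3):
    """Return one key per token prefix (hypothetical key scheme).

    A record with several tokens yields several keys, so it can land
    in more than one block; this is what makes the blocks overlap.
    """
    return {token[:prefix_len] for token in name.lower().split()}


def build_blocks(records):
    """Map each blocking key to the ids of the records that produce it."""
    blocks = defaultdict(set)
    for rec_id, name in records.items():
        for key in blocking_keys(name):
            blocks[key].add(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Collect record pairs sharing at least one block.

    Only these pairs would be passed on to the (expensive) matching step.
    """
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records, for illustration only.
records = {
    1: "John Smith",
    2: "Smith Jon",
    3: "Joan Smythe",
    4: "Maria Garcia",
}

blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

With these four records, comparing every pair would take six comparisons, whereas the blocks leave a single candidate pair. Because record 1 sits in both the `joh` and `smi` blocks, a reordered name such as record 2 is still reached, which a single non-overlapping key on the first token would miss.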
Pages: 427-441
Page count: 15