Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

Authors
Khalil, Rana [1]
Shawish, Ahmed [2, 3]
Elzanfaly, Doaa [1, 4]
Affiliations
[1] British Univ Egypt, Fac Informat, Cairo, Egypt
[2] Arab Open Univ, Fac Comp Studies, Kuwait, Kuwait
[3] Ain Shams Univ, Cairo, Egypt
[4] Helwan Univ, Helwan, Egypt
Keywords
Entity resolution; Blocking techniques; Hashing; Canopy clustering; Scalability; Efficiency; Effectiveness; Big-data;
DOI
10.1007/978-3-030-01174-1_32
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Entity resolution is a critical process for enabling big data integration. It aims to identify records that refer to the same real-world entity within one or across several data sources. Over time, entity resolution has become more challenging because of continuous increases in data volume and variety. Blocking techniques were therefore developed to address these limitations by partitioning datasets into "blocks" of records, which can be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient or effective. The effective category comprises techniques that target the accuracy and quality of results. The efficient category comprises techniques that are fast but report low accuracy. No existing technique has succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique, named overlapped hashing, that fills this gap: it aims for high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. Notably, canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. Extensive simulation studies conducted on a benchmark dataset demonstrated that both concepts can be combined in one technique while avoiding their drawbacks. The results report outstanding performance in terms of scalability, efficiency and effectiveness, and promise a step forward in the entity resolution field.
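The abstract does not spell out the overlapped hashing algorithm, so the following is only a minimal sketch of the general idea it names: hash-based blocking in which a record may fall into more than one block, loosely echoing the overlapping canopies of canopy clustering. It is not the authors' method. The key function `blocking_keys`, the `prefix_len` parameter, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(name, prefix_len=3):
    """Hypothetical key function: one key per token (its lowercase prefix)."""
    return {token[:prefix_len] for token in name.lower().split()}


def build_blocks(records, prefix_len=3):
    """Hash each record into one block per key, so blocks may overlap."""
    blocks = defaultdict(set)
    for record_id, name in records.items():
        for key in blocking_keys(name, prefix_len):
            blocks[key].add(record_id)
    return blocks


def candidate_pairs(blocks):
    """Collect pairs that share at least one block; only these get compared."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Illustrative records only (not taken from the paper's benchmark dataset).
records = {
    1: "John Smith",
    2: "Smith John",
    3: "Jon Smyth",
    4: "Mary Jones",
}
blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
```

With these four sample records, comparison is limited to the pairs that share a block rather than all six possible pairs. Each block could also be processed independently, which is the property the abstract relies on for parallel execution.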
Pages: 427-441
Page count: 15