Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

Cited by: 0
Authors
Khalil, Rana [1]
Shawish, Ahmed [2, 3]
Elzanfaly, Doaa [1, 4]
Affiliations
[1] British Univ Egypt, Fac Informat, Cairo, Egypt
[2] Arab Open Univ, Fac Comp Studies, Kuwait, Kuwait
[3] Ain Shams Univ, Cairo, Egypt
[4] Helwan Univ, Helwan, Egypt
Source
Keywords
Entity resolution; Blocking techniques; Hashing; Canopy clustering; Scalability; Efficiency; Effectiveness; Big-data
DOI
10.1007/978-3-030-01174-1_32
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Entity resolution is a critical process for enabling big-data integration. It aims to identify records that refer to the same real-world entity across one or several data sources. Over time, entity resolution has become more problematic and challenging because of the continuous growth in data volume and variety. Blocking techniques were therefore developed to address these limitations by partitioning datasets into "blocks" of records. This partitioning step allows the blocks to be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient or effective. The effective category includes techniques that target the accuracy and quality of the results. The efficient category, on the other hand, includes techniques that are fast yet report low accuracy. No existing technique has succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique that fills this gap, achieving high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. Extensive simulation studies conducted on a benchmark dataset demonstrated that both concepts can be combined in one technique while avoiding their drawbacks. The results show outstanding performance in terms of scalability, efficiency, and effectiveness, and promise a step forward in the entity resolution field.
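The general idea of blocking described in the abstract can be illustrated with a minimal sketch. This is not the paper's overlapped hashing algorithm, whose details are not given here; it only shows key-based blocking in which a record may fall into more than one block, so that blocks overlap. The key function `blocking_keys` (the first three letters of each name token), the helper names, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(name):
    """Hypothetical key function: lowercase 3-letter prefix of each token."""
    return {token[:3].lower() for token in name.split() if token}


def build_blocks(records):
    """Map each key to the ids of the records that produce it.

    A record with several tokens lands in several blocks, so blocks overlap.
    """
    blocks = defaultdict(set)
    for rec_id, name in records.items():
        for key in blocking_keys(name):
            blocks[key].add(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Collect the record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records (illustrative only, not from the paper's benchmark dataset).
records = {
    1: "John Smith",
    2: "Smith John",
    3: "Jon Smyth",
    4: "Mary Jones",
}
blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

On these four toy records, only two candidate pairs are compared instead of all six possible pairs. Each block could also be handed to a separate worker, which is the parallelism the abstract refers to. A real matcher would still have to score each candidate pair, and the choice of key function determines which true matches are kept or lost.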
Pages: 427-441
Number of pages: 15