Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

Citations: 0
Authors
Khalil, Rana [1]
Shawish, Ahmed [2, 3]
Elzanfaly, Doaa [1, 4]
Affiliations
[1] British Univ Egypt, Fac Informat, Cairo, Egypt
[2] Arab Open Univ, Fac Comp Studies, Kuwait, Kuwait
[3] Ain Shams Univ, Cairo, Egypt
[4] Helwan Univ, Helwan, Egypt
Source
Keywords
Entity resolution; Blocking techniques; Hashing; Canopy clustering; Scalability; Efficiency; Effectiveness; Big-data;
DOI
10.1007/978-3-030-01174-1_32
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Entity resolution is a critical process for enabling big data integration. It aims to identify records that refer to the same real-world entity across one or several data sources. Over time, entity resolution has become more problematic and challenging because of the continuous increase in data volume and variety. Blocking techniques have therefore been developed to address these limitations by partitioning datasets into "blocks" of records. This partitioning step allows the blocks to be processed in parallel, with entity resolution methods applied within each block individually. Current blocking techniques fall into two main categories: efficient or effective. The effective category includes techniques that target the accuracy and quality of results. The efficient category, on the other hand, includes techniques that are fast yet report low accuracy. No technique has so far succeeded in combining efficiency and effectiveness, which has become a crucial requirement, especially with the evolution of the big-data area. This paper introduces a novel technique that fills this gap, achieving high efficiency at no cost to effectiveness by combining the core idea of canopy clustering with the hashing blocking technique. It is worth mentioning that canopy clustering is classified as the most efficient blocking technique, while hashing is classified as the most effective one. The proposed technique is named overlapped hashing. Extensive simulation studies conducted on a benchmark dataset demonstrated the ability to combine both concepts in one technique while avoiding their drawbacks. The results report outstanding performance in terms of scalability, efficiency, and effectiveness, and promise a step forward in the entity resolution field.
Pages: 427 - 441
Page count: 15
Related papers
50 in total
  • [31] Life Cycle Assessment of Building Energy in Big-data Era: Theory and Framework
    Yuan, Yan
    Jin, Zhonghua
    2015 INTERNATIONAL CONFERENCE ON NETWORK AND INFORMATION SYSTEMS FOR COMPUTERS (ICNISC), 2015, : 601 - 605
  • [32] Knowledge Entity Extraction and Text Mining in the Era of Big Data
    Zhang, Chengzhi
    Mayr, Philipp
    Lu, Wei
    Zhang, Yi
    Data and Information Management, 2021, 5 (03): : 309 - 311
  • [33] Incremental Blocking for Entity Resolution over Web Streaming Data
    Araujo, Tiago Brasileiro
    Stefanidis, Kostas
    Santos Pires, Carlos Eduardo
    Nummenmaa, Jyrki
    da Nobrega, Thiago Pereira
    2019 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2019), 2019, : 332 - 336
  • [34] A Type-Based Blocking Technique for Efficient Entity Resolution over Large-Scale Data
    Zhu, Hui-Juan
    Zhu, Zheng-Wei
    Jiang, Tong-Hai
    Cheng, Li
    Shi, Wei-Lei
    Zhou, Xi
    Zhao, Fan
    Ma, Bo
    JOURNAL OF SENSORS, 2018, 2018
  • [35] Opportunities and challenges of clinical research in the big-data era: from RCT to BCT
    Wang, Stephen D.
    JOURNAL OF THORACIC DISEASE, 2013, 5 (06) : 721 - 723
  • [36] Stochastic matrix-function estimators: Scalable Big-Data kernels with high performance
    Staar, Peter W. J.
    Barkoutsos, Panagiotis Kl
    Istrate, Roxana
    Malossi, A. Cristiano I.
    Tavernelli, Ivano
    Moll, Nikolaj
    Giefers, Heiner
    Hagleitner, Christoph
    Bekas, Costas
    Curioni, Alessandro
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016, : 812 - 821
  • [37] Unsupervised learning blocking keys technique for indexing Arabic entity resolution
    Alian, Marwah
    Awajan, Arafat
    Ramadan, Bandan
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2019, 22 (03) : 621 - 628
  • [38] Unsupervised learning blocking keys technique for indexing Arabic entity resolution
    Marwah Alian
    Arafat Awajan
    Bandan Ramadan
    International Journal of Speech Technology, 2019, 22 : 621 - 628
  • [39] Androgen Deprivation Therapy and Dementia: New Opportunities and Challenges in the Big-Data Era
    Nead, Kevin T.
    JOURNAL OF CLINICAL ONCOLOGY, 2017, 35 (30) : 3380 - +
  • [40] Smart Service: On the Innovation and Development of Modern Libraries' Services in the Big-data Era
    Zhou Yan
    Zhang Bin
    PROCEEDINGS OF THE 23RD INTERNATIONAL BUSINESS ANNUAL CONFERENCE (2016), BKS ONE AND TWO, 2016, : 134 - 138