An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Cited by: 0
Authors
Elouataoui, Widad [1]
El Mendili, Saida [1]
El Alaoui, Imane [2]
Gahi, Youssef [1]
Affiliations
[1] Ibn Tofail Univ, Natl Sch Appl Sci, Lab Engn Sci, Kenitra, Morocco
[2] Ibn Tofail Univ, Telecommun Syst & Decis Engn Lab, Kenitra, Morocco
Keywords
Big data deduplication; online continual learning; big data; entity resolution; record linkage; duplicates detection
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
While the benefits of big data are numerous, most collected data is of poor quality and therefore cannot be used effectively as is. One of the leading big data quality challenges is data duplication: gathered big data is usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches have been suggested. However, few efforts have been made to address deduplication in a big data context. Moreover, existing big data deduplication approaches do not handle the degradation of the deduplication model's performance during serving. In addition, most current methods are limited to duplicate detection, which is only one part of the deduplication process. This paper therefore proposes an end-to-end big data deduplication framework based on a semi-supervised learning approach that outperforms existing big data deduplication approaches, with an F-score of 98.21%, a precision of 98.24%, and a recall of 96.48%. The suggested framework encompasses all data deduplication phases: data pre-processing and preparation, automated data labeling, duplicate detection, data cleaning, and auditing and monitoring. This last phase relies on an online continual learning strategy for big data deduplication that addresses the degradation of model performance during serving. The results show that the suggested continual learning strategy increased model accuracy by 1.16%. Furthermore, the proposed framework is applied to three different datasets, and its performance is compared against existing deduplication models. Finally, the results are discussed, conclusions are drawn, and directions for future work are highlighted.
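The abstract reports precision, recall, and F-score for duplicate detection. As a minimal sketch, assuming these follow the standard pairwise definitions (this record does not state how the authors computed them), the Python below scores a set of predicted duplicate pairs against a set of known duplicate pairs. The function name `pair_metrics` and the toy record IDs are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple


def as_pairs(pairs: Iterable[Tuple[int, int]]) -> Set[FrozenSet[int]]:
    """Normalize pairs so that (1, 2) and (2, 1) count as the same pair."""
    return {frozenset(p) for p in pairs}


def pair_metrics(
    predicted: Iterable[Tuple[int, int]],
    actual: Iterable[Tuple[int, int]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over duplicate record pairs.

    Hypothetical helper: assumes the standard definitions
    precision = TP / predicted, recall = TP / actual,
    F = harmonic mean of precision and recall.
    """
    pred, true = as_pairs(predicted), as_pairs(actual)
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(true) if true else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example with made-up record IDs: 3 of 4 predicted pairs are real
# duplicates, and 3 of the 5 real duplicate pairs were found.
predicted_pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
actual_pairs = [(2, 1), (3, 4), (5, 6), (9, 10), (11, 12)]
precision, recall, f_score = pair_metrics(predicted_pairs, actual_pairs)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.4f}")
```

Scoring at the pair level is only one possible choice. Cluster-level or record-level evaluation would give different numbers for the same predictions.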
Pages: 281-291
Number of pages: 11