Entity Matching on Unstructured Data: An Active Learning Approach

Cited by: 9
Authors
Brunner, Ursin [1]
Stockinger, Kurt [1]
Affiliations
[1] ZHAW Zurich Univ Appl Sci, Zurich, Switzerland
DOI
10.1109/SDS.2019.00006
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the growing number of data sources in enterprises, entity matching has become a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entities as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline that minimizes the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques, such as soft tf-idf, to pre-process the record pairs before we classify them using a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system that works with any state-of-the-art classifier, such as support vector machines, random forests or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring 1 to 2 orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
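The abstract does not spell out how ALWUS works, so the following is only a minimal sketch of generic uncertainty sampling, not the authors' algorithm. It assumes each record pair has already been reduced to one similarity score in [0, 1] (for example by a soft tf-idf step), and it uses a hypothetical toy classifier, ThresholdModel, as a stand-in for the pluggable classifier. The names pool, oracle and uncertainty_sampling, the score grid and the 0.62 ground-truth cut-off are all illustrative assumptions.

```python
import math


class ThresholdModel:
    """Toy stand-in for a pluggable classifier (hypothetical, not from the paper).

    It works on a single similarity score and assumes matches and
    non-matches can be separated by one threshold.
    """

    def __init__(self, steepness=10.0):
        self.threshold = 0.5
        self.steepness = steepness

    def fit(self, scores, labels):
        non_matches = [s for s, y in zip(scores, labels) if y == 0]
        matches = [s for s, y in zip(scores, labels) if y == 1]
        if non_matches and matches:
            # Place the boundary halfway between the closest opposite labels.
            self.threshold = (max(non_matches) + min(matches)) / 2

    def predict_proba(self, score):
        # Estimated probability that a pair with this score is a match.
        z = self.steepness * (score - self.threshold)
        return 1.0 / (1.0 + math.exp(-z))


def uncertainty_sampling(pool, oracle, model, seed_indices, budget):
    """Spend `budget` label queries on the pairs the model is least sure about."""
    labels = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(budget):
        model.fit([pool[i] for i in labels], [labels[i] for i in labels])
        unlabeled = [i for i in range(len(pool)) if i not in labels]
        if not unlabeled:
            break
        # Least-confident pair: predicted match probability closest to 0.5.
        query = min(unlabeled,
                    key=lambda i: abs(model.predict_proba(pool[i]) - 0.5))
        labels[query] = oracle(pool[query])  # ask the human annotator
    # Refit once more so the model reflects the last answer.
    model.fit([pool[i] for i in labels], [labels[i] for i in labels])
    return labels


# Illustrative data: 21 candidate pairs with evenly spaced similarity scores.
pool = [i / 20 for i in range(21)]


def oracle(score):
    # Simulated annotator; the 0.62 cut-off is an arbitrary assumption.
    return 1 if score >= 0.62 else 0


model = ThresholdModel()
labels = uncertainty_sampling(pool, oracle, model, seed_indices=[0, 20], budget=6)
```

In this toy setting the loop labels 8 of the 21 pairs (2 seeds plus 6 queries) and concentrates the queries around the decision boundary, which is the intuition behind the label savings the abstract reports. A real classifier with a fit/predict_proba style interface could be swapped in for ThresholdModel under the same loop.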
Pages: 97-102
Number of pages: 6
Related papers
50 in total
  • [1] Information Fusion for Entity Matching in Unstructured Data
    Ali, Omar
    Cristianini, Nello
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, 2010, 339 : 162 - 169
  • [2] Deep entity matching with adversarial active learning
    Jiacheng Huang
    Wei Hu
    Zhifeng Bao
    Qijin Chen
    Yuzhong Qu
    The VLDB Journal, 2023, 32 : 229 - 255
  • [3] Deep entity matching with adversarial active learning
    Huang, Jiacheng
    Hu, Wei
    Bao, Zhifeng
    Chen, Qijin
    Qu, Yuzhong
    VLDB JOURNAL, 2023, 32 (01) : 229 - 255
  • [4] Entity Matching by Pool-Based Active Learning
    Han, Youfang
    Li, Chunping
    ELECTRONICS, 2024, 13 (03)
  • [5] Deep Indexed Active Learning for Matching Heterogeneous Entity Representations
    Jain, Arjit
    Sarawagi, Sunita
    Sen, Prithviraj
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2021, 15 (01) : 31 - 45
  • [6] A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
    Meduri, Vamsi
    Popa, Lucian
    Sen, Prithviraj
    Sarwat, Mohamed
    SIGMOD'20: PROCEEDINGS OF THE 2020 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2020 : 1133 - 1147
  • [7] Making Sense of Unstructured Data: An Experiential Learning Approach
    Eybers, Sunet
    Hattingh, Marie J.
    ICT EDUCATION, 2020, 1136 : 181 - 196
  • [8] Active Sampling for Entity Matching with Guarantees
    Bellare, Kedar
    Iyengar, Suresh
    Parameswaran, Aditya
    Rastogi, Vibhor
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2013, 7 (03)
  • [9] Entity Matching with Active Monotone Classification
    Tao, Yufei
    PODS'18: PROCEEDINGS OF THE 37TH ACM SIGMOD-SIGACT-SIGAI SYMPOSIUM ON PRINCIPLES OF DATABASE SYSTEMS, 2018 : 49 - 62
  • [10] Transfer Learning Approach for Learning of Unstructured Data from Structured Data in Medical Domain
    Wankhade, Nishigandha V.
    Potey, Madhuri A.
    2013 2ND INTERNATIONAL CONFERENCE ON INFORMATION MANAGEMENT IN THE KNOWLEDGE ECONOMY (IMKE), 2013 : 86 - 91