Entity Matching on Unstructured Data: An Active Learning Approach

Cited by: 9
Authors
Brunner, Ursin [1 ]
Stockinger, Kurt [1 ]
Affiliations
[1] ZHAW Zurich Univ Appl Sci, Zurich, Switzerland
Keywords
DOI
10.1109/SDS.2019.00006
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the growing number of data sources in enterprises, entity matching becomes a crucial part of every data integration project. To reduce the human effort involved in identifying matching entities between different database tables, machine learning algorithms are typically applied. Moreover, active learning is often combined with supervised machine learning methods to further reduce the effort of labeling entities as true or false matches. However, while state-of-the-art active learning algorithms have proven to work well on structured data sets, unstructured data still poses a challenge in entity matching. This paper proposes an end-to-end entity matching pipeline to minimize the human labeling effort for entity matching on unstructured data sets. We use several natural language processing techniques such as soft tf-idf to pre-process the record pairs before we classify them using a novel Active Learning with Uncertainty Sampling (ALWUS) algorithm. We designed our algorithm as a plugin system to work with any state-of-the-art classifier such as support vector machines, random forests or deep neural networks. Detailed experimental results demonstrate that our end-to-end entity matching pipeline clearly outperforms comparable entity matching approaches on an unstructured real-world data set. Our approach achieves significantly better F1-scores while requiring 1 to 2 orders of magnitude less human labeling effort than existing state-of-the-art algorithms.
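The abstract names uncertainty sampling with a pluggable classifier but gives no algorithmic detail, so the following is only a minimal sketch of generic uncertainty sampling, not the paper's ALWUS algorithm. Everything in it is an assumption made for illustration: the toy `CentroidClassifier`, the two-score feature vectors standing in for pre-processed record pairs, the simulated `oracle`, and the label `budget`.

```python
import math
import random


class CentroidClassifier:
    """Toy stand-in for any classifier exposing fit() and predict_proba()."""

    def fit(self, X, y):
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        # Mean feature vector per class (needs at least one example of each).
        self.pos = [sum(col) / len(pos) for col in zip(*pos)]
        self.neg = [sum(col) / len(neg) for col in zip(*neg)]
        return self

    def predict_proba(self, x):
        # Closer to the "match" centroid means a probability nearer 1.
        margin = math.dist(x, self.neg) - math.dist(x, self.pos)
        return 1.0 / (1.0 + math.exp(-10.0 * margin))


def uncertainty_sampling(clf, pool, oracle, seed_idx, budget):
    """Label up to `budget` extra pool items, querying the least certain one each round."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(budget):
        idx = list(labeled)
        clf.fit([pool[i] for i in idx], [labeled[i] for i in idx])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Least certain = predicted match probability closest to 0.5.
        query = min(candidates, key=lambda i: abs(clf.predict_proba(pool[i]) - 0.5))
        labeled[query] = oracle(pool[query])
    # Final fit on everything labeled so far.
    idx = list(labeled)
    clf.fit([pool[i] for i in idx], [labeled[i] for i in idx])
    return clf, labeled


# Hypothetical data: each record pair is reduced to two similarity scores in [0, 1].
rng = random.Random(0)
pool = [(0.9, 0.9), (0.1, 0.1)] + [(rng.random(), rng.random()) for _ in range(200)]


def oracle(pair):
    """Simulated human labeler: 1 (match) if the mean similarity exceeds 0.5."""
    return 1 if sum(pair) / 2 > 0.5 else 0


clf, labeled = uncertainty_sampling(
    CentroidClassifier(), pool, oracle, seed_idx=[0, 1], budget=10
)
print(f"labels requested: {len(labeled)} of {len(pool)} pairs")
```

Because the loop only relies on `fit` and `predict_proba`, the toy classifier could in principle be swapped for a support vector machine, random forest, or neural network wrapper with the same two methods, which mirrors the plugin idea described in the abstract.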
Pages: 97-102
Page count: 6
Related papers
50 in total
  • [11] A Supervised Learning Approach To Entity Matching Between Scholarly Big Datasets
    Wu, Jian
    Sefid, Athar
    Ge, Allen C.
    Giles, C. Lee
    K-CAP 2017: PROCEEDINGS OF THE KNOWLEDGE CAPTURE CONFERENCE, 2017,
  • [12] A genetic algorithm based entity resolution approach with active learning
    Sun, Chenchen
    Shen, Derong
    Kou, Yue
    Nie, Tiezheng
    Yu, Ge
    FRONTIERS OF COMPUTER SCIENCE, 2017, 11 (01) : 147 - 159
  • [13] A genetic algorithm based entity resolution approach with active learning
    Chenchen Sun
    Derong Shen
    Yue Kou
    Tiezheng Nie
    Ge Yu
    Frontiers of Computer Science, 2017, 11 : 147 - 159
  • [14] A Variance Based Active Learning Approach for Named Entity Recognition
    Hassanzadeh, Hamed
    Keyvanpour, MohammadReza
    INTELLIGENT COMPUTING AND INFORMATION SCIENCE, PT II, 2011, 135 : 347 - +
  • [15] Address Validation in Transportation and Logistics: A Machine Learning Based Entity Matching Approach
    Guermazi, Yassine
    Sellami, Sana
    Boucelma, Omar
    ECML PKDD 2020 WORKSHOPS, 2020, 1323 : 320 - 334
  • [16] Learning Ontologies for Geographic Entity Matching and Multi-Sources Data Fusion
    Yi, Shanzhen
    2013 21ST INTERNATIONAL CONFERENCE ON GEOINFORMATICS (GEOINFORMATICS), 2013,
  • [17] On Generating Benchmark Data for Entity Matching
    Ioannou, Ekaterini
    Rassadko, Nataliya
    Velegrakis, Yannis
    JOURNAL ON DATA SEMANTICS, 2013, 2 (01) : 37 - 56
  • [18] ALDANER: Active Learning based Data Augmentation for Named Entity Recognition
    Moscato, Vincenzo
    Postiglione, Marco
    Sperli, Giancarlo
    Vignali, Andrea
    KNOWLEDGE-BASED SYSTEMS, 2024, 305
  • [19] A named entity recognition approach for tweet streams using active learning
    Van Cuong Tran
    Dinh Tuyen Hoang
    Ngoc Thanh Nguyen
    Hwang, Dosam
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2017, 32 (02) : 1277 - 1287
  • [20] Combining structured and unstructured data for predictive models: a deep learning approach
    Zhang, Dongdong
    Yin, Changchang
    Zeng, Jucheng
    Yuan, Xiaohui
    Zhang, Ping
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2020, 20 (01)