An in-depth analysis of pre-trained embeddings for entity resolution: A. Zeakis et al.

Cited by: 0
Authors
Alexandros Zeakis
George Papadakis
Dimitrios Skoutas
Manolis Koubarakis
Keywords: Deep learning
DOI: 10.1007/s00778-024-00879-4
Abstract
Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used both for blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being most popular. However, there is no detailed analysis of their strengths and weaknesses. We cover this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models and we further justify their relative performance by looking into the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques in the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out both in schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
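The abstract distinguishes two serialization settings: schema-agnostic, where all attribute values of an entity are concatenated into one representative sentence, and schema-aware, where each attribute's values are handled individually. A minimal sketch of that distinction, using hypothetical helper names and an illustrative record (not code from the paper):

```python
# Hypothetical illustration of the two settings described in the abstract.
# Function and field names are assumptions for this sketch; they do not
# come from the paper's codebase.

def schema_agnostic(entity: dict) -> str:
    """Concatenate all attribute values into one representative sentence,
    which would then be fed to a language model as a single input."""
    return " ".join(str(v) for v in entity.values() if v is not None)

def schema_aware(entity: dict) -> dict:
    """Keep attribute values separate, so each attribute could be
    vectorized and compared on its own."""
    return {attr: str(v) for attr, v in entity.items() if v is not None}

record = {"title": "iPhone 11", "brand": "Apple", "price": 699}
print(schema_agnostic(record))  # "iPhone 11 Apple 699"
print(schema_aware(record))     # {'title': 'iPhone 11', 'brand': 'Apple', 'price': '699'}
```

In the schema-agnostic case a single embedding represents the whole entity; in the schema-aware case similarities can be computed per attribute and then aggregated.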
Related Papers
14 items in total
  • [1] An in-depth analysis of pre-trained embeddings for entity resolution
    Zeakis, Alexandros
    Papadakis, George
    Skoutas, Dimitrios
    Koubarakis, Manolis
    VLDB JOURNAL, 2025, 34 (01)
  • [2] Pre-trained Embeddings for Entity Resolution: An Experimental Analysis
    Zeakis, Alexandros
    Papadakis, George
    Skoutas, Dimitrios
    Koubarakis, Manolis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (09): 2225 - 2238
  • [3] On the Role of Pre-trained Embeddings in Binary Code Analysis
    Maier, Alwin
    Weissberg, Felix
    Rieck, Konrad
    PROCEEDINGS OF THE 19TH ACM ASIA CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, ACM ASIACCS 2024, 2024, : 795 - 810
  • [4] Sentiment analysis based on improved pre-trained word embeddings
    Rezaeinia, Seyed Mahdi
    Rahmani, Rouhollah
    Ghodsi, Ali
    Veisi, Hadi
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 117 : 139 - 147
  • [5] Entity Resolution Based on Pre-trained Language Models with Two Attentions
    Zhu, Liang
    Liu, Hao
    Song, Xin
    Wei, Yonggang
    Wang, Yu
    WEB AND BIG DATA, PT III, APWEB-WAIM 2023, 2024, 14333 : 433 - 448
  • [6] A Comparative Study of Pre-trained Word Embeddings for Arabic Sentiment Analysis
    Zouidine, Mohamed
    Khalil, Mohammed
    2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 1243 - 1248
  • [7] DE-ESD: Dual encoder-based entity synonym discovery using pre-trained contextual embeddings
    Huang, Subin
    Chen, Junjie
    Yu, Chengzhen
    Li, Daoyu
    Zhou, Qing
    Liu, Sanmin
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 276
  • [8] Pre-trained Word Embeddings for Arabic Aspect-Based Sentiment Analysis of Airline Tweets
    Ashi, Mohammed Matuq
    Siddiqui, Muazzam Ahmed
    Nadeem, Farrukh
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2018, 2019, 845 : 241 - 251
  • [9] An Entity-Level Sentiment Analysis of Financial Text Based on Pre-Trained Language Model
    Huang, Zhihong
    Fang, Zhijian
    2020 IEEE 18TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), VOL 1, 2020, : 391 - 396
  • [10] Evaluating Pre-trained Word Embeddings and Neural Network Architectures for Sentiment Analysis in Spanish Financial Tweets
    Garcia-Diaz, Jose Antonio
    Apolinario-Arzube, Oscar
    Valencia-Garcia, Rafael
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 167 - 178