An in-depth analysis of pre-trained embeddings for entity resolution: A. Zeakis et al.

Cited by: 0
Authors
Alexandros Zeakis
George Papadakis
Dimitrios Skoutas
Manolis Koubarakis
Keywords: Deep learning
DOI: 10.1007/s00778-024-00879-4
Abstract
Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used both for blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being most popular. However, there is no detailed analysis of their strengths and weaknesses. We cover this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models and we further justify their relative performance by looking into the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques in the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out both in schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
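The abstract distinguishes two serialization settings: schema-agnostic, where all attribute values of an entity are concatenated into one representative sentence, and schema-aware, where each attribute's values are handled individually. A minimal sketch of that distinction, using hypothetical helper names and an illustrative record (not code from the paper):

```python
# Hypothetical illustration of the two settings described in the abstract.
# Function and field names are assumptions for this sketch; they do not
# come from the paper's codebase.

def schema_agnostic(entity: dict) -> str:
    """Concatenate all attribute values into one representative sentence,
    which would then be fed to a language model as a single input."""
    return " ".join(str(v) for v in entity.values() if v is not None)

def schema_aware(entity: dict) -> dict:
    """Keep attribute values separate, so each attribute could be
    vectorized and compared on its own."""
    return {attr: str(v) for attr, v in entity.items() if v is not None}

record = {"title": "iPhone 11", "brand": "Apple", "price": 699}
print(schema_agnostic(record))  # "iPhone 11 Apple 699"
print(schema_aware(record))     # {'title': 'iPhone 11', 'brand': 'Apple', 'price': '699'}
```

In the schema-agnostic case a single embedding represents the whole entity; in the schema-aware case similarities can be computed per attribute and then aggregated.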
Related Papers
14 items in total
  • [1] An in-depth analysis of pre-trained embeddings for entity resolution
    Zeakis, Alexandros
    Papadakis, George
    Skoutas, Dimitrios
    Koubarakis, Manolis
    VLDB JOURNAL, 2025, 34 (01)
  • [2] Pre-trained Embeddings for Entity Resolution: An Experimental Analysis
    Zeakis, Alexandros
    Papadakis, George
    Skoutas, Dimitrios
    Koubarakis, Manolis
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (09): 2225 - 2238
  • [3] On the Role of Pre-trained Embeddings in Binary Code Analysis
    Maier, Alwin
    Weissberg, Felix
    Rieck, Konrad
    PROCEEDINGS OF THE 19TH ACM ASIA CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, ACM ASIACCS 2024, 2024, : 795 - 810
  • [4] Sentiment analysis based on improved pre-trained word embeddings
    Rezaeinia, Seyed Mahdi
    Rahmani, Rouhollah
    Ghodsi, Ali
    Veisi, Hadi
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 117 : 139 - 147
  • [5] Entity Resolution Based on Pre-trained Language Models with Two Attentions
    Zhu, Liang
    Liu, Hao
    Song, Xin
    Wei, Yonggang
    Wang, Yu
    WEB AND BIG DATA, PT III, APWEB-WAIM 2023, 2024, 14333 : 433 - 448
  • [6] A Comparative Study of Pre-trained Word Embeddings for Arabic Sentiment Analysis
    Zouidine, Mohamed
    Khalil, Mohammed
    2022 IEEE 46TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE (COMPSAC 2022), 2022, : 1243 - 1248
  • [7] DE-ESD: Dual encoder-based entity synonym discovery using pre-trained contextual embeddings
    Huang, Subin
    Chen, Junjie
    Yu, Chengzhen
    Li, Daoyu
    Zhou, Qing
    Liu, Sanmin
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 276
  • [8] Pre-trained Word Embeddings for Arabic Aspect-Based Sentiment Analysis of Airline Tweets
    Ashi, Mohammed Matuq
    Siddiqui, Muazzam Ahmed
    Nadeem, Farrukh
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ADVANCED INTELLIGENT SYSTEMS AND INFORMATICS 2018, 2019, 845 : 241 - 251
  • [9] An Entity-Level Sentiment Analysis of Financial Text Based on Pre-Trained Language Model
    Huang, Zhihong
    Fang, Zhijian
    2020 IEEE 18TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), VOL 1, 2020, : 391 - 396
  • [10] Evaluating Pre-trained Word Embeddings and Neural Network Architectures for Sentiment Analysis in Spanish Financial Tweets
    Garcia-Diaz, Jose Antonio
    Apolinario-Arzube, Oscar
    Valencia-Garcia, Rafael
    ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 167 - 178