Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Cited by: 2
Authors
Zosa, Elaine [1]
Mutuvi, Stephen [2, 3]
Granroth-Wilding, Mark [1, 4]
Doucet, Antoine [2]
Affiliations
[1] Univ Helsinki, Helsinki, Finland
[2] Univ La Rochelle, L3i Lab, La Rochelle, France
[3] Multimedia Univ Kenya, Nairobi, Kenya
[4] Silo AI, Helsinki, Finland
Keywords
Topic modelling; Word embeddings; OCR noise
DOI
10.1007/978-3-030-91669-5_30
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that both models, especially ETM, are slightly more resilient to noise than LDA in terms of topic quality and classification accuracy.
Pages: 392-400
Page count: 9