Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Cited by: 2
Authors
Zosa, Elaine [1]
Mutuvi, Stephen [2, 3]
Granroth-Wilding, Mark [1, 4]
Doucet, Antoine [2]
Affiliations
[1] Univ Helsinki, Helsinki, Finland
[2] Univ La Rochelle, L3i Lab, La Rochelle, France
[3] Multimedia Univ Kenya, Nairobi, Kenya
[4] Silo AI, Helsinki, Finland
Keywords
Topic modelling; Word embeddings; OCR noise
DOI
10.1007/978-3-030-91669-5_30
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Pages: 392-400
Page count: 9