Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Cited by: 2
Authors
Zosa, Elaine [1]
Mutuvi, Stephen [2, 3]
Granroth-Wilding, Mark [1, 4]
Doucet, Antoine [2]
Affiliations
[1] Univ Helsinki, Helsinki, Finland
[2] Univ La Rochelle, L3i Lab, La Rochelle, France
[3] Multimedia Univ Kenya, Nairobi, Kenya
[4] Silo AI, Helsinki, Finland
Keywords
Topic modelling; Word embeddings; OCR noise
DOI
10.1007/978-3-030-91669-5_30
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to that of LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Pages: 392-400
Number of pages: 9