Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise

Cited by: 2
Authors
Zosa, Elaine [1]
Mutuvi, Stephen [2, 3]
Granroth-Wilding, Mark [1, 4]
Doucet, Antoine [2]
Affiliations
[1] Univ Helsinki, Helsinki, Finland
[2] Univ La Rochelle, L3i Lab, La Rochelle, France
[3] Multimedia Univ Kenya, Nairobi, Kenya
[4] Silo AI, Helsinki, Finland
Keywords
Topic modelling; Word embeddings; OCR noise
DOI
10.1007/978-3-030-91669-5_30
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools for analysing digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper, we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Pages: 392-400
Number of pages: 9