On Lexical Resources for Digitization of Historical Documents

Cited by: 0
Authors
Gotscharek, Annette [1]
Reffle, Ulrich [1]
Ringlstetter, Christoph [1]
Schulz, Klaus U. [1]
Affiliations
[1] Univ Munich, LMU, CIS, Munich, Bavaria, Germany
Keywords
Historical spelling variants; electronic lexica; Information Retrieval
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR of historical documents, which currently does not yield satisfactory results. Second, even assuming perfect OCR, historical language differs considerably from modern language, so the matching process between queries submitted to search engines and variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. There are no hints about what optimal lexical resources for historical documents should look like, and the real benefit gained from optimized lexical resources is unclear. Both questions are rather complex, since the answers depend on the point in history at which the documents were written. We present a series of experiments that illuminate these points. For our evaluations we collected a large corpus covering German historical documents from before 1500 to 1950 and constructed various types of dictionaries. We present the coverage reached with each dictionary for ten subperiods of time. Additional experiments illuminate the improvements in OCR accuracy and Information Retrieval that can be reached, again looking at distinct dictionaries and periods of time. For both OCR and IR, our lexical resources lead to substantial improvements.
Pages: 193-200
Number of pages: 8