On Lexical Resources for Digitization of Historical Documents

Cited by: 0
Authors
Gotscharek, Annette [1]
Reffle, Ulrich [1]
Ringlstetter, Christoph [1]
Schulz, Klaus U. [1]
Affiliations
[1] Univ Munich, LMU, CIS, Munich, Bavaria, Germany
Keywords
Historical spelling variants; electronic lexica; information retrieval
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR of historical documents, which currently does not yield satisfactory results. Second, even assuming perfect OCR, historical language differs considerably from modern language, so the matching process between queries submitted to search engines and variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. There are no hints about what optimal lexical resources for historical documents should look like, and the real benefit of optimized lexical resources is unclear. Both questions are rather complex, since the answers depend on the period in which the documents were written. We present a series of experiments that illuminate these points. For our evaluations we collected a large corpus covering German historical documents from before 1500 to 1950 and constructed various types of dictionaries. We present the coverage reached with each dictionary for ten subperiods. Additional experiments illuminate the improvements in OCR accuracy and Information Retrieval that can be reached, again looking at distinct dictionaries and periods. For both OCR and IR, our lexical resources lead to substantial improvements.
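The coverage measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes coverage means the fraction of running tokens in a period's text that appear in a given dictionary; the abstract does not state the authors' exact definition. The function names, the toy lexica, and the two-period corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter


def token_coverage(tokens, lexicon):
    """Fraction of running tokens found in the lexicon (case-insensitive).

    Assumption: coverage is counted over tokens, not over distinct word types.
    """
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in counts.items() if word in lexicon)
    return covered / total


def coverage_by_period(corpus, lexicon):
    """Map each period label to the coverage the lexicon reaches on it."""
    return {period: token_coverage(tokens, lexicon)
            for period, tokens in corpus.items()}


# Hypothetical toy data: a modern lexicon and one extended with old spellings.
modern_lexicon = {"der", "rat", "und", "teil"}
historical_lexicon = modern_lexicon | {"rath", "vnd", "theil"}

corpus = {
    "1500-1550": ["der", "rath", "vnd", "der", "theil"],
    "1900-1950": ["der", "rat", "und", "der", "teil"],
}

modern_cov = coverage_by_period(corpus, modern_lexicon)
historical_cov = coverage_by_period(corpus, historical_lexicon)

for period in corpus:
    print(period, modern_cov[period], historical_cov[period])
```

On this toy data the modern lexicon covers only the two occurrences of "der" in the early period, while the extended lexicon covers every token, which mirrors the kind of per-period comparison the abstract describes.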
Pages: 193-200
Page count: 8