On Lexical Resources for Digitization of Historical Documents

Cited by: 0
Authors
Gotscharek, Annette [1]
Reffle, Ulrich [1]
Ringlstetter, Christoph [1]
Schulz, Klaus U. [1]
Affiliations
[1] Univ Munich, LMU, CIS, Munich, Bavaria, Germany
Keywords
Historical spelling variants; electronic lexica; Information Retrieval;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR recognition of historical documents, which currently does not lead to satisfactory results. Second, even assuming perfect OCR recognition, since historical language differs considerably from modern language, the matching process between queries submitted to search engines and variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. There are no hints about what optimal lexical resources for historical documents should look like, and the real benefit reached by optimized lexical resources is unclear. Both questions are rather complex since the answers depend on the point in history when the documents were created. We present a series of experiments which illuminate these points. For our evaluations we collected a large corpus covering German historical documents from before 1500 to 1950 and constructed various types of dictionaries. We present the coverage reached with each dictionary for ten subperiods of time. Additional experiments illuminate the improvements for OCR accuracy and Information Retrieval that can be reached, again looking at distinct dictionaries and periods of time. For both OCR and IR, our lexical resources lead to substantial improvements.
Pages: 193-200
Page count: 8