Recovering Text from Endangered Languages Corrupted PDF documents

Cited by: 0
Author
Stefanovitch, Nicolas [1]
Affiliation
[1] European Commission, Joint Research Centre, Ispra, Italy
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present an approach to efficiently recover text from corrupted documents in endangered languages. Textual resources for such languages are scarce, and sometimes the few available resources are corrupted PDF documents. Endangered languages are not supported by standard tools and present the additional difficulty that no corpus is available on which to train language models to assist with the recovery. The approach presented is able to fully recover born-digital PDF documents with minimal effort, thereby supporting the preservation of endangered languages by extending the range of documents usable for corpus building.
Pages: 78-82
Page count: 5
Related papers
  • [1] Intelligent text extraction from PDF documents
    Hassan, Tamir
    Baumgartner, Robert
    INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE FOR MODELLING, CONTROL & AUTOMATION JOINTLY WITH INTERNATIONAL CONFERENCE ON INTELLIGENT AGENTS, WEB TECHNOLOGIES & INTERNET COMMERCE, VOL 2, PROCEEDINGS, 2006, : 2 - +
  • [2] Extracting Body Text from Academic PDF Documents for Text Mining
    Yu, Changfeng
    Zhang, Cheng
    Wang, Jie
    PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT (KDIR), VOL 1, 2020, : 235 - 242
  • [3] Digitization of Text documents Using PDF/A
    Han, Yan
    Wan, Xueheng
    INFORMATION TECHNOLOGY AND LIBRARIES, 2018, 37 (01) : 52 - 64
  • [4] Automatic Text Classification of PDF Documents using NLP Techniques
    Abdoun, Nabil
    Chami, Mohammad
    INCOSE International Symposium, 2022, 32 (01) : 1320 - 1331
  • [5] Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
    Tiedemann, Jorg
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2014, PT I, 2014, 8403 : 102 - 112
  • [6] The Use of Historical Documents and Sound Recordings for the Study and Safeguarding of Endangered Languages
    de Graaf, Tjeerd
    ENDANGERED LANGUAGES AND HISTORY, 2009, : 27 - 32
  • [7] Lessons from documented endangered languages
    Bradley, David
    LANGUAGE, 2011, 87 (02) : 402 - 406
  • [8] Endangered Turkic Languages from China
    Olmez, Mehmet
    ENDANGERED LANGUAGES OF THE CAUCASUS AND BEYOND, 2017, 15 : 135 - 150
  • [9] Automatic Recovery of Corrupted Font Encoding in PDF Documents Using CNN-Based Symbol Recognition with Language Model
    Vol, Mark
    Krutsko, Andrew
    Stefanovitch, Nicolas
    Postanogov, Denis
    2018 13TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS), 2018, : 121 - 126
  • [10] A Benchmark and Evaluation for Text Extraction from PDF
    Bast, Hannah
    Korzen, Claudius
    2017 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2017), 2017, : 99 - 108