Recovering Text from Endangered Languages Corrupted PDF documents

Cited by: 0

Author:

Stefanovitch, Nicolas [1]

Affiliation:

[1] European Commission, Joint Research Centre, Ispra, Italy

Source:

PROCEEDINGS OF THE FIFTH WORKSHOP ON THE USE OF COMPUTATIONAL METHODS IN THE STUDY OF ENDANGERED LANGUAGES (COMPUTEL-5 2022) | 2022

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper we present an approach to efficiently recover text from corrupted documents in endangered languages. Textual resources for such languages are scarce, and sometimes the few available resources are corrupted PDF documents. Endangered languages are not supported by standard tools, and they present the additional difficulty of having no available corpus on which to train language models to assist with the recovery. The approach presented is able to fully recover born-digital PDF documents with minimal effort, thereby helping the preservation effort for endangered languages by extending the range of documents usable for corpus building.


Pages: 78-82

Page count: 5
