Optical Character Recognition and text cleaning in the indigenous South African languages

Cited by: 1
Authors
Prinsloo, Danie J. [1]
Taljard, Elsabe [1]
Goosen, Michelle [1]
Affiliations
[1] Univ Pretoria, Dept African Languages, Pretoria, South Africa
Funding
National Research Foundation, Singapore
Keywords
text cleaning; Optical Character Recognition (OCR) tools; 'noise' in text-based corpora; scanning errors; text-sourced corpora; granularity of cleanness
DOI
10.5842/64-1-867
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
This article represents follow-up work on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. In this article we provide a comparative description of the cleaning of web-sourced and text-sourced material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term "web-sourced material" to refer to digital data sourced from the internet, whereas "text-based material" refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus that the texts make up. Text corpora that are to be used for, e.g., lexicographic purposes can tolerate a higher level of 'noise' than those used for the compilation of, e.g., spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
Pages: 165-187
Page count: 23