Optical Character Recognition and text cleaning in the indigenous South African languages

Cited by: 1
Authors
Prinsloo, Danie J. [1]
Taljard, Elsabe [1]
Goosen, Michelle [1]
Affiliations
[1] Univ Pretoria, Dept African Languages, Pretoria, South Africa
Funding
National Research Foundation of Singapore;
Keywords
text cleaning; Optical Character Recognition (OCR) tools; 'noise' in text-based corpora; scanning errors; text-sourced corpora; granularity of cleanness
DOI
10.5842/64-1-867
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303 ; 0501 ; 050102 ;
Abstract
This article represents follow-up work on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. In this article we provide a comparative description of the cleaning of web-sourced and text-sourced material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term "web-sourced material" to refer to digital data sourced from the internet, whereas "text-based material" refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus comprised by the texts. Text corpora which are to be utilized for e.g. lexicographic purposes can tolerate a higher level of 'noise' than those used for the compilation of e.g. spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
Pages: 165-187
Page count: 23
Related papers
50 in total
  • [31] Ghost Character Recognition Theory and Arabic Script Based Languages Character Recognition
    Razzak, Muhammad Imran
    Mirza, Abdulrahman A.
    PRZEGLAD ELEKTROTECHNICZNY, 2011, 87 (11): : 234 - 238
  • [32] INDIGENOUS LANGUAGES OF SOUTH WEST AFRICA
    STANLEY, GE
    ANTHROPOLOGICAL LINGUISTICS, 1968, 10 (03) : 5 - 18
  • [33] LANGUAGE POLICY IMPLEMENTATION IN SOUTH AFRICAN UNIVERSITIES VIS-A-VIS THE SPEAKERS OF INDIGENOUS AFRICAN LANGUAGES' PERCEPTION
    Mutasa, Davie Elias
    PER LINGUAM-A JOURNAL OF LANGUAGE LEARNING, 2015, 31 (01): : 46 - 59
  • [34] From object detection to text detection and recognition: A brief evolution history of optical character recognition
    Wang, Haifeng
    Pan, Changzai
    Guo, Xiao
    Ji, Chunlin
    Deng, Ke
    WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2021, 13 (05)
  • [35] Code-switched automatic speech recognition in five South African languages
    Biswas, Astik
    Yilmaz, Emre
    van der Westhuizen, Ewald
    de Wet, Febe
    Niesler, Thomas
    COMPUTER SPEECH AND LANGUAGE, 2022, 71
  • [36] Assessing Multilinguality of Topic Models on a Short-Text South African Languages Dataset
    Roos, Darren Craig
    Malan, Katherine Mary
    ARTIFICIAL INTELLIGENCE RESEARCH, SACAIR 2024, 2025, 2326 : 38 - 52
  • [37] Autoencoder Image Denoising to Increase Optical Character Recognition Performance in Text Conversion
    Alamsyah, Nur
    Fauzan, Mohamad Nurkamal
    Putrada, Aji Gautama
    Pane, Syafrial Fachri
    2022 INTERNATIONAL CONFERENCE ON ADVANCED CREATIVE NETWORKS AND INTELLIGENT SYSTEMS, ICACNIS, 2022, : 99 - 104
  • [38] Computational modelling of an optical character recognition system for Yoruba printed text images
    Oni, Olalekan Joseph
    Asahiah, Franklin Oladiipo
    SCIENTIFIC AFRICAN, 2020, 9
  • [39] OPTICAL CHARACTER RECOGNITION
    Unknown
    CONTROL, 1967, 11 (103): : 24 - &
  • [40] Optical character recognition program for images of printed text using a neural network
    Ganapathy, Velappa
    Lean, Charles C. H.
    2006 IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL TECHNOLOGY, VOLS 1-6, 2006, : 1174 - +