Optical Character Recognition and text cleaning in the indigenous South African languages

Cited by: 1
Authors
Prinsloo, Danie J. [1]
Taljard, Elsabe [1]
Goosen, Michelle [1]
Affiliations
[1] Univ Pretoria, Dept African Languages, Pretoria, South Africa
Funding
National Research Foundation of Singapore
Keywords
text cleaning; Optical Character Recognition (OCR) tools; 'noise' in text-based corpora; scanning errors; text-sourced corpora; granularity of cleanness
DOI
10.5842/64-1-867
Chinese Library Classification number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
This article follows up on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. We provide a comparative description of the cleaning of web-sourced and text-sourced material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term "web-sourced material" to refer to digital data sourced from the internet, whereas "text-based material" refers to hard-copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, and then evaluate three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus compiled from the texts. Text corpora that are to be used for lexicographic purposes, for example, can tolerate a higher level of 'noise' than those used for the compilation of spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
Pages: 165-187
Page count: 23