Optical Character Recognition and text cleaning in the indigenous South African languages

Cited by: 1
Authors
Prinsloo, Danie J. [1]
Taljard, Elsabe [1]
Goosen, Michelle [1]
Affiliations
[1] Univ Pretoria, Dept African Languages, Pretoria, South Africa
Funding
National Research Foundation of Singapore
Keywords
text cleaning; Optical Character Recognition (OCR) tools; 'noise' in text-based corpora; scanning errors; text-sourced corpora; granularity of cleanness
DOI
10.5842/64-1-867
Chinese Library Classification number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
This article follows up on unpublished presentations by the authors on text and corpus cleaning strategies for the African languages. We provide a comparative description of the cleaning of web-sourced and text-sourced material to be used for the compilation of corpora, with specific attention to the cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term "web-sourced material" to refer to digital data sourced from the internet, whereas "text-based material" refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus compiled from the texts. Text corpora that are to be used for, e.g., lexicographic purposes can tolerate a higher level of 'noise' than those used for the compilation of, e.g., spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
Pages: 165-187
Number of pages: 23