Typewritten OCR Model for Ethiopic Characters

Authors
Deneke, Bereket Siraw [1]
Aga, Rosa Tsegaye [1]
Samuel, Mesay [2]
Mulat, Abel [1]
Mulat, Ashenafi [1]
Abebe, Abel [1]
Mekonnen, Rahel [1]
Mulugeta, Hiwot [1]
Debelee, Taye Girma [1, 3]
Gachena, Worku [1]
Affiliations
[1] Ethiopian Artificial Intelligence Inst, Addis Ababa, Ethiopia
[2] Arbaminch Univ, Arbaminch, Ethiopia
[3] Addis Ababa Sci & Technol Univ, Dept Elect & Comp Engn, Addis Ababa, Ethiopia
Source
PAN-AFRICAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PT I, PANAFRICON AI 2023 | 2024, Vol. 2068
Keywords
CNN; RNN; OCR; Tesseract; Transcribe; Recognition
DOI
10.1007/978-3-031-57624-9_14
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Optical Character Recognition (OCR) is the electronic conversion of images of computer-written, typewritten, handwritten, or printed text, taken from a scanned document or a photo of a document, into machine-encoded text. Until recently, historical, office, and official documents in Ethiopia were produced in handwritten and typewritten form. As a result, a large number of historical and essential documents still exist only as hard copies and are at risk of being lost in a disaster. OCR systems for computer-written and handwritten text have been developed for many scripts, including the Ethiopic characters used by Ethiopian languages, but not for typewritten Ethiopic script. As with handwritten documents, a large number of historical documents in Ethiopia are typewritten, so typewritten OCR is necessary to preserve them. This study focuses on building an OCR model for typewritten documents written in Ethiopic characters. For the study, Ethiopic characters were collected from typewritten documents, and 290 distinct characters were segmented and used to construct augmented data. The augmented data forms various character variations, simulates the complexities encountered in real-world typewritten Amharic texts, and is intended to enhance the adaptability of the OCR model. This technique aims to approximate the diversity inherent in the data. The model training framework combines Tesseract, an open-source OCR engine, with the artificially generated training set. Tesseract's existing Amharic OCR model was used as the base model, and fine-tuning was carried out in a layered approach using 45,000 samples over 4,800 iterations. The model was evaluated using character error rate (CER) and achieved a CER of 13% on the test set. The Tesseract model before fine-tuning and the Google Lens platform were used as baselines for evaluating the model's performance. The fine-tuned model outperformed both baselines by a margin of more than 10%.
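The abstract reports results as character error rate (CER) but does not say how it was computed. A common definition is the Levenshtein edit distance between the recognized text and the reference transcription, divided by the number of characters in the reference. The following is a minimal sketch under that assumption; the function names are hypothetical and this is not the authors' evaluation code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from i reference characters to an empty hypothesis
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete ref_char
                curr[j - 1] + 1,                  # insert hyp_char
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character out of three Ethiopic characters.
    print(character_error_rate("ሰላም", "ሰላን"))
```

Python strings are sequences of Unicode code points, and each Ethiopic syllable character is a single code point, so the count here is per character rather than per byte. Under this definition, a CER of 13% corresponds to roughly 13 character edits per 100 reference characters.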
Pages: 250-261
Number of pages: 12