Typewritten OCR Model for Ethiopic Characters

Cited by: 0
Authors
Deneke, Bereket Siraw [1]
Aga, Rosa Tsegaye [1]
Samuel, Mesay [2]
Mulat, Abel [1]
Mulat, Ashenafi [1]
Abebe, Abel [1]
Mekonnen, Rahel [1]
Mulugeta, Hiwot [1]
Debelee, Taye Girma [1, 3]
Gachena, Worku [1]
Affiliations
[1] Ethiopian Artificial Intelligence Inst, Addis Ababa, Ethiopia
[2] Arbaminch Univ, Arbaminch, Ethiopia
[3] Addis Ababa Sci & Technol Univ, Dept Elect & Comp Engn, Addis Ababa, Ethiopia
Source
PAN-AFRICAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PT I, PANAFRICON AI 2023 | 2024 / Vol. 2068
Keywords
CNN; RNN; OCR; Tesseract; Transcribe; RECOGNITION;
DOI
10.1007/978-3-031-57624-9_14
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Optical Character Recognition (OCR) is the electronic conversion of images of computer-written, typewritten, handwritten, or printed text into machine-encoded text, whether from a scanned document or a photo of a document. Until recently, historical, office, and official documents in Ethiopia were produced in handwritten and typewritten form. As a result, a large number of historical and essential documents still exist only in hard copy and are at risk of being lost to disaster. OCR for computer-written and handwritten text has been developed for many scripts, including the Ethiopic characters used by Ethiopian languages, but not for typewritten Ethiopic script. As with handwritten documents, a large number of historical documents in Ethiopia are typewritten, so typewritten OCR is essential for preserving them. This study focuses on building an OCR model for typewritten documents written in Ethiopic characters. For the study, Ethiopic characters were collected from typewritten documents, and 290 distinct characters were segmented and used to construct augmented data. The augmentation produces character variations that simulate the complexities encountered in real-world typewritten Amharic texts and enhance the adaptability of the OCR model. This technique aims to approximate the diversity inherent in the data. The model training framework uses Tesseract, an open-source OCR engine, together with the artificially generated training set. Tesseract's existing Amharic OCR model served as the base model, and fine-tuning followed a layered approach using 45,000 samples over 4,800 iterations. The model was evaluated using character error rate (CER) and achieved a CER of 13% on the test set.
The Tesseract model before fine-tuning and the Google Lens platform were used as baselines to evaluate the model's performance. Our model outperformed both baselines by a margin of more than 10%.
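The abstract reports results as character error rate but does not spell out how it was computed. The sketch below shows the commonly used definition, assuming CER is the Levenshtein (edit) distance between the recognized text and the reference transcription, divided by the reference length in characters. The function names `levenshtein` and `cer` are hypothetical and are not taken from the paper or from Tesseract.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the first i-1 reference
    # characters and the first j hypothesis characters.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate under the assumed definition:
    edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only, not data from the study.
    reference = "ሰላም ዓለም"
    recognized = "ሰላም አለም"
    print(f"CER = {cer(reference, recognized):.3f}")
```

Python strings are indexed by Unicode code point, and each Ethiopic syllable is a single code point, so the distance here counts whole characters rather than bytes. Under this definition a CER of 13% corresponds to roughly 13 character edits per 100 reference characters.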
Pages: 250-261
Page count: 12