Typewritten OCR Model for Ethiopic Characters

Cited by: 0

Authors:

Deneke, Bereket Siraw [1]

Aga, Rosa Tsegaye [1]

Samuel, Mesay [2]

Mulat, Abel [1]

Mulat, Ashenafi [1]

Abebe, Abel [1]

Mekonnen, Rahel [1]

Mulugeta, Hiwot [1]

Debelee, Taye Girma [1,3]

Gachena, Worku [1]

Affiliations:

[1] Ethiopian Artificial Intelligence Inst, Addis Ababa, Ethiopia

[2] Arbaminch Univ, Arbaminch, Ethiopia

[3] Addis Ababa Sci & Technol Univ, Dept Elect & Comp Engn, Addis Ababa, Ethiopia

Source:

PAN-AFRICAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, PT I, PANAFRICON AI 2023 | 2024 / Vol. 2068

Keywords:

CNN; RNN; OCR; Tesseract; Transcribe; RECOGNITION;

DOI:

10.1007/978-3-031-57624-9_14

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Optical Character Recognition (OCR) is the electronic conversion of images of computer-written, typewritten, handwritten, or printed text, taken from a scanned document or a photo of a document, into machine-encoded text. In Ethiopia, historical, office, and official documents were produced in handwritten and typewritten form until recently. Thus, a large number of historical and essential documents still exist only in hard copy and are at risk of being lost in a disaster. OCR for computer-written and handwritten text has been developed for many scripts, including the Ethiopic characters used by Ethiopian languages, but not for typewritten Ethiopic text. As with handwritten documents, many of Ethiopia's historical documents are typewritten. Typewritten OCR is therefore essential to preserve these documents. This study focuses on building an OCR model for typewritten documents written in Ethiopic characters. For the study, Ethiopic characters were collected from typewritten documents, and 290 distinct characters were segmented and used to construct augmented data. The augmentation produces varied character forms that simulate the complexities encountered in real-world typewritten Amharic texts and enhance the adaptability of the OCR model; it aims to approximate the diversity inherent in the data. The model training framework uses Tesseract, an open-source OCR engine, together with the artificially generated training set. Tesseract's existing Amharic OCR model was used as the base model, and fine-tuning was carried out in a layered approach using 45,000 samples over 4,800 iterations. The model was evaluated using character error rate (CER) and achieved a CER of 13% on the test set. The Tesseract model before fine-tuning and the Google Lens platform were used as baselines to evaluate the performance of the model. Our model outperformed both baselines by a margin of more than 10%.
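The abstract reports results as character error rate (CER) but does not say how it was computed. The sketch below assumes the common definition: the character-level edit distance between the recognized text and the reference transcription, divided by the reference length. Under that definition, a CER of 13% means roughly 13 character edits per 100 reference characters. The function names `levenshtein` and `cer` are illustrative and do not come from the paper or from Tesseract.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length.
    Assumed definition; the source does not spell out its formula."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "ሰላም ዓለም"   # 7 characters, including the space
    recognized = "ሰላም ዓለ"   # final character missed by the recognizer
    print(f"CER = {cer(reference, recognized):.3f}")
```

Python strings iterate by Unicode code point, and each Ethiopic syllable character is a single code point, so this counts errors per Ethiopic character. CER can exceed 1.0 when the recognized text is much longer than the reference.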


Pages: 250-261

Page count: 12
