Typewritten OCR Model for Ethiopic Characters

Authors
Deneke, Bereket Siraw [1]
Aga, Rosa Tsegaye [1]
Samuel, Mesay [2]
Mulat, Abel [1]
Mulat, Ashenafi [1]
Abebe, Abel [1]
Mekonnen, Rahel [1]
Mulugeta, Hiwot [1]
Debelee, Taye Girma [1, 3]
Gachena, Worku [1]
Affiliations
[1] Ethiopian Artificial Intelligence Inst, Addis Ababa, Ethiopia
[2] Arbaminch Univ, Arbaminch, Ethiopia
[3] Addis Ababa Sci & Technol Univ, Dept Elect & Comp Engn, Addis Ababa, Ethiopia
Keywords
CNN; RNN; OCR; Tesseract; Transcribe; Recognition
DOI
10.1007/978-3-031-57624-9_14
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Optical character recognition (OCR) is the electronic conversion of images of computer-written, typewritten, handwritten, or printed text, taken from a scanned document or a photo of a document, into machine-encoded text. In Ethiopia, historical, office, and official documents were produced in handwritten and typewritten form until recently. As a result, a large number of historical and essential documents still exist only as hard copies and are at risk of being lost in a disaster. OCR for computer-written and handwritten text has been developed for many scripts, including the Ethiopic script used by Ethiopian languages, but not for typewritten Ethiopic text. As with handwritten documents, many historical documents in Ethiopia are typewritten, so typewritten OCR is necessary to preserve them. This study builds an OCR model for typewritten documents written in Ethiopic characters. Ethiopic characters were collected from typewritten documents, and 290 distinct characters were segmented and used to construct augmented data. The augmentation produces character variations that simulate the complexities of real-world typewritten Amharic text, approximating the diversity inherent in the data and improving the adaptability of the OCR model. The training framework uses Tesseract, an open-source OCR engine, together with the artificially generated training set. Tesseract's existing Amharic OCR model served as the base model and was fine-tuned in a layered approach using 45,000 samples over 4,800 iterations. The model was evaluated using character error rate (CER) and achieved a CER of 13% on the test set. Tesseract before fine-tuning and the Google Lens platform were used as baselines, and the model outperformed both by a margin of more than 10%.
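The abstract reports results as character error rate (CER) but does not give the formula used. A minimal sketch follows, assuming the common definition: the Levenshtein (edit) distance between the reference transcription and the OCR output, divided by the number of characters in the reference. The function names are hypothetical and are not taken from the paper; the authors' exact evaluation script may differ (for example in text normalization).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Each Ethiopic syllable is a single Unicode code point, so
    # Python's len() counts it as one character.
    reference = "ሰላም"
    ocr_output = "ሰላ"
    print(character_error_rate(reference, ocr_output))
```

Under this definition, a CER of 13% means roughly 13 character edits per 100 reference characters. CER can exceed 100% when the OCR output contains many spurious insertions.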
Pages: 250-261
Page count: 12