QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Cited by: 6
Authors
Stahlberg, Felix [1]
Vogel, Stephan [1]
Affiliation
[1] HBKU, Qatar Comp Res Inst, Tornado Tower,18th Floor, Doha, Qatar
Source
PROCEEDINGS OF 12TH IAPR WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2016) | 2016
DOI
10.1109/DAS.2016.81
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Commercial optical character recognition (OCR) software now achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of the Arabic heritage collections in libraries is more challenging, consisting of, e.g., typewritten documents, early prints, and historical manuscripts. In this paper, we present QATIP, our end-user-oriented system for OCR on such documents. Recognition is based on the Kaldi toolkit and sophisticated text image normalization. The paper makes two main contributions. First, we describe the QATIP interface for libraries, which consists of a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches to language modelling and ligature modelling for continuous Arabic OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements, e.g., a 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
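The abstract reports results as character error rate (CER) but does not say how it was computed. The sketch below assumes the conventional definition: the Levenshtein edit distance between the recognized text and the reference transcription, divided by the reference length. The function names are illustrative and are not taken from QATIP, and any text normalization applied before scoring in the paper is not reproduced here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp; we roll it forward row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a CER of 12.6% corresponds to roughly one character edit per eight reference characters. Python strings are compared per Unicode code point here, so Arabic diacritics and presentation forms count as separate characters unless the text is normalized first.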
Pages: 168-173
Number of pages: 6