QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Cited by: 6
Authors
Stahlberg, Felix [1]
Vogel, Stephan [1]
Affiliations
[1] HBKU, Qatar Comp Res Inst, Tornado Tower, 18th Floor, Doha, Qatar
Source
PROCEEDINGS OF 12TH IAPR WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2016) | 2016
DOI
10.1109/DAS.2016.81
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
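The abstract reports results as character error rate (CER) but does not define it. The sketch below assumes the common definition: the Levenshtein edit distance between the reference transcription and the OCR output, divided by the reference length. The function names are illustrative and are not part of QATIP. Counting Unicode code points as characters is also an assumption, since Arabic diacritics are separate code points and the paper may count them differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error rate removed by the system."""
    return (baseline - system) / baseline


# One dropped letter in a four-letter reference word.
cer = character_error_rate("كتاب", "كتب")

# The two error rates quoted in the abstract (Tesseract vs. QATIP).
reduction = relative_reduction(51.8, 12.6)
```

Under this definition, the figures quoted in the abstract (51.8% for Tesseract, 12.6% for QATIP) correspond to a relative error reduction of roughly 76%.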
Pages: 168-173
Page count: 6