QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries

Cited by: 6
Authors
Stahlberg, Felix [1]
Vogel, Stephan [1]
Affiliations
[1] HBKU, Qatar Comp Res Inst, Tornado Tower, 18th Floor, Doha, Qatar
Source
PROCEEDINGS OF 12TH IAPR WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2016) | 2016
DOI
10.1109/DAS.2016.81
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Commercial optical character recognition (OCR) software now achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of the Arabic heritage collections in libraries is more challenging, consisting for example of typewritten documents, early prints, and historical manuscripts. In this paper, we present QATIP, our end-user-oriented system for OCR on such documents. Recognition is based on the Kaldi toolkit and sophisticated text image normalization. The paper makes two main contributions. First, we describe the QATIP interface for libraries, which consists of a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we propose novel approaches to language modelling and ligature modelling for continuous Arabic OCR. We test QATIP on an early print and a historical manuscript and report substantial improvements: for example, a character error rate of 12.6% with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).
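The abstract reports results as character error rate (CER) but does not say how it was computed. As a minimal sketch, assuming the common definition (Levenshtein edit distance between reference and recognized text, divided by the reference length, counted over Unicode code points), the metric and the relative improvement implied by the two reported figures could be calculated as follows. The function names are hypothetical, and the handling of ligatures, diacritics, and whitespace in the paper's evaluation is not known from this record.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed CER definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Figures taken from the abstract: 51.8% (Tesseract) vs 12.6% (QATIP).
    print(f"{relative_reduction(51.8, 12.6):.1%}")
```

Under this assumed definition, the reported drop from 51.8% to 12.6% corresponds to removing roughly three quarters of the baseline's character errors.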
Pages: 168-173
Page count: 6