Multi-font printed Mongolian document recognition system

Cited by: 22
Authors
Peng, Liangrui [1, 2, 3]
Liu, Changsong [1, 2, 3]
Ding, Xiaoqing [1, 2, 3]
Jin, Jianming [4]
Wu, Youshou [1, 2, 3]
Wang, Hua [1, 2, 3]
Bao, Yanhua [5]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China
[3] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China
[4] HP Labs China, Beijing 100084, Peoples R China
[5] Hulunbeier Coll, Mongolian Dept, Hailar 021008, Inner Mongolia, Peoples R China
Keywords
Multi-font Mongolian; Character recognition; Character segmentation; Mixed script
DOI
10.1007/s10032-009-0106-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents still remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian Optical Character Recognition challenging. As the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on the test samples from real documents with multiple font types and mixed script. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
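The abstract states that segmentation points are found by analyzing projection profiles and connected components. As a rough illustration of the projection-profile part only, the sketch below counts ink pixels per row of a binarized vertical text column and proposes a cut in the middle of each run of rows where only a thin connecting stroke remains. This is not the authors' algorithm: the function names, the `stem_width` threshold, and the toy image are assumptions made for illustration, and the connected-component analysis and per-font-group parameter tuning described in the abstract are not modeled.

```python
def row_profile(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def candidate_cuts(profile, stem_width):
    """Return one candidate cut row per run of 'thin' rows.

    A row is treated as thin when it contains some ink but no more than
    stem_width pixels, i.e. only the connecting stroke is present.
    stem_width is a hypothetical tuning parameter.
    """
    cuts = []
    start = None
    for y, ink in enumerate(profile):
        thin = 0 < ink <= stem_width
        if thin and start is None:
            start = y  # a thin run begins
        elif not thin and start is not None:
            cuts.append((start + y - 1) // 2)  # middle of the run
            start = None
    if start is not None:  # thin run reaches the bottom of the column
        cuts.append((start + len(profile) - 1) // 2)
    return cuts


# Toy column: three glyph bodies joined by a one-pixel-wide stroke.
DEMO_COLUMN = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

cuts = candidate_cuts(row_profile(DEMO_COLUMN), stem_width=1)
```

On the toy column, `cuts` holds the row indices 3 and 7, the midpoints of the two thin runs. A real system would need to validate such candidates further, for example against connected components as the abstract indicates.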
Pages: 93-106
Page count: 14