Multi-font printed Mongolian document recognition system

Cited by: 22
Authors
Peng, Liangrui [1, 2, 3]
Liu, Changsong [1, 2, 3]
Ding, Xiaoqing [1, 2, 3]
Jin, Jianming [4]
Wu, Youshou [1, 2, 3]
Wang, Hua [1, 2, 3]
Bao, Yanhua [5]
Affiliations
[1] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China
[3] Tsinghua Univ, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China
[4] HP Labs China, Beijing 100084, Peoples R China
[5] Hulunbeier Coll, Mongolian Dept, Hailar 021008, Inner Mongolia, Peoples R China
Keywords
Multi-font Mongolian; Character recognition; Character segmentation; Mixed script
DOI
10.1007/s10032-009-0106-8
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents still remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian Optical Character Recognition challenging. As the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on the test samples from real documents with multiple font types and mixed script. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
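The abstract mentions finding segmentation points from projection profiles. As a rough illustration only, the sketch below computes a row-wise projection profile for a vertically written word and proposes cut rows where the ink count drops to stem thickness. This is not the authors' algorithm: the function names, the `max_ink` threshold, and the toy image are assumptions made for illustration, and the connected-component analysis and font-group parameter tuning described in the abstract are omitted.

```python
from typing import List


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (1s) in each row of a binary image."""
    return [sum(row) for row in image]


def candidate_cut_rows(image: List[List[int]], max_ink: int) -> List[int]:
    """Return the middle row of every run of 'thin' rows.

    A row is 'thin' when it holds between 1 and max_ink ink pixels,
    i.e. only the connecting stem is present (hypothetical criterion).
    """
    profile = row_profile(image)
    cuts: List[int] = []
    start = None
    # The sentinel value is never 'thin', so it closes a trailing run.
    for y, ink in enumerate(profile + [max_ink + 1]):
        thin = 0 < ink <= max_ink
        if thin and start is None:
            start = y
        elif not thin and start is not None:
            cuts.append((start + y - 1) // 2)
            start = None
    return cuts


# Toy vertical word: two wide glyph bodies joined by a one-pixel stem.
SAMPLE = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

if __name__ == "__main__":
    print(row_profile(SAMPLE))
    print(candidate_cut_rows(SAMPLE, max_ink=1))
```

In a real system the threshold would presumably depend on stroke width and font group, which is why the abstract speaks of adjusting segmentation parameters per group.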
Pages: 93-106
Number of pages: 14
Related papers
  • [1] Multi-font printed Mongolian document recognition system
    Liangrui Peng
    Changsong Liu
    Xiaoqing Ding
    Jianming Jin
    Youshou Wu
    Hua Wang
    Yanhua Bao
    International Journal on Document Analysis and Recognition (IJDAR), 2010, 13 : 93 - 106
  • [2] Recognition system for printed multi-font and multi-size Arabic characters
    Hamami, L
    Berkani, D
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2002, 27 (1B) : 57 - 72
  • [3] Multi-font recognition of printed Arabic using the BBN Byblos speech recognition system
    LaPre, C
    Zhao, Y
    Raphael, C
    Schwartz, R
    Makhoul, J
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 2136 - 2139
  • [4] A multi-font OCR system for printed Telugu text
    Lakshmi, CV
    Patvardhan, C
    LANGUAGE ENGINEERING CONFERENCE, PROCEEDINGS, 2003, : 7 - 17
  • [5] An approach to multi-font numeral recognition
    Arjun, N. Santosh
    Navaneetha, G.
    Preethi, G. Vishnu
    Babu, T. Karthik
    TENCON 2007 - 2007 IEEE REGION 10 CONFERENCE, VOLS 1-3, 2007, : 459 - 462
  • [6] FNN model for multi-font character recognition
    Wang, L
    Qi, FH
    JOURNAL OF INFRARED AND MILLIMETER WAVES, 1999, 18 (05) : 412 - 416
  • [7] Multi-font Printed Chinese Character Recognition using Multi-pooling Convolutional Neural Network
    Zhong, Zhuoyao
    Jin, Lianwen
    Feng, Ziyong
    2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2015, : 96 - 100
  • [8] Multi-font Rotated Character Recognition using Periodicity
    Hase, Hiroyuki
    Tanabe, Kohei
    Tran, Thi Hong Ha
    Tokai, Shogo
    PROCEEDINGS OF THE 8TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS, 2008, : 253 - 260
  • [9] Fuzzy modeling based recognition of multi-font numerals
    Hanmandlu, M
    Yusof, MHM
    Madasu, VK
    PATTERN RECOGNITION, PROCEEDINGS, 2003, 2781 : 204 - 211
  • [10] New statistical method for multi-font printed Tibetan/English OCR
    Wang, H
    Ding, XQ
    DOCUMENT RECOGNITION AND RETRIEVAL XI, 2004, 5296 : 155 - 165