Enhancing optical character recognition: Efficient techniques for document layout analysis and text line detection

Cited by: 9
Authors
Fateh, Amirreza [1]
Fateh, Mansoor [2]
Abolghasemi, Vahid [3]
Affiliations
[1] Iran Univ Sci & Technol IUST, Sch Comp Engn, Tehran, Iran
[2] Shahrood Univ Technol, Fac Comp Engn, Shahrud, Iran
[3] Univ Essex, Sch Comp Sci & Elect Engn, Colchester, England
Keywords
connected component; document layout analysis; font size; line detection; Persian printed;
DOI
10.1002/eng2.12832
Chinese Library Classification (CLC)
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
In recent years, automatic document and text analysis has gained significant importance, driven by advancements in optical character recognition (OCR) technology and the need for efficient processing of large volumes of printed or handwritten documents. This article specifically focuses on document layout analysis (DLA) and text line detection (TLD), both of which are crucial components of OCR systems. Our objective is to develop an effective method for extracting both textual and non-textual regions, addressing challenges unique to the Persian (and Persian-like) language(s). In the DLA stage, we employ deep learning models and a voting system to accurately determine the regions of interest. Additionally, we introduce methods such as optimum font size concepts, angle correction, and a line curvature elimination algorithm in the TLD process to enhance OCR accuracy. Comparative evaluations against state-of-the-art methods demonstrate the superiority of our approach, showcasing a 2.8% improvement in the accuracy of Tesseract-OCR 5.1.0 (a well-established commercial OCR system) on the official Iranian newspapers dataset. These findings underscore the importance of addressing DLA and TLD challenges to advance OCR technology for Persian language documents and provide a solid foundation for future research in this domain.
Our proposed method introduces several key novelties that contribute to the advancement of optical character recognition (OCR) systems. We collected and presented a valuable dataset for training and evaluating OCR models. Our proposed method successfully addresses challenges associated with document layout analysis (DLA) and text line detection in OCR systems, particularly for the Persian language. We significantly improve the accuracy of OCR systems by employing deep learning models in the DLA stage and implementing a voting system, as well as introducing angle correction methods, optimum font size concepts, and an efficient algorithm to eliminate line curvature.
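The abstract mentions combining deep learning models through a voting system in the DLA stage but does not say how the votes are counted. As a rough illustration only, and not the authors' method, the sketch below assumes each model outputs a binary text/non-text mask of the same size and fuses them with a pixel-wise strict majority vote. The names `Mask`, `majority_vote`, and `predictions` are hypothetical.

```python
from typing import List

# Assumption: each model yields a binary mask (1 = text pixel, 0 = non-text).
Mask = List[List[int]]


def majority_vote(masks: List[Mask]) -> Mask:
    """Fuse equally sized binary masks with a pixel-wise strict majority vote."""
    if not masks:
        raise ValueError("at least one mask is required")
    rows, cols = len(masks[0]), len(masks[0][0])
    for mask in masks:
        if len(mask) != rows or any(len(row) != cols for row in mask):
            raise ValueError("all masks must share the same shape")
    threshold = len(masks) / 2  # strict majority; ties resolve to 0 (non-text)
    return [
        [1 if sum(m[r][c] for m in masks) > threshold else 0 for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical outputs of three layout models on a 2x3 image.
predictions = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
]
fused = majority_vote(predictions)
print(fused)
```

A pixel is kept as text only when more than half of the models agree, so a single model's false positive is suppressed. Whether the paper votes per pixel, per region, or with weights is not stated in the abstract.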
Pages: 26