Enhancing optical character recognition: Efficient techniques for document layout analysis and text line detection

Cited by: 9
Authors
Fateh, Amirreza [1]
Fateh, Mansoor [2]
Abolghasemi, Vahid [3]
Affiliations
[1] Iran Univ Sci & Technol IUST, Sch Comp Engn, Tehran, Iran
[2] Shahrood Univ Technol, Fac Comp Engn, Shahrud, Iran
[3] Univ Essex, Sch Comp Sci & Elect Engn, Colchester, England
Keywords
connected component; document layout analysis; font size; line detection; Persian printed
DOI
10.1002/eng2.12832
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
In recent years, automatic document and text analysis has gained significant importance, driven by advancements in optical character recognition (OCR) technology and the need for efficient processing of large volumes of printed or handwritten documents. This article specifically focuses on document layout analysis (DLA) and text line detection (TLD), both of which are crucial components of OCR systems. Our objective is to develop an effective method for extracting both textual and non-textual regions, addressing challenges unique to the Persian (and Persian-like) language(s). In the DLA stage, we employ deep learning models and a voting system to accurately determine the regions of interest. Additionally, we introduce methods such as optimum font size concepts, angle correction, and a line curvature elimination algorithm in the TLD process to enhance OCR accuracy. Comparative evaluations against state-of-the-art methods demonstrate the superiority of our approach, showcasing a 2.8% improvement in the accuracy of Tesseract-OCR 5.1.0 (a well-established commercial OCR system) on the official Iranian newspapers dataset. These findings underscore the importance of addressing DLA and TLD challenges to advance OCR technology for Persian language documents and provide a solid foundation for future research in this domain. Our proposed method introduces several key novelties that contribute to the advancement of optical character recognition (OCR) systems. We collected and presented a valuable dataset for training and evaluating OCR models. Our proposed method successfully addresses challenges associated with document layout analysis (DLA) and text line detection in OCR systems, particularly for the Persian language. 
We significantly improve the accuracy of OCR systems by employing deep learning models in the DLA stage and implementing a voting system, as well as introducing angle correction methods, optimum font size concepts, and an efficient algorithm to eliminate line curvature.
Pages: 26