Scene Text Recognition with Transformer using Multi-patches

Authors
Wang Y. [1]
Ha J.-E. [2]
Affiliations
[1] Graduate School of Automotive Engineering, Seoul National University of Science and Technology
[2] Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology
Keywords
Deep learning; Scene text recognition; Transformer
DOI
10.5302/J.ICROS.2022.22.0107
Abstract
In this paper, we explore the application of the Vision Transformer (ViT) to scene text recognition (STR). A popular research direction in computer vision, scene text recognition enables computers to read text in natural scenes, such as object labels, text descriptions, and road signs. At present, traditional models based on convolutional neural networks (CNNs) perform better. However, for scene text images with complex backgrounds and irregular text, the performance of CNN-based models is difficult to improve on curved text, diverse fonts, distortions, and similar cases. With the adoption of transformers in computer vision, transformer-based model structures have also developed significantly. Although current transformer-based models can achieve performance similar to that of CNN-based ones, they are still at an early stage of application, and much room remains for research and improvement. We propose a multi-scale vertical rectangular patch model (MSVSTR) that makes a transformer-based feature extractor better suited to text images. Arranging the patches in a single direction lets the cropped patches better match the way text is distributed in a text image. In addition, to accommodate the different numbers of characters in different texts and to make feature extraction more robust, vertical rectangular patches of different scales are used to crop the image. Ablation experiments show that our structure performs better than similar transformer-based STR models, and further experiments show that it performs well on seven benchmarks. © ICROS 2022.
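The patching idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the patch sizes, whether each patch spans the full image height, or how the scales are combined. The sketch assumes full-height strips, hypothetical patch widths of 4, 8, and 16 pixels, a random matrix standing in for a learned linear embedding, and simple concatenation of the token sequences from each scale. The function names `vertical_patches` and `multi_scale_tokens` are illustrative only.

```python
import numpy as np


def vertical_patches(image, patch_width):
    """Split an (H, W, C) image into full-height vertical strips.

    Returns an array of shape (W // patch_width, H * patch_width * C),
    one flattened strip per row, ordered left to right.
    """
    height, width, channels = image.shape
    if width % patch_width != 0:
        raise ValueError("image width must be divisible by patch_width")
    num_patches = width // patch_width
    # Split the width axis into (num_patches, patch_width), then move the
    # patch index to the front: (H, N, pw, C) -> (N, H, pw, C).
    strips = image.reshape(height, num_patches, patch_width, channels)
    strips = strips.transpose(1, 0, 2, 3)
    return strips.reshape(num_patches, height * patch_width * channels)


def multi_scale_tokens(image, patch_widths, embed_dim, seed=0):
    """Embed strips of several widths and concatenate them into one sequence.

    Assumption: a random projection stands in for the learned linear
    embedding that a real ViT-style model would train.
    """
    rng = np.random.default_rng(seed)
    tokens = []
    for patch_width in patch_widths:
        flat = vertical_patches(image, patch_width)
        scale = 1.0 / np.sqrt(flat.shape[1])
        projection = rng.standard_normal((flat.shape[1], embed_dim)) * scale
        tokens.append(flat @ projection)
    # Tokens from all scales form a single sequence for the transformer.
    return np.concatenate(tokens, axis=0)


# Hypothetical input size: a 32 x 128 RGB text crop.
image = np.zeros((32, 128, 3), dtype=np.float32)
tokens = multi_scale_tokens(image, patch_widths=(4, 8, 16), embed_dim=64)
print(tokens.shape)  # (56, 64): 32 + 16 + 8 strips, each embedded to 64 dims
```

Because every strip covers the whole height, the token order follows the left-to-right reading order of horizontal text, and narrower strips give finer coverage of short characters while wider strips capture more context per token.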
Pages: 862-867
Page count: 5