Scene Text Recognition with Transformer using Multi-patches

Authors
Wang Y. [1]
Ha J.-E. [2]
Affiliations
[1] Graduate School of Automotive Engineering, Seoul National University of Science and Technology
[2] Department of Mechanical and Automotive Engineering, Seoul National University of Science and Technology
Keywords
Deep learning; Scene text recognition; Transformer
DOI
10.5302/J.ICROS.2022.22.0107
Abstract
In this paper, we explore the application of the Vision Transformer (ViT) to the scene text recognition (STR) task. A popular research direction in computer vision, scene text recognition enables computers to read text in natural scenes, such as object labels, text descriptions, and road signs. At present, traditional convolutional neural network (CNN)-based models perform better. Still, with complex backgrounds and irregular scene text images, the performance of CNN-based models is hard to improve on curved text, diverse fonts, distortions, and similar cases. With the application of transformers in computer vision, transformer-based model structures have also developed significantly. Although current transformer-based models achieve performance similar to that of CNN-based ones, they are at an early stage of application, and there is much room for research and improvement. We propose a multi-scale vertical rectangular patch model (MSVSTR) that makes the transformer-based feature extractor more suitable for text images. Because the patches are arranged in a single direction, cropping the image into patches better matches how text is distributed in a text image. In addition, to handle the different numbers of characters in different texts and to extract more robust features, vertical rectangular patches of different scales are used to crop the image. In various ablation experiments, our structure performs better than similar transformer-based STR models. Experiments also show that our structure performs well on seven benchmarks. © ICROS 2022.
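The patching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give patch sizes or the image resolution, so the full-height strips, the widths 4 and 8, the 32x128 input, and the function names `vertical_patches` and `multi_scale_vertical_patches` are all assumptions made for illustration.

```python
import numpy as np


def vertical_patches(image: np.ndarray, patch_width: int) -> np.ndarray:
    """Split an (H, W, C) image into full-height vertical strips.

    Assumption: each patch spans the whole image height, so patches are
    arranged along the horizontal (reading) direction only.
    Returns an array of shape (W // patch_width, H * patch_width * C),
    one flattened strip per row, ordered left to right.
    """
    h, w, c = image.shape
    if w % patch_width != 0:
        raise ValueError("image width must be divisible by patch_width")
    n = w // patch_width
    # (H, W, C) -> (H, n, pw, C) -> (n, H, pw, C) -> (n, H * pw * C)
    strips = image.reshape(h, n, patch_width, c).transpose(1, 0, 2, 3)
    return strips.reshape(n, h * patch_width * c)


def multi_scale_vertical_patches(image: np.ndarray, patch_widths=(4, 8)) -> dict:
    """Crop the same image with several strip widths (hypothetical scales)."""
    return {pw: vertical_patches(image, pw) for pw in patch_widths}


# Hypothetical 32x128 RGB text image filled with dummy values.
image = np.arange(32 * 128 * 3, dtype=np.float32).reshape(32, 128, 3)
patches = multi_scale_vertical_patches(image, patch_widths=(4, 8))
for width, seq in patches.items():
    print(width, seq.shape)
```

In a ViT-style model, each flattened strip would then typically pass through a linear projection to become one token. How the sequences from the different scales are combined is not stated in the abstract, so it is left out here.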
Pages: 862-867
Number of pages: 5