2LSPE: 2D Learnable Sinusoidal Positional Encoding using Transformer for Scene Text Recognition

Cited by: 9
Authors
Raisi, Zobeir [1 ]
Naiel, Mohamed A. [1 ]
Younes, Georges [1 ]
Wardell, Steven [2 ]
Zelek, John [1 ]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[2] ATS Automat Tooling Syst Inc, Cambridge, ON, Canada
Keywords
Transformer; 2D Learnable Sinusoidal Positional Encoding; Irregular Text; Scene Text Recognition; NETWORK;
DOI
10.1109/CRV52889.2021.00024
Chinese Library Classification (CLC) Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Positional Encoding (PE) plays a vital role in a Transformer's ability to capture the order of sequential information, allowing it to overcome the permutation-equivariance property of self-attention. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged the 2D form of PE with fixed sinusoidal frequencies, also known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, mostly on recognizing text of arbitrary shapes; however, their fixed frequencies are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, which we hereafter refer to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture to scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests and is no worse than the second-best performer on the rest, while outperforming all other methods on irregular-text datasets (i.e., text laid out neither horizontally nor vertically). Experimental results demonstrate that the proposed method offers higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, all while maintaining the highest WRA values on the regular-text datasets.
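To make the idea in the abstract concrete, the sketch below shows one way a 2D positional encoding with learnable sinusoidal frequencies could be implemented in PyTorch. It is only an illustrative reading of the abstract, not the authors' code: the module name Learnable2DSinusoidalPE, the initialization of the frequencies from the classic fixed-sinusoid schedule, and the even split of channels between the vertical and horizontal axes are all assumptions.

# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn

class Learnable2DSinusoidalPE(nn.Module):
    """2D positional encoding whose sinusoidal frequencies are trainable,
    so they can adapt to the data instead of staying fixed."""

    def __init__(self, d_model: int, temperature: float = 10000.0):
        super().__init__()
        assert d_model % 4 == 0, "d_model must be divisible by 4 (sin/cos for x and y)"
        quarter = d_model // 4
        # Start from the classic fixed-sinusoid frequency schedule, then let
        # backpropagation update the frequencies during training.
        init_freq = 1.0 / (temperature ** (torch.arange(quarter).float() / quarter))
        self.freq_x = nn.Parameter(init_freq.clone())
        self.freq_y = nn.Parameter(init_freq.clone())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, d_model, H, W) feature map, e.g. from a CNN backbone.
        b, d, h, w = feats.shape
        ys = torch.arange(h, dtype=feats.dtype, device=feats.device)  # row positions
        xs = torch.arange(w, dtype=feats.dtype, device=feats.device)  # column positions

        # Outer product of positions and (learnable) frequencies.
        y_angles = ys[:, None] * self.freq_y[None, :]  # (H, d/4)
        x_angles = xs[:, None] * self.freq_x[None, :]  # (W, d/4)

        # Per-axis sin/cos encodings, broadcast over the other axis and concatenated.
        pe_y = torch.cat([y_angles.sin(), y_angles.cos()], dim=-1)  # (H, d/2)
        pe_x = torch.cat([x_angles.sin(), x_angles.cos()], dim=-1)  # (W, d/2)
        pe = torch.cat([pe_y[:, None, :].expand(h, w, -1),
                        pe_x[None, :, :].expand(h, w, -1)], dim=-1)  # (H, W, d_model)

        # Add the encoding to the features before they are flattened into the
        # Transformer's input sequence.
        return feats + pe.permute(2, 0, 1).unsqueeze(0)

Initializing the trainable frequencies at the standard 1/temperature^(i/quarter) schedule means the module starts out identical to a fixed 2D sinusoidal encoding (2SPE) and only departs from it as training adapts the frequencies to the data; a typical use would be pe = Learnable2DSinusoidalPE(d_model=512) applied to the backbone feature map before the Transformer encoder.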
Pages: 119-126
Number of pages: 8
Related Papers
50 records in total
  • [1] Transformer-based multiple instance learning network with 2D positional encoding for histopathology image classification
    Bin Yang
    Lei Ding
    Jianqiang Li
    Yong Li
    Guangzhi Qu
    Jingyi Wang
    Qiang Wang
    Bo Liu
    Complex & Intelligent Systems, 2025, 11 (5)
  • [2] Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition
    Chi, Hongmei
    Cai, Jiaxin
    Li, Xinran
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): 7817-7827
  • [3] 2D and 3D Video Scene Text Classification
    Xu, Jiamin
    Shivakumara, Palaiahnakote
    Lu, Tong
    Tan, Chew Lim
    2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2014: 2932-2937
  • [4] Text Recognition for 2D Bridge Plans Using OCR-Algorithms
    Peng, Mengyan
    Kang, Chongjie
    Marx, Steffen
    EUROPEAN ASSOCIATION ON QUALITY CONTROL OF BRIDGES AND STRUCTURES, EUROSTRUCT 2023, VOL 6, ISS 5, 2023: 661-666
  • [5] A Supervisory Hierarchical Control Approach for Text to 2D Scene Generation
    Cheng, Yu
    Sun, Zhiyong
    Bi, Sheng
    Li, Congjian
    Xi, Ning
    2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (IEEE ROBIO 2017), 2017: 2261-2266
  • [6] Calibration of mobile manipulators using 2D positional features
    Shah, Mili
    Bostelman, Roger
    Legowik, Steven
    Hong, Tsai
    MEASUREMENT, 2018, 124: 322-328
  • [7] USING 2D TENSOR VOTING IN TEXT DETECTION
    Toan Nguyen
    Park, Jonghyun
    Lee, Gueesang
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010: 818-821
  • [8] Partial discharge recognition system for current transformer using neural network and 2D wavelet transform
    Chang, Hong-Chan
    Kuo, Ying-Piao
    Lin, Han-Wei
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2012, 7 (02): 144-151
  • [9] Deep transformer: A framework for 2D text image rectification from planar transformations
    Yan, Chengzhe
    Hu, Jie
    Zhang, Changshui
    NEUROCOMPUTING, 2018, 289: 32-43