2LSPE: 2D Learnable Sinusoidal Positional Encoding using Transformer for Scene Text Recognition

Cited by: 9
Authors
Raisi, Zobeir [1 ]
Naiel, Mohamed A. [1 ]
Younes, Georges [1 ]
Wardell, Steven [2 ]
Zelek, John [1 ]
Affiliations
[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada
[2] ATS Automat Tooling Syst Inc, Cambridge, ON, Canada
Keywords
Transformer; 2D Learnable Sinusoidal Positional Encoding; Irregular Text; Scene Text Recognition; NETWORK;
DOI
10.1109/CRV52889.2021.00024
Chinese Library Classification (CLC) Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Positional Encoding (PE) plays a vital role in a Transformer's ability to capture the order of sequential information, allowing it to overcome the permutation-equivariance property of self-attention. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged the 2D form of PE with fixed sinusoidal frequencies, also known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, mostly on recognizing text of arbitrary shapes; however, their fixed frequencies are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, which we hereafter refer to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture to scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests and is no worse than the second-best performer on the rest, while outperforming all other methods on irregular-text datasets (i.e., text laid out neither horizontally nor vertically). Experimental results demonstrate that the proposed method offers higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, all while maintaining the highest WRA values on the regular-text datasets.
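To make the idea in the abstract concrete, the sketch below shows one way a 2D positional encoding with learnable sinusoidal frequencies could be implemented in PyTorch. It is only an illustrative reading of the abstract, not the authors' code: the module name Learnable2DSinusoidalPE, the initialization of the frequencies from the classic fixed-sinusoid schedule, and the even split of channels between the vertical and horizontal axes are all assumptions.

# Minimal sketch (assumptions noted above), not the paper's implementation.
import torch
import torch.nn as nn

class Learnable2DSinusoidalPE(nn.Module):
    """2D positional encoding whose sinusoidal frequencies are trainable,
    so they can adapt to the data instead of staying fixed."""

    def __init__(self, d_model: int, temperature: float = 10000.0):
        super().__init__()
        assert d_model % 4 == 0, "d_model must be divisible by 4 (sin/cos for x and y)"
        quarter = d_model // 4
        # Start from the classic fixed-sinusoid frequency schedule, then let
        # backpropagation update the frequencies during training.
        init_freq = 1.0 / (temperature ** (torch.arange(quarter).float() / quarter))
        self.freq_x = nn.Parameter(init_freq.clone())
        self.freq_y = nn.Parameter(init_freq.clone())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, d_model, H, W) feature map, e.g. from a CNN backbone.
        b, d, h, w = feats.shape
        ys = torch.arange(h, dtype=feats.dtype, device=feats.device)  # row positions
        xs = torch.arange(w, dtype=feats.dtype, device=feats.device)  # column positions

        # Outer product of positions and (learnable) frequencies.
        y_angles = ys[:, None] * self.freq_y[None, :]  # (H, d/4)
        x_angles = xs[:, None] * self.freq_x[None, :]  # (W, d/4)

        # Per-axis sin/cos encodings, broadcast over the other axis and concatenated.
        pe_y = torch.cat([y_angles.sin(), y_angles.cos()], dim=-1)  # (H, d/2)
        pe_x = torch.cat([x_angles.sin(), x_angles.cos()], dim=-1)  # (W, d/2)
        pe = torch.cat([pe_y[:, None, :].expand(h, w, -1),
                        pe_x[None, :, :].expand(h, w, -1)], dim=-1)  # (H, W, d_model)

        # Add the encoding to the features before they are flattened into the
        # Transformer's input sequence.
        return feats + pe.permute(2, 0, 1).unsqueeze(0)

Initializing the trainable frequencies at the standard 1/temperature^(i/quarter) schedule means the module starts out identical to a fixed 2D sinusoidal encoding (2SPE) and only departs from it as training adapts the frequencies to the data; a typical use would be pe = Learnable2DSinusoidalPE(d_model=512) applied to the backbone feature map before the Transformer encoder.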
Pages: 119-126
Number of pages: 8
Related Papers
50 records in total
  • [1] Transformer-based multiple instance learning network with 2D positional encoding for histopathology image classification
    Bin Yang
    Lei Ding
    Jianqiang Li
    Yong Li
    Guangzhi Qu
    Jingyi Wang
    Qiang Wang
    Bo Liu
    Complex & Intelligent Systems, 2025, 11 (5)
  • [2] Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition
    Chi, Hongmei
    Cai, Jiaxin
    Li, Xinran
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (14): 7817-7827
  • [3] 2D and 3D Video Scene Text Classification
    Xu, Jiamin
    Shivakumara, Palaiahnakote
    Lu, Tong
    Tan, Chew Lim
    2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2014: 2932-2937
  • [4] Text Recognition for 2D Bridge Plans Using OCR-Algorithms
    Peng, Mengyan
    Kang, Chongjie
    Marx, Steffen
    EUROPEAN ASSOCIATION ON QUALITY CONTROL OF BRIDGES AND STRUCTURES, EUROSTRUCT 2023, VOL 6, ISS 5, 2023: 661-666
  • [5] A Supervisory Hierarchical Control Approach for Text to 2D Scene Generation
    Cheng, Yu
    Sun, Zhiyong
    Bi, Sheng
    Li, Congjian
    Xi, Ning
    2017 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (IEEE ROBIO 2017), 2017: 2261-2266
  • [6] Calibration of mobile manipulators using 2D positional features
    Shah, Mili
    Bostelman, Roger
    Legowik, Steven
    Hong, Tsai
    MEASUREMENT, 2018, 124: 322-328
  • [7] USING 2D TENSOR VOTING IN TEXT DETECTION
    Toan Nguyen
    Park, Jonghyun
    Lee, Gueesang
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010: 818-821
  • [8] Partial discharge recognition system for current transformer using neural network and 2D wavelet transform
    Chang, Hong-Chan
    Kuo, Ying-Piao
    Lin, Han-Wei
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2012, 7 (02): 144-151
  • [9] Deep transformer: A framework for 2D text image rectification from planar transformations
    Yan, Chengzhe
    Hu, Jie
    Zhang, Changshui
    NEUROCOMPUTING, 2018, 289: 32-43