2LSPE: 2D Learnable Sinusoidal Positional Encoding using Transformer for Scene Text Recognition

Cited by: 9

Authors:

Raisi, Zobeir [1]

Naiel, Mohamed A. [1]

Younes, Georges [1]

Wardell, Steven [2]

Zelek, John [1]

Affiliations:

[1] Univ Waterloo, Waterloo, ON N2L 3G1, Canada

[2] ATS Automat Tooling Syst Inc, Cambridge, ON, Canada

Source:

2021 18TH CONFERENCE ON ROBOTS AND VISION (CRV 2021) | 2021

Keywords:

Transformer; 2D Learnable Sinusoidal Positional Encoding; Irregular Text; Scene Text Recognition; NETWORK;

DOI:

10.1109/CRV52889.2021.00024

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Positional Encoding (PE) plays a vital role in a Transformer's ability to capture the order of sequential information, allowing it to overcome the permutation equivariance property. Recent state-of-the-art Transformer-based scene text recognition methods have leveraged the advantages of the 2D form of PE with fixed sinusoidal frequencies, also known as 2SPE, to better encode the 2D spatial dependencies of characters in a scene text image. These 2SPE-based Transformer frameworks have outperformed Recurrent Neural Network (RNN)-based methods, mostly on recognizing text of arbitrary shapes; however, they are not tailored to the type of data and classification task at hand. In this paper, we extend a recent Learnable Sinusoidal frequencies PE (LSPE) from 1D to 2D, which we hereafter refer to as 2LSPE, and study how to adaptively choose the sinusoidal frequencies from the input training data. Moreover, we show how to apply the proposed Transformer architecture for scene text recognition. We compare our method against 11 state-of-the-art methods and show that it outperforms them in over 50% of the standard tests and is no worse than the second-best performer, whereas we outperform all other methods on irregular text datasets (i.e., non-horizontal or vertical layouts). Experimental results demonstrate that the proposed method offers higher word recognition accuracy (WRA) than two recent Transformer-based methods and eleven state-of-the-art RNN-based techniques on four challenging irregular-text recognition datasets, all while maintaining the highest WRA values on the regular-text datasets.

Pages: 119-126

Page count: 8

