Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

Cited by: 0
Authors
Gao, Xinjian [1 ]
Pang, Ye [2 ]
Liu, Yuyu [2 ]
Han, Maokun [2 ]
Yu, Jun [1 ]
Wang, Wei [2 ]
Chen, Yuanxu [2 ]
Affiliations
[1] Univ Sci & Technol China USTC, 96 Jinzhai Rd, Hefei 230026, Peoples R China
[2] Ping An Technol Co Ltd, 3 Xinyuan South Rd, Beijing 100016, Peoples R China
Keywords
Scene text recognition; vision transformer; self-supervised learning
DOI
10.1145/3646551
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only optimizes the joint probability of the estimated characters generated by the Vision Model (VM) within a single language modality, ignoring the visual-semantic relations across modalities. Thus, LM-based methods can hardly generalize to challenging conditions in which the text has weak or multiple semantics, arbitrary shapes, and so on. To mitigate this issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate scene text recognition. Specifically, MVSTRN builds a bridge between vision and language through its unified architecture and can reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, bridging the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the multimodal visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that MVSTRN achieves state-of-the-art performance on several benchmarks.
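The abstract describes fusing per-character visual features from the VM with textual features from the LM in an MMF module before the final character predictions. This record gives no implementation details, so the following is only a minimal sketch of one plausible gated-fusion design under stated assumptions; the module name GatedFusion, the feature width d_model, the class count, and the gating formulation are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: a gated fusion of visual and textual character
# features in the spirit of the MMF module described in the abstract.
# All names and sizes (GatedFusion, d_model, num_classes, ...) are assumptions
# for demonstration; they are not taken from the paper.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse per-character visual and language features with a learned gate."""

    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat, lang_feat: (batch, max_len, d_model)
        g = self.gate(torch.cat([vis_feat, lang_feat], dim=-1))
        fused = g * vis_feat + (1.0 - g) * lang_feat   # element-wise gated mix
        return self.classifier(fused)                   # (batch, max_len, num_classes)


if __name__ == "__main__":
    fusion = GatedFusion(d_model=256, num_classes=97)   # 97: e.g. alphanumerics plus symbols
    vis = torch.randn(2, 25, 256)    # stand-in for vision-model (VM) character features
    lang = torch.randn(2, 25, 256)   # stand-in for language-model (LM) refined features
    logits = fusion(vis, lang)
    print(logits.shape)              # torch.Size([2, 25, 97])
```

The gate lets the network lean on visual evidence where linguistic context is weak (e.g., random alphanumeric strings) and on language context where the image is degraded, which matches the complementary roles of VM and LM sketched in the abstract.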
Pages: 18
Related Papers
50 records in total
  • [21] A novel image captioning model with visual-semantic similarities and visual representations re-weighting
    Thobhani, Alaa
    Zou, Beiji
    Kui, Xiaoyan
    Al-Shargabi, Asma A.
    Derea, Zaid
    Abdussalam, Amr
    Asham, Mohammed A.
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2024, 36 (07)
  • [22] Transductive Visual-Semantic Embedding for Zero-shot Learning
    Xu, Xing
    Shen, Fumin
    Yang, Yang
    Shao, Jie
    Huang, Zi
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 41 - 49
  • [23] Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation
    Li, Qiaozhe
    Zhao, Xin
    He, Ran
    Huang, Kaiqi
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 833 - 839
  • [24] Indirect visual-semantic alignment for generalized zero-shot recognition
    Chen, Yan-He
    Yeh, Mei-Chen
    MULTIMEDIA SYSTEMS, 2024, 30 (02)
  • [25] Transformer-Enhanced Visual-Semantic Representation for Text-Image Retrieval
    Zhang, Meng
    Wu, Wei
    Zhang, Haotian
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2022, : 2042 - 2048
  • [26] Curriculum learning for scene text recognition
    Yan, Jingzhe
    Tao, Yuefeng
    Zhang, Wanjun
    JOURNAL OF ELECTRONIC IMAGING, 2021, 30 (04)
  • [27] Visual-semantic graph neural network with pose-position attentive learning for group activity recognition
    Liu, Tianshan
    Zhao, Rui
    Lam, Kin-Man
    Kong, Jun
    NEUROCOMPUTING, 2022, 491 : 217 - 231
  • [28] Visual attention models for scene text recognition
    Ghosh, Suman K.
    Valveny, Ernest
    Bagdanov, Andrew D.
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 943 - 948
  • [29] Display-Semantic Transformer for Scene Text Recognition
    Yang, Xinqi
    Silamu, Wushour
    Xu, Miaomiao
    Li, Yanbing
    SENSORS, 2023, 23 (19)
  • [30] On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval
    Gong, Yan
    Cosma, Georgina
    Fang, Hui
    JOURNAL OF IMAGING, 2021, 7 (08)