Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

Cited by: 0
Authors
Gao, Xinjian [1 ]
Pang, Ye [2 ]
Liu, Yuyu [2 ]
Han, Maokun [2 ]
Yu, Jun [1 ]
Wang, Wei [2 ]
Chen, Yuanxu [2 ]
Affiliations
[1] Univ Sci & Technol China USTC, 96 Jinzhai Rd, Hefei 230026, Peoples R China
[2] Ping An Technol Co Ltd, 3 Xinyuan South Rd, Beijing 100016, Peoples R China
Keywords
Scene text recognition; vision transformer; self-supervised learning
DOI
10.1145/3646551
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only optimizes the joint probability of the estimated characters generated by the Vision Model (VM) within a single language modality, ignoring the visual-semantic relations across modalities. Thus, LM-based methods can hardly generalize to challenging conditions in which the text has weak or multiple semantics, arbitrary shapes, and so on. To mitigate this issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate scene text recognition. Specifically, MVSTRN builds a bridge between vision and language through its unified architecture and can reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, bridging the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the multimodal visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that MVSTRN achieves state-of-the-art performance on several benchmarks.
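The abstract describes fusing per-character visual features from the VM with textual features from the LM in an MMF module before the final character predictions. This record gives no implementation details, so the following is only a minimal sketch of one plausible gated-fusion design under stated assumptions; the module name GatedFusion, the feature width d_model, the class count, and the gating formulation are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: a gated fusion of visual and textual character
# features in the spirit of the MMF module described in the abstract.
# All names and sizes (GatedFusion, d_model, num_classes, ...) are assumptions
# for demonstration; they are not taken from the paper.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse per-character visual and language features with a learned gate."""

    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, vis_feat: torch.Tensor, lang_feat: torch.Tensor) -> torch.Tensor:
        # vis_feat, lang_feat: (batch, max_len, d_model)
        g = self.gate(torch.cat([vis_feat, lang_feat], dim=-1))
        fused = g * vis_feat + (1.0 - g) * lang_feat   # element-wise gated mix
        return self.classifier(fused)                   # (batch, max_len, num_classes)


if __name__ == "__main__":
    fusion = GatedFusion(d_model=256, num_classes=97)   # 97: e.g. alphanumerics plus symbols
    vis = torch.randn(2, 25, 256)    # stand-in for vision-model (VM) character features
    lang = torch.randn(2, 25, 256)   # stand-in for language-model (LM) refined features
    logits = fusion(vis, lang)
    print(logits.shape)              # torch.Size([2, 25, 97])
```

The gate lets the network lean on visual evidence where linguistic context is weak (e.g., random alphanumeric strings) and on language context where the image is degraded, which matches the complementary roles of VM and LM sketched in the abstract.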
Pages: 18
Related Papers
50 records in total
  • [21] A novel image captioning model with visual-semantic similarities and visual representations re-weighting
    Thobhani, Alaa
    Zou, Beiji
    Kui, Xiaoyan
    Al-Shargabi, Asma A.
    Derea, Zaid
    Abdussalam, Amr
    Asham, Mohammed A.
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2024, 36 (07)
  • [22] Transductive Visual-Semantic Embedding for Zero-shot Learning
    Xu, Xing
    Shen, Fumin
    Yang, Yang
    Shao, Jie
    Huang, Zi
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 41 - 49
  • [23] Pedestrian Attribute Recognition by Joint Visual-semantic Reasoning and Knowledge Distillation
    Li, Qiaozhe
    Zhao, Xin
    He, Ran
    Huang, Kaiqi
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 833 - 839
  • [24] Indirect visual-semantic alignment for generalized zero-shot recognition
    Chen, Yan-He
    Yeh, Mei-Chen
    MULTIMEDIA SYSTEMS, 2024, 30 (02)
  • [25] Transformer-Enhanced Visual-Semantic Representation for Text-Image Retrieval
    Zhang, Meng
    Wu, Wei
    Zhang, Haotian
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE, CCDC, 2022, : 2042 - 2048
  • [26] Curriculum learning for scene text recognition
    Yan, Jingzhe
    Tao, Yuefeng
    Zhang, Wanjun
    JOURNAL OF ELECTRONIC IMAGING, 2021, 30 (04)
  • [27] Visual-semantic graph neural network with pose-position attentive learning for group activity recognition
    Liu, Tianshan
    Zhao, Rui
    Lam, Kin-Man
    Kong, Jun
    NEUROCOMPUTING, 2022, 491 : 217 - 231
  • [28] Visual attention models for scene text recognition
    Ghosh, Suman K.
    Valveny, Ernest
    Bagdanov, Andrew D.
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), VOL 1, 2017, : 943 - 948
  • [29] Display-Semantic Transformer for Scene Text Recognition
    Yang, Xinqi
    Silamu, Wushour
    Xu, Miaomiao
    Li, Yanbing
    SENSORS, 2023, 23 (19)
  • [30] On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval
    Gong, Yan
    Cosma, Georgina
    Fang, Hui
    JOURNAL OF IMAGING, 2021, 7 (08)