Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

Cited by: 0
Authors
Gao, Xinjian [1]
Pang, Ye [2]
Liu, Yuyu [2]
Han, Maokun [2]
Yu, Jun [1]
Wang, Wei [2]
Chen, Yuanxu [2]
Affiliations
[1] Univ Sci & Technol China USTC, 96 Jinzhai Rd, Hefei 230026, Peoples R China
[2] Ping An Technol Co Ltd, 3 Xinyuan South Rd, Beijing 100016, Peoples R China
Keywords
Scene text recognition; vision transformer; self-supervised learning
DOI
10.1145/3646551
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has made remarkable progress. However, the LM only optimizes the joint probability of the estimated characters generated by the Vision Model (VM) in a single language modality, ignoring the visual-semantic relations across modalities. As a result, LM-based methods generalize poorly to challenging conditions in which the text has weak or multiple semantics, an arbitrary shape, and so on. To mitigate this issue, we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN), which reasons over and combines multimodal visual-semantic information for accurate scene text recognition. Specifically, MVSTRN builds a bridge between vision and language through its unified architecture, and it learns to reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, bridging the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the multimodal visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that MVSTRN achieves state-of-the-art performance on several benchmarks.
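
The abstract does not describe the internals of the MMF module, so the following is only a minimal sketch of one common way to combine per-character visual and semantic features: an element-wise gated fusion. The function names (`gated_fusion`, `softmax`), the gate parameters, and the toy three-letter alphabet are all assumptions made for illustration and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function, used here to produce a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(visual_feat, semantic_feat, gate_weights, gate_bias):
    """Hypothetical gated fusion for one character position.

    For each dimension i:
        g_i     = sigmoid(w_i . [visual; semantic] + b_i)
        fused_i = g_i * visual_i + (1 - g_i) * semantic_i
    This is a generic stand-in, not the MMF module from the paper.
    """
    concat = visual_feat + semantic_feat  # list concatenation: [v; s]
    fused = []
    for i in range(len(visual_feat)):
        pre_activation = sum(w * x for w, x in zip(gate_weights[i], concat))
        g = sigmoid(pre_activation + gate_bias[i])
        fused.append(g * visual_feat[i] + (1.0 - g) * semantic_feat[i])
    return fused


if __name__ == "__main__":
    alphabet = "abc"  # toy alphabet, assumed for the example
    visual_logits = [2.0, 0.5, 0.1]    # stand-in for VM output
    semantic_logits = [0.2, 1.5, 0.1]  # stand-in for LM output
    dim = len(alphabet)
    # Zero weights and bias give a gate of 0.5, i.e. a plain average.
    weights = [[0.0] * (2 * dim) for _ in range(dim)]
    bias = [0.0] * dim
    fused = gated_fusion(visual_logits, semantic_logits, weights, bias)
    probs = softmax(fused)
    print(alphabet[probs.index(max(probs))])
```

In a trained model the gate weights would be learned, so the network could lean on the visual branch when the text carries weak semantics and on the language branch when the image is degraded.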
Pages: 18