Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network

Cited by: 0
Authors
Wei, Jiajun [1]
Zhan, Hongjian [1]
Lu, Yue [1]
Tu, Xiao [1]
Yin, Bing [2]
Liu, Cong [2]
Pal, Umapada [3]
Affiliations
[1] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai, Peoples R China
[2] iFLYTEK, iFLYTEK Res, Hefei, Peoples R China
[3] Indian Stat Inst, CVPR Unit, Kolkata, India
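Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract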
Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused either on extracting more robust visual features or on designing better language modeling. How to model vision and language jointly and effectively, so as to mitigate heavy reliance on a single modality, remains an open problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). First, treating the image as a language through balanced concatenation along the length dimension alleviates over-reliance on either vision or language. Second, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights through masked modality modeling (MMM). Third, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which uses the vision-language prediction as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and on more challenging datasets, with both synthetic and real training data, compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.
Pages: 5885-5893
Number of pages: 9
Related papers
41 in total
  • [21] Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
    Li, Rui
    Fischer, Tobias
    Segu, Mattia
    Pollefeys, Marc
    Van Gool, Luc
    Tombari, Federico
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 9848 - 9858
  • [22] Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval
    Tang, Xu
    Huang, Dabiao
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [23] SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly
    Zhu, Liuwan
    Ning, Rui
    Li, Jiang
    Xin, Chunsheng
    Wu, Hongyi
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024: 7766 - 7774
  • [24] The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model
    Ye Y.
    Zhu Q.
    Xiao S.
    Zhang K.
    Zeng W.
    Proceedings of the ACM on Human-Computer Interaction, 2024, 8 (CSCW1)
  • [25] Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
    Singh, Harman
    Zhang, Pengchuan
    Wang, Qifan
    Wang, Mengjiao
    Xiong, Wenhan
    Du, Jingfei
    Chen, Yu
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023: 869 - 893
  • [26] Efficient text-image semantic search: A multi-modal vision-language approach for fashion retrieval
    Moro, Gianluca
    Salvatori, Stefano
    Frisoni, Giacomo
    NEUROCOMPUTING, 2023, 538
  • [27] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
    Moon, Jong Hak
    Lee, Hyungyung
    Shin, Woncheol
    Kim, Young-Hak
    Choi, Edward
    IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (12) : 6070 - 6080
  • [28] Advancing Real-World Stereoscopic Image Super-Resolution via Vision-Language Model
    Zhang, Zhe
    Lei, Jianjun
    Peng, Bo
    Zhu, Jie
    Xu, Liying
    Huang, Qingming
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 2187 - 2197
  • [29] Image caption generation via improved vision-language pre-training model: perception towards image retrieval
    Padate, Roshni
    Gupta, Ashutosh
    Kalla, Mukesh
    Sharma, Arvind
    IMAGING SCIENCE JOURNAL, 2025
  • [30] Reject Decoding via Language-Vision Models for Text-to-Image Synthesis
    Wu, Fuxiang
    Liu, Liu
    Hao, Fusheng
    He, Fengxiang
    Wang, Lei
    Cheng, Jun
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023: 2785 - 2794