Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network

Authors
Wei, Jiajun [1]
Zhan, Hongjian [1]
Lu, Yue [1]
Tu, Xiao [1]
Yin, Bing [2]
Liu, Cong [2]
Pal, Umapada [3]
Affiliations
[1] East China Normal Univ, Shanghai Key Lab Multidimens Informat Proc, Shanghai, Peoples R China
[2] iFLYTEK, iFLYTEK Res, Hefei, Peoples R China
[3] Indian Stat Inst, CVPR Unit, Kolkata, India
Funding
National Natural Science Foundation of China
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Scene text recognition is inherently a vision-language task. However, previous works have predominantly focused on either extracting more robust visual features or designing better language models. How to model vision and language jointly and effectively, so as to mitigate heavy reliance on a single modality, remains an open problem. In this paper, aiming to enhance vision-language reasoning in scene text recognition, we present a balanced, unified and synchronized vision-language reasoning network (BUSNet). First, BUSNet revisits the image as a language: balanced concatenation of the two modalities along the length dimension alleviates over-reliance on either vision or language. Second, BUSNet learns an ensemble of unified external and internal vision-language models with shared weights through masked modality modeling (MMM). Third, a novel vision-language reasoning module (VLRM) with synchronized vision-language decoding capacity is proposed. Additionally, BUSNet achieves improved performance through iterative reasoning, which feeds the vision-language prediction back in as a new language input. Extensive experiments indicate that BUSNet achieves state-of-the-art performance on several mainstream benchmark datasets and on more challenging datasets, for both synthetic and real training data, compared to recent outstanding methods. Code and dataset will be available at https://github.com/jjwei66/BUSNet.
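The abstract names balanced concatenation along the length dimension and masked modality modeling but does not give their implementation. The following is only a toy sketch of one plausible reading, not BUSNet's actual code: it assumes the vision and language streams are token sequences of equal length, and that MMM hides one whole modality at a time so a single shared model sees vision-only, language-only, and joint inputs. The names `balanced_concat`, `mask_modality`, `MASK`, and `MODES` are hypothetical and introduced here for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for a hidden position
MODES = ("vision", "language", "both")  # which modality stays visible


def balanced_concat(vision_tokens, language_tokens):
    """Join both modalities along the length dimension.

    Assumption: "balanced" means each modality contributes the same
    number of positions, so neither dominates the joint sequence.
    """
    if len(vision_tokens) != len(language_tokens):
        raise ValueError("balanced concatenation assumes equal lengths")
    return list(vision_tokens) + list(language_tokens)


def mask_modality(sequence, keep):
    """Hide one whole half of a balanced sequence.

    keep="vision"   -> language half is masked (vision-only input)
    keep="language" -> vision half is masked (language-only input)
    keep="both"     -> nothing is masked (joint vision-language input)
    """
    half = len(sequence) // 2
    if keep == "vision":
        return sequence[:half] + [MASK] * half
    if keep == "language":
        return [MASK] * half + sequence[half:]
    if keep == "both":
        return list(sequence)
    raise ValueError(f"unknown mode: {keep!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    vision = ["v0", "v1", "v2"]   # stand-ins for visual features
    language = ["c", "a", "t"]    # stand-ins for character embeddings
    joint = balanced_concat(vision, language)
    # During training, one mode could be sampled per example.
    print(mask_modality(joint, rng.choice(MODES)))
```

Under this reading, the iterative reasoning step described in the abstract would replace the language half with the previous round's prediction and run the same model again; that loop is omitted here because the abstract gives no detail on the decoder.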
Pages: 5885-5893
Page count: 9