Cascade Semantic Prompt Alignment Network for Image Captioning

Cited by: 3
Authors
Li, Jingyu [1]
Zhang, Lei [2]
Zhang, Kun [2]
Hu, Bo [2]
Xie, Hongtao [2]
Mao, Zhendong [1,3]
Affiliations
[1] Univ Sci & Technol China, Sch Cyberspace Sci & Technol, Hefei 230022, Peoples R China
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230022, Peoples R China
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230022, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Semantics; Visualization; Feature extraction; Detectors; Integrated circuit modeling; Transformers; Task analysis; Image captioning; textual-visual alignment; RegionCLIP; prompt; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3343520
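Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract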
Image captioning (IC) takes an image as input and generates open-form descriptions in natural language. IC requires detecting objects, modeling the relations between them, assessing the semantics of the scene, and representing the extracted knowledge in a language space. Previous detector-based models suffer from limited semantic perception capability due to predefined object detection classes and semantic inconsistency between visual region features and the numeric labels of the detector. Text prompts in pre-trained multi-modal models contain specific linguistic knowledge rather than discrete labels, and excel at open-form semantic understanding of visual inputs and their representation in natural language. Inspired by this, we aim to distill and leverage the transferable language knowledge from the pre-trained RegionCLIP model to remedy the detector's limitations and generate rich image captions. In this paper, we propose a novel Cascade Semantic Prompt Alignment Network (CSA-Net) to produce an aligned fine-grained regional semantic-visual space in which rich and consistent textual semantic details are automatically incorporated into region features. Specifically, we first align the object semantic prompt with region features to produce semantically grounded object features. Then, we employ these object features and the relation semantic prompt to predict the relations between objects. Finally, the enhanced object and relation features are fed into the language decoder to generate rich descriptions. Extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance, with CIDEr scores of 145.2% (single model) and 147.0% (ensemble of 4 models) on the 'Karpathy' split, and 141.6% (c5) and 144.1% (c40) on the official online test server. Significantly, CSA-Net generates captions of higher quality and diversity, achieving a RefCLIP-S score of 83.2. Moreover, we extend the evaluation to another challenging captioning benchmark, the nocaps dataset, where CSA-Net demonstrates superior zero-shot capability. Source code is released at https://github.com/CrossmodalGroup/CSA-Net.
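The first stage described in the abstract, aligning object semantic prompts with region features, can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional embeddings are toy values rather than RegionCLIP outputs, and the function names (`cosine`, `softmax`, `align_region`), the temperature of 0.1, and the additive fusion step are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(scores, temperature=0.1):
    """Numerically stable softmax over temperature-scaled scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def align_region(region, prompt_embeddings, temperature=0.1):
    """Return prompt weights and a semantically grounded region feature.

    Hypothetical fusion: the region feature plus the similarity-weighted
    average of the prompt embeddings (an assumption, not the paper's design).
    """
    sims = [cosine(region, p) for p in prompt_embeddings]
    weights = softmax(sims, temperature)
    semantic = [
        sum(w * p[i] for w, p in zip(weights, prompt_embeddings))
        for i in range(len(region))
    ]
    grounded = [r + s for r, s in zip(region, semantic)]
    return weights, grounded

# Toy prompt embeddings (hypothetical; a real system would use a text encoder).
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a ball": [0.0, 1.0, 0.0],
}
region = [0.9, 0.1, 0.0]  # toy region feature

weights, grounded = align_region(region, list(prompts.values()))
best = max(zip(weights, prompts), key=lambda pair: pair[0])[1]
print(best, [round(w, 4) for w in weights])
```

In this sketch the prompt whose embedding is closest to the region feature receives nearly all of the weight, so the grounded feature carries the semantics of that prompt instead of a discrete class label.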
Pages: 5266-5281
Page count: 16