Vision-Enhanced and Consensus-Aware Transformer for Image Captioning

Cited by: 28
Authors
Cao, Shan [1 ,2 ]
An, Gaoyun [1 ,2 ]
Zheng, Zhenxing [1 ,2 ]
Wang, Zhiyong [3 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
[3] Univ Sydney, Sch Comp Sci, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Visualization; Decoding; Semantics; Task analysis; Convolution; Visual perception; Image captioning; vision-enhanced encoder; consensus-aware decoder; consensus knowledge;
DOI
10.1109/TCSVT.2022.3178844
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Image captioning generates natural-language descriptions for a given image. Due to its great potential for a wide range of applications, many deep learning-based methods have been proposed. The co-occurrence of words, such as mouse and keyboard, constitutes commonsense knowledge, which is referred to as consensus. However, it is challenging to exploit such commonsense knowledge when producing captions with rich, natural, and meaningful semantics. In this paper, a Vision-enhanced and Consensus-aware Transformer (VCT) is proposed to exploit both visual information and consensus knowledge for image captioning with three key components: a vision-enhanced encoder, a consensus-aware knowledge representation generator, and a consensus-aware decoder. The vision-enhanced encoder extends the vanilla self-attention module with a memory-based attention module and a visual perception module to learn a better visual representation of an image. Specifically, the memory-based attention module leverages scene memory to model the relationships between image regions and the image's global context. The visual perception module further enhances the correlation among neighboring tokens along both the spatial and channel dimensions. To learn consensus-aware representations, a word correlation graph is constructed from the statistical co-occurrence of semantic concepts; consensus knowledge is then acquired with a graph convolutional network in the consensus-aware knowledge representation generator. Finally, this consensus knowledge is integrated into the consensus-aware decoder through consensus memory and a knowledge-based control module to produce a caption. Experimental results on two popular benchmark datasets (MSCOCO and Flickr30k) demonstrate that the proposed model achieves state-of-the-art performance, and extensive ablation studies validate the effectiveness of each component.
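The consensus-knowledge step described in the abstract is concrete enough to sketch. The Python fragment below is a minimal illustration (not the authors' released code) of the idea: build a word co-occurrence graph over semantic concepts from a caption corpus, symmetrically normalize it, and apply one graph-convolution step to concept embeddings to obtain consensus-aware representations. The toy concept list, the tiny corpus, the embedding width, and all variable names are assumptions made for illustration only.

# Minimal sketch of a consensus-knowledge generator, assuming a standard
# GCN propagation rule (Kipf & Welling): H' = ReLU(A_norm @ H @ W).
import numpy as np

# Hypothetical semantic concepts and a toy caption corpus.
concepts = ["mouse", "keyboard", "monitor", "dog"]
captions = [
    ["mouse", "keyboard", "monitor"],
    ["mouse", "keyboard"],
    ["dog"],
]

n = len(concepts)
idx = {w: i for i, w in enumerate(concepts)}

# Statistical co-occurrence counts between concepts within each caption.
A = np.zeros((n, n))
for cap in captions:
    for u in cap:
        for v in cap:
            if u != v:
                A[idx[u], idx[v]] += 1.0

# Add self-loops and apply symmetric degree normalization: D^-1/2 (A+I) D^-1/2.
A_hat = A + np.eye(n)
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

# One graph-convolution step over toy 8-d concept embeddings; in the real
# model H and W would be learned jointly with the captioning objective.
rng = np.random.default_rng(0)
H = rng.normal(size=(n, 8))
W = rng.normal(size=(8, 8))
consensus = np.maximum(A_norm @ H @ W, 0.0)
print(consensus.shape)  # (4, 8): one consensus-aware vector per concept

In the paper, vectors like these would be stored as consensus memory and fed to the decoder's knowledge-based control module; here the print statement simply confirms one output vector per concept.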
Pages: 7005-7018
Page count: 14