Learning Distinct and Representative Modes for Image Captioning

Cited by: 0
Authors
Chen, Qi [1 ]
Deng, Chaorui [1 ]
Wu, Qi [1 ]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, Australia
Keywords
DOI
N/A
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that only captures the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. Affected by it, the generated captions are limited in diversity and usually less informative than natural image descriptions made by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the generated captions for existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be simply modified from an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embedding successfully facilitates these models to generate high-quality image captions with different modes, further leading to better performance for both diversity and quality on the MSCOCO dataset.
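The two operations the abstract describes — a discrete nearest-neighbor assignment of a caption to one of the codebook's mode embeddings (CdVAE branch), and conditioning a captioner by adding the chosen mode embedding to its word embeddings (MIC branch) — can be illustrated with a minimal NumPy sketch. All names, dimensions, and the use of Euclidean distance for the codebook lookup are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K mode embeddings in the codebook, model dimension d.
K, d, seq_len = 8, 16, 5

codebook = rng.normal(size=(K, d))        # learned "mode embeddings"
caption_feat = rng.normal(size=(d,))      # pooled caption feature from an encoder

# CdVAE-style discrete assignment: map the caption to its nearest mode.
dists = np.linalg.norm(codebook - caption_feat, axis=1)
mode_id = int(np.argmin(dists))
mode_emb = codebook[mode_id]

# MIC-style conditioning: add the selected mode embedding to every
# word embedding of the caption as the control signal.
word_embs = rng.normal(size=(seq_len, d))
conditioned = word_embs + mode_emb        # broadcasts over the sequence

assert conditioned.shape == (seq_len, d)
```

At inference time, sweeping `mode_id` over the codebook would yield one caption per mode, which is how a single captioner can produce diverse outputs from one image.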
Pages: 14
Related Papers
50 items in total
  • [41] Deep learning-based solar image captioning
    Baek, Ji-Hye
    Kim, Sujin
    Choi, Seonghwan
    Park, Jongyeob
    Kim, Dongil
    ADVANCES IN SPACE RESEARCH, 2024, 73 (06) : 3270 - 3281
  • [42] Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
    Yao, Ting
    Pan, Yingwei
    Li, Yehao
    Mei, Tao
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 5263 - 5271
  • [43] Temporal-Difference Learning with Sampling Baseline for Image Captioning
    Chen, Hui
    Ding, Guiguang
    Zhao, Sicheng
    Han, Jungong
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 6706 - 6713
  • [44] Image Captioning using Deep Learning: A Systematic Literature Review
    Chohan, Murk
    Khan, Adil
    Mahar, Muhammad Saleem
    Hassan, Saif
    Ghafoor, Abdul
    Khan, Mehmood
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (05) : 278 - 286
  • [45] Discriminative Style Learning for Cross-Domain Image Captioning
    Yuan, Jin
    Zhu, Shuai
    Huang, Shuyin
    Zhang, Hanwang
    Xiao, Yaoqiang
    Li, Zhiyong
    Wang, Meng
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 1723 - 1736
  • [46] Local-to-Global Semantic Supervised Learning for Image Captioning
    Wang, Juan
    Duan, Yiping
    Tao, Xiaoming
    Lu, Jianhua
    ICC 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2020,
  • [47] Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning
    Honda, Ukyo
    Watanabe, Taro
    Matsumoto, Yuji
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1124 - 1134
  • [48] Learning Double-Level Relationship Networks for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (03)
  • [49] Application of human computing in image captioning under deep learning
    Zeng, Zhihong
    Li, Xiaowen
    Microsystem Technologies, 2021, 27 : 1687 - 1692
  • [50] "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention
    Chen, Tianlang
    Zhang, Zhongping
    You, Quanzeng
    Fang, Chen
    Wang, Zhaowen
    Jin, Hailin
    Luo, Jiebo
    COMPUTER VISION - ECCV 2018, PT X, 2018, 11214 : 527 - 543