Learning Combinatorial Prompts for Universal Controllable Image Captioning

Authors
Wang, Zhen [1]
Xiao, Jun [1]
Zhuang, Yueting [1]
Gao, Fei [2]
Shao, Jian [1]
Chen, Long [3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ Technol, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Controllable image captioning (CIC); Prompt learning; Pretrained model
DOI
10.1007/s11263-024-02179-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Controllable Image Captioning (CIC), which generates natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one particular control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC by learning Combinatorial Prompts, dubbed ComPro. Specifically, we directly utilize a pretrained language model, GPT-2 (Radford et al., OpenAI blog 1:9, 2019), as our language model, which helps bridge the gap between different signal-specific CIC architectures. Then, we reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize prompt-based CIC. Due to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks have verified the effectiveness and efficiency of ComPro on both single and combined control signals.
Pages: 129-150
Page count: 22