Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited: 0
Authors
Su, Jindian [1 ]
Mou, Yueqi [1 ]
Xie, Yunhao [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Keywords
Image Captioning; Two-Pass Decoding; Transformer;
DOI
10.1007/978-981-97-5663-6_15
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models; this practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder, reducing cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
Pages: 171-183
Page count: 13
Related Papers
50 records in total
  • [41] Multiple-Symbol Interleaved RS Codes and Two-Pass Decoding Algorithm
    Wang Zhongfeng
    Chini, Ahmad
    Kilani, Mehdi T.
    Zhou Jun
    CHINA COMMUNICATIONS, 2016, 13 (04) : 14 - 19
  • [42] Swin-Caption: Swin Transformer-Based Image Captioning with Feature Enhancement and Multi-Stage Fusion
    Liu, Lei
    Jiao, Yidi
    Li, Xiaoran
    Li, Jing
    Wang, Haitao
    Cao, Xinyu
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2024,
  • [43] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [44] SCAP: enhancing image captioning through lightweight feature sifting and hierarchical decoding
    Zhang, Yuhao
    Tong, Jiaqi
    Liu, Honglin
    VISUAL COMPUTER, 2025,
  • [45] Transformer-Based Unified Neural Network for Quality Estimation and Transformer-Based Re-decoding Model for Machine Translation
    Chen, Cong
    Zong, Qinqin
    Luo, Qi
    Qiu, Bailian
    Li, Maoxi
    MACHINE TRANSLATION, CCMT 2020, 2020, 1328 : 66 - 75
  • [46] A Transformer-Based Framework for Tiny Object Detection
    Liao, Yi-Kai
    Lin, Gong-Si
    Yeh, Mei-Chen
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 373 - 377
  • [47] A transformer-based adversarial network framework for steganography
    Xiao, Chaoen
    Peng, Sirui
    Zhang, Lei
    Wang, Jianxin
    Ding, Ding
    Zhang, Jianyi
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 269
  • [48] A Transformer-Based Framework for Geomagnetic Activity Prediction
    Abduallah, Yasser
    Wang, Jason T. L.
    Xu, Chunhui
    Wang, Haimin
    FOUNDATIONS OF INTELLIGENT SYSTEMS (ISMIS 2022), 2022, 13515 : 325 - 335
  • [49] A transformer-based framework for enterprise sales forecasting
    Sun, Yupeng
    Li, Tian
    PEERJ COMPUTER SCIENCE, 2024, 10 : 1 - 14
  • [50] HEAD-SYNCHRONOUS DECODING FOR TRANSFORMER-BASED STREAMING ASR
    Li, Mohan
    Zorila, Catalin
    Doddipatla, Rama
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5909 - 5913