Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited by: 0
Authors
Su, Jindian [1]
Mou, Yueqi [1]
Xie, Yunhao [2]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Keywords
Image Captioning; Two-Pass Decoding; Transformer
DOI
10.1007/978-981-97-5663-6_15
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models. This practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we fix the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically improve the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
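The parameter-sharing and freezing scheme described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: `Param`, `build_deliberation_params`, `two_pass_decode`, the parameter names, and the toy decoders are all hypothetical stand-ins chosen for illustration, and the `trainable` flag only mimics what a deep-learning framework would do when freezing weights.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Param:
    """Stand-in for a weight tensor; `trainable` mimics a requires-grad flag."""
    value: float
    trainable: bool = True


def build_deliberation_params(draft: Dict[str, Param],
                              new_names: List[str]) -> Dict[str, Param]:
    """Reuse the draft decoder's parameters by reference and freeze them,
    then add a small set of new trainable parameters (assumed names)."""
    params = dict(draft)            # shallow copy: the Param objects are shared
    for p in params.values():
        p.trainable = False         # freeze everything inherited from the baseline
    for name in new_names:
        params[name] = Param(0.0)   # newly added parameters stay trainable
    return params


def two_pass_decode(image_features: List[float],
                    draft_decoder: Callable,
                    deliberation_decoder: Callable) -> List[str]:
    """First pass writes a draft; second pass refines it using image and draft."""
    draft_caption = draft_decoder(image_features)
    return deliberation_decoder(image_features, draft_caption)


# Hypothetical parameter names, for illustration only.
draft_params = {"self_attn": Param(0.3), "cross_attn": Param(0.7)}
delib_params = build_deliberation_params(draft_params, ["cmf_fusion"])
trainable = [name for name, p in delib_params.items() if p.trainable]

# Toy decoders: the second pass appends one word to the draft.
caption = two_pass_decode(
    [0.1, 0.2],
    lambda feats: ["a", "dog"],
    lambda feats, draft: draft + ["running"],
)
```

In this sketch only `cmf_fusion` remains trainable, while the shared entries are the very same objects the draft decoder uses, which is one simple way to read "extensively sharing parameters" while updating "only a small subset".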
Pages: 171-183
Page count: 13