Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited: 0
Authors
Su, Jindian [1 ]
Mou, Yueqi [1 ]
Xie, Yunhao [2 ]
Affiliations
[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
Keywords
Image Captioning; Two-Pass Decoding; Transformer;
DOI
10.1007/978-981-97-5663-6_15
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch, missing the opportunity to fully leverage pre-trained knowledge from single-pass models; this practice increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we freeze the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder, reducing cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
Pages: 171-183
Page count: 13
Related Papers
50 records in total
  • [41] Multiple-Symbol Interleaved RS Codes and Two-Pass Decoding Algorithm
    Wang Zhongfeng
    Chini, Ahmad
    Kilani, Mehdi T.
    Zhou Jun
    CHINA COMMUNICATIONS, 2016, 13 (04) : 14 - 19
  • [42] Swin-Caption: Swin Transformer-Based Image Captioning with Feature Enhancement and Multi-Stage Fusion
    Liu, Lei
    Jiao, Yidi
    Li, Xiaoran
    Li, Jing
    Wang, Haitao
    Cao, Xinyu
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2024,
  • [43] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [44] SCAP: enhancing image captioning through lightweight feature sifting and hierarchical decoding
    Zhang, Yuhao
    Tong, Jiaqi
    Liu, Honglin
    VISUAL COMPUTER, 2025,
  • [45] Transformer-Based Unified Neural Network for Quality Estimation and Transformer-Based Re-decoding Model for Machine Translation
    Chen, Cong
    Zong, Qinqin
    Luo, Qi
    Qiu, Bailian
    Li, Maoxi
    MACHINE TRANSLATION, CCMT 2020, 2020, 1328 : 66 - 75
  • [46] A Transformer-Based Framework for Tiny Object Detection
    Liao, Yi-Kai
    Lin, Gong-Si
    Yeh, Mei-Chen
    2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 373 - 377
  • [47] A transformer-based adversarial network framework for steganography
    Xiao, Chaoen
    Peng, Sirui
    Zhang, Lei
    Wang, Jianxin
    Ding, Ding
    Zhang, Jianyi
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 269
  • [48] A Transformer-Based Framework for Geomagnetic Activity Prediction
    Abduallah, Yasser
    Wang, Jason T. L.
    Xu, Chunhui
    Wang, Haimin
    FOUNDATIONS OF INTELLIGENT SYSTEMS (ISMIS 2022), 2022, 13515 : 325 - 335
  • [49] A transformer-based framework for enterprise sales forecasting
    Sun, Yupeng
    Li, Tian
    PEERJ COMPUTER SCIENCE, 2024, 10 : 1 - 14
  • [50] HEAD-SYNCHRONOUS DECODING FOR TRANSFORMER-BASED STREAMING ASR
    Li, Mohan
    Zorila, Catalin
    Doddipatla, Rama
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5909 - 5913