Enhancing Image Captioning with Transformer-Based Two-Pass Decoding Framework

Cited by: 0

Authors:

Su, Jindian [1]

Mou, Yueqi [1]

Xie, Yunhao [2]

Affiliations:

[1] South China Univ Technol, Sch Comp Sci & Engn, Guangzhou, Peoples R China

[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024 | 2024, vol. 14875

Keywords:

Image Captioning; Two-Pass Decoding; Transformer;

DOI:

10.1007/978-981-97-5663-6_15

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The two-pass decoding framework significantly enhances image captioning models. However, existing two-pass models are often trained from scratch and so fail to fully leverage the pre-trained knowledge of single-pass models, which increases training cost and complexity. In this paper, we propose a unified two-pass decoding framework comprising three core modules: a pre-trained Visual Encoder, a pre-trained Draft Decoder, and a Deliberation Decoder. To enable effective information alignment and complementation between the image and the draft caption, we design a Cross-Modality Fusion (CMF) module in the Deliberation Decoder, forming a Cross-Modality Fusion-based Deliberation Decoder (CMF-DD). During training, we facilitate the transfer of foundational knowledge by extensively sharing parameters between the Draft and Deliberation Decoders. At the same time, we fix the parameters inherited from the single-pass baseline and update only a small subset within the Deliberation Decoder to reduce cost and complexity. Additionally, we introduce a Dominance-Adaptive reward scoring algorithm in the reinforcement learning stage to specifically enhance the quality of refinements. Experiments on the MS COCO dataset demonstrate that our method achieves substantial improvements over single-pass decoding baselines and competes favorably with other two-pass decoding methods.
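
The training setup described in the abstract (share parameters between the Draft and Deliberation Decoders, freeze what comes from the single-pass baseline, update only a small subset) can be illustrated with a toy sketch. This is not the authors' implementation: the `Module` class, the module names, the plain SGD step, and the choice of the fusion weights as the trainable subset are all assumptions made for illustration, and no real network is involved.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Module:
    """Toy stand-in for a network module: named weights plus a trainable flag."""
    name: str
    weights: Dict[str, List[float]]
    trainable: bool = True


def build_two_pass(visual_encoder: Module, draft_decoder: Module) -> Dict[str, Module]:
    """Assemble the modules, freezing everything inherited from the baseline."""
    # Freeze the pre-trained single-pass baseline.
    visual_encoder.trainable = False
    draft_decoder.trainable = False

    # Shared layers: the deliberation side references the same weight dict as
    # the Draft Decoder (sharing, not copying), so these stay frozen as well.
    shared = Module("deliberation.shared", draft_decoder.weights, trainable=False)

    # Assumption: the small trainable subset is the cross-modality fusion part.
    cmf = Module("deliberation.cmf", {"fusion": [0.0, 0.0]}, trainable=True)

    return {m.name: m for m in (visual_encoder, draft_decoder, shared, cmf)}


def sgd_step(modules: Dict[str, Module],
             grads: Dict[str, Dict[str, List[float]]],
             lr: float = 0.1) -> None:
    """Apply a plain SGD update, skipping every frozen module."""
    for name, module in modules.items():
        if not module.trainable:
            continue
        for key, grad in grads.get(name, {}).items():
            module.weights[key] = [w - lr * g for w, g in zip(module.weights[key], grad)]


encoder = Module("visual_encoder", {"proj": [1.0, 2.0]})
draft = Module("draft_decoder", {"attn": [0.5, -0.5]})
model = build_two_pass(encoder, draft)

# Hypothetical gradients of 1.0 for every weight in every module.
grads = {name: {k: [1.0] * len(v) for k, v in m.weights.items()}
         for name, m in model.items()}
sgd_step(model, grads)

trainable_names = [name for name, m in model.items() if m.trainable]
print(trainable_names)
```

In this sketch only the module flagged as trainable changes after the update step, while the encoder, the Draft Decoder, and the shared deliberation layers keep their baseline values.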

Pages: 171-183

Page count: 13
