Towards local visual modeling for image captioning

Cited by: 49
Authors
Ma, Yiwei [1]
Ji, Jiayi [1]
Sun, Xiaoshuai [1,2,4]
Zhou, Yiyi [1]
Ji, Rongrong [1,2,3]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
[4] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Room B705, Haiyu Adm Bldg, XMU Haiyun Campus, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Attention mechanism; Local visual modeling;
DOI
10.1016/j.patcog.2023.109420
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion, aggregating the information of different encoder layers for cross-layer semantic complementarity. With these two novel designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a number of state-of-the-art captioning models in offline and online testing, i.e., 134.8 CIDEr and 136.3 CIDEr, respectively. Moreover, the generalization of LSTNet is verified on the Flickr8k and Flickr30k datasets. The source code is available on GitHub: https://www.github.com/xmu-xiaoma666/LSTNet. © 2023 Elsevier Ltd. All rights reserved.
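To make the idea behind LSA concrete, the sketch below shows one way local-neighborhood attention over grid features could look: each grid position attends only to its k x k spatial neighborhood rather than to all grids. This is a minimal, hypothetical illustration assuming a PyTorch setting; the class and parameter names (LocalGridAttention, window_size) are invented here for exposition and do not come from the paper, whose actual implementation is available at the GitHub link above.

```python
# Minimal sketch of local (neighborhood-restricted) attention over grid features.
# Assumption: grid features arrive as a (B, H, W, C) tensor from a visual backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGridAttention(nn.Module):
    def __init__(self, dim, window_size=3):
        super().__init__()
        self.window_size = window_size           # k: each grid sees a k x k neighborhood
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, grids):                    # grids: (B, H, W, C)
        B, H, W, C = grids.shape
        q = self.q_proj(grids)                            # (B, H, W, C)
        k = self.k_proj(grids).permute(0, 3, 1, 2)        # (B, C, H, W)
        v = self.v_proj(grids).permute(0, 3, 1, 2)

        # Gather each grid's k x k neighborhood with unfold (zero padding at the border;
        # padded neighbors are not masked, kept simple for illustration).
        pad = self.window_size // 2
        k_win = F.unfold(k, self.window_size, padding=pad)        # (B, C*k*k, H*W)
        v_win = F.unfold(v, self.window_size, padding=pad)
        k_win = k_win.view(B, C, self.window_size ** 2, H * W).permute(0, 3, 2, 1)
        v_win = v_win.view(B, C, self.window_size ** 2, H * W).permute(0, 3, 2, 1)

        q = q.view(B, H * W, 1, C)                                # one query per grid
        attn = (q * self.scale) @ k_win.transpose(-2, -1)         # (B, H*W, 1, k*k)
        attn = attn.softmax(dim=-1)
        out = (attn @ v_win).view(B, H, W, C)                     # weighted sum of neighbors
        return out

# Example usage with hypothetical sizes: a 7x7 grid of 512-d features.
layer = LocalGridAttention(dim=512, window_size=3)
out = layer(torch.randn(2, 7, 7, 512))          # -> (2, 7, 7, 512)
```

Restricting each query to its spatial neighborhood is the design choice that distinguishes this from standard global self-attention over flattened grids; the paper's LSF module, by contrast, operates across encoder layers and is not sketched here.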
Pages: 12
Related Papers (50 total)
  • [31] Visual contextual relationship augmented transformer for image captioning
    Qiang Su
    Junbo Hu
    Zhixin Li
    Applied Intelligence, 2024, 54 : 4794 - 4813
  • [32] Visual News: Benchmark and Challenges in News Image Captioning
    Liu, Fuxiao
    Wang, Yinghan
    Wang, Tianlu
    Ordonez, Vicente
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6761 - 6771
  • [33] Aligned visual semantic scene graph for image captioning
    Zhao, Shanshan
    Li, Lixiang
    Peng, Haipeng
    DISPLAYS, 2022, 74
  • [34] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 49 (01) : 177 - 185
  • [35] DIFNet: Boosting Visual Information Flow for Image Captioning
    Wu, Mingrui
    Zhang, Xuying
    Sun, Xiaoshuai
    Zhou, Yiyi
    Chen, Chao
    Gu, Jiaxin
    Sun, Xing
    Ji, Rongrong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17999 - 18008
  • [36] Combine Visual Features and Scene Semantics for Image Captioning
    Li Z.-X.
    Wei H.-Y.
    Huang F.-C.
    Zhang C.-L.
    Ma H.-F.
    Shi Z.-Z.
    Science Press, 43: 1624 - 1640
  • [37] Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
    Lee, Hojun
    Cho, Hyunjun
    Park, Jieun
    Chae, Jinyeong
    Kim, Jihie
    SENSORS, 2022, 22 (04)
  • [38] Towards Unsupervised Image Captioning with Shared Multimodal Embeddings
    Laina, Iro
    Rupprecht, Christian
    Navab, Nassir
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7413 - 7423
  • [39] Towards Retrieval-Augmented Architectures for Image Captioning
    Sarto, Sara
    Cornia, Marcella
    Baraldi, Lorenzo
    Nicolosi, Alessandro
    Cucchiara, Rita
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (08)
  • [40] SubICap: Towards Subword-informed Image Captioning
    Sharif, Naeha
    Bennamoun, Mohammed
    Liu, Wei
    Shah, Syed Afaq Ali
    2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 3539 - 3548