On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval

Cited by: 8
|
Authors
Gong, Yan [1 ]
Cosma, Georgina [1 ]
Fang, Hui [1 ]
Affiliations
[1] Loughborough Univ, Sch Sci, Dept Comp Sci, Loughborough LE11 3TT, Leics, England
Keywords
visual-semantic embedding network; multi-modal deep learning; cross-modal; information retrieval;
DOI
10.3390/jimaging7080125
CLC Number
TB8 [Photographic Technology];
Discipline Code
0804;
Abstract
Visual-semantic embedding (VSE) networks create joint image-text representations that map images and texts into a shared embedding space, enabling various information retrieval-related tasks such as image-text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of these VSE networks on the task of image-to-text retrieval, and identifies and analyses their strengths and limitations to guide future research on the topic. The experimental results on Flickr30K revealed that the pre-trained network, UNITER, achieved 61.5% average Recall@5 on the task of retrieving all relevant descriptions. The traditional networks, VSRN, SCAN, and VSE++, achieved 50.3%, 47.1%, and 29.4% average Recall@5, respectively, on the same task. An additional analysis was performed on image-text pairs from the top 25 worst-performing classes, using a subset of the Flickr30K-based dataset, to identify the limitations of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks. This paper discusses the strengths and limitations of VSE networks to guide further research into using VSE networks for cross-modal information retrieval tasks.
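The evaluation metric above, average Recall@5 over all relevant descriptions, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes a precomputed image-to-text similarity matrix (e.g. cosine scores between VSE embeddings) and a hypothetical `relevant` mapping from each image to the indices of its ground-truth captions (five per image in Flickr30K).

```python
import numpy as np

def recall_at_k(sim, relevant, k=5):
    """Average Recall@k for image-to-text retrieval.

    sim:      (n_images, n_texts) similarity matrix, e.g. cosine scores
              between image and text embeddings from a VSE network.
    relevant: sequence mapping each image index to the set of indices
              of its ground-truth descriptions (hypothetical format).
    """
    recalls = []
    for i, rel in enumerate(relevant):
        # Indices of the k highest-scoring texts for image i.
        top_k = np.argsort(-sim[i])[:k]
        # Fraction of this image's relevant texts found in the top k.
        hits = len(set(top_k) & set(rel))
        recalls.append(hits / len(rel))
    return float(np.mean(recalls))

# Toy example: 2 images, 4 texts; image 0 matches texts {0, 1}, image 1 matches {3}.
sim = np.array([[0.9, 0.8, 0.1, 0.2],
                [0.1, 0.3, 0.2, 0.7]])
print(recall_at_k(sim, [{0, 1}, {3}], k=2))  # → 1.0
```

Averaging the per-image fraction of retrieved relevant captions (rather than counting a single hit) matches the "retrieving all relevant descriptions" framing of the task, where each image has multiple correct captions.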
Pages: 15