On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval

Cited by: 8
|
Authors
Gong, Yan [1 ]
Cosma, Georgina [1 ]
Fang, Hui [1 ]
Affiliations
[1] Loughborough Univ, Sch Sci, Dept Comp Sci, Loughborough LE11 3TT, Leics, England
Keywords
visual-semantic embedding network; multi-modal deep learning; cross-modal; information retrieval;
DOI
10.3390/jimaging7080125
CLC Number
TB8 [Photographic Technology];
Discipline Code
0804;
Abstract
Visual-semantic embedding (VSE) networks create joint image-text representations that map images and texts into a shared embedding space, enabling various information retrieval-related tasks such as image-text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of these VSE networks on the task of image-to-text retrieval, and identifies and analyses their strengths and limitations to guide future research on the topic. The experimental results on Flickr30K revealed that the pre-trained network, UNITER, achieved 61.5% average Recall@5 on the task of retrieving all relevant descriptions. The traditional networks, VSRN, SCAN, and VSE++, achieved 50.3%, 47.1%, and 29.4% average Recall@5, respectively, on the same task. An additional analysis was performed on image-text pairs from the top 25 worst-performing classes, using a subset of the Flickr30K-based dataset, to identify the limitations of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks. This paper discusses the strengths and limitations of VSE networks to guide further research into using VSE networks for cross-modal information retrieval tasks.
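The evaluation metric above, average Recall@5 over all relevant descriptions, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes a precomputed image-to-text similarity matrix (e.g. cosine scores between VSE embeddings) and a hypothetical `relevant` mapping from each image to the indices of its ground-truth captions (five per image in Flickr30K).

```python
import numpy as np

def recall_at_k(sim, relevant, k=5):
    """Average Recall@k for image-to-text retrieval.

    sim:      (n_images, n_texts) similarity matrix, e.g. cosine scores
              between image and text embeddings from a VSE network.
    relevant: sequence mapping each image index to the set of indices
              of its ground-truth descriptions (hypothetical format).
    """
    recalls = []
    for i, rel in enumerate(relevant):
        # Indices of the k highest-scoring texts for image i.
        top_k = np.argsort(-sim[i])[:k]
        # Fraction of this image's relevant texts found in the top k.
        hits = len(set(top_k) & set(rel))
        recalls.append(hits / len(rel))
    return float(np.mean(recalls))

# Toy example: 2 images, 4 texts; image 0 matches texts {0, 1}, image 1 matches {3}.
sim = np.array([[0.9, 0.8, 0.1, 0.2],
                [0.1, 0.3, 0.2, 0.7]])
print(recall_at_k(sim, [{0, 1}, {3}], k=2))  # → 1.0
```

Averaging the per-image fraction of retrieved relevant captions (rather than counting a single hit) matches the "retrieving all relevant descriptions" framing of the task, where each image has multiple correct captions.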
Pages: 15