Understanding, Categorizing and Predicting Semantic Image-Text Relations

Cited by: 15
Authors
Otto, Christian [1]
Springstein, Matthias [1]
Anand, Avishek [2]
Ewerth, Ralph [3]
Affiliations
[1] Leibniz Informat Ctr Sci & Technol TIB, Hannover, Germany
[2] Leibniz Univ Hannover, L3S Res Ctr, Hannover, Germany
[3] Leibniz Univ Hannover, L3S Res Ctr, Leibniz Informat Ctr Sci & Technol TIB, Hannover, Germany
Keywords
Image-text class; multimodality; data augmentation; semantic gap
DOI
10.1145/3323873.3325049
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images, as well as their interplay, has great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
Pages: 168-176
Number of pages: 9