Estimating the information gap between textual and visual representations

Cited by: 8
Authors
Henning, Christian [1 ,2 ,4 ]
Ewerth, Ralph [1 ,2 ,3 ]
Affiliations
[1] Leibniz Univ Hannover, Inst Distributed Syst, Hannover, Germany
[2] Leibniz Univ Hannover, Res Ctr L3S, Hannover, Germany
[3] Leibniz Informat Ctr Sci & Technol TIB, Res Grp Visual Analyt, Dept Res & Dev, Hannover, Germany
[4] Swiss Fed Inst Technol, Inst Neuroinformat, Zurich, Switzerland
Keywords
Text-image relations; Multimodal embeddings; Deep learning; Visual/verbal divide;
DOI
10.1007/s13735-017-0142-y
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides supplement an oral presentation, and photographs, drawings, or figures complement textual information in online news and scientific publications. However, the ways in which different modalities are used and interrelated can be quite diverse. Sometimes the transfer of information or knowledge may even be hindered, for instance, by contradictory information. The variety of possible interrelations between textual and visual information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures that describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, we propose two novel deep learning systems to estimate the CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then used to train classifiers for SC and CMI; an advantage of this approach is that only a small set of labeled training examples is required for the supervised learning step. Third, three large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image-text pairs so that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications of the proposed system and outline areas for future work.
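The pipeline sketched in the abstract (a multimodal autoencoder that maps both modalities into a joint embedding space, followed by a small supervised head for a cross-modal label such as SC) can be illustrated roughly as follows. This is a minimal sketch, not the authors' implementation: all dimensions, the untrained linear encoders, and the logistic head are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
D_IMG, D_TXT, D_JOINT = 512, 300, 128

# Encoder weights of a (here untrained) multimodal autoencoder:
# both modalities are projected into one shared embedding space.
W_img = rng.normal(scale=0.02, size=(D_IMG, D_JOINT))
W_txt = rng.normal(scale=0.02, size=(D_TXT, D_JOINT))

def embed(img_feat: np.ndarray, txt_feat: np.ndarray) -> np.ndarray:
    """Map an image/text feature pair onto the joint embedding space."""
    z = np.tanh(img_feat @ W_img) + np.tanh(txt_feat @ W_txt)
    return z / (np.linalg.norm(z) + 1e-8)  # unit-normalized embedding

# A small supervised head (here: logistic regression) predicts a
# cross-modal label, e.g. semantic correlation (SC), from the embedding.
# Because the embedding is pretrained, few labeled pairs are needed.
w_sc, b_sc = rng.normal(size=D_JOINT), 0.0

def predict_sc(img_feat: np.ndarray, txt_feat: np.ndarray) -> float:
    z = embed(img_feat, txt_feat)
    return 1.0 / (1.0 + np.exp(-(z @ w_sc + b_sc)))

score = predict_sc(rng.normal(size=D_IMG), rng.normal(size=D_TXT))
print(score)  # a probability between 0 and 1
```

In the paper's setting, the encoder weights would come from autoencoder training on large unlabeled image-text collections, and only the small head would be fit on labeled SC/CMI examples.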
Pages: 43-56
Page count: 14
Related papers
(showing 10 of 50; one duplicate entry removed)
  • [1] Estimating the information gap between textual and visual representations
    Henning, Christian
    Ewerth, Ralph
    International Journal of Multimedia Information Retrieval, 2018, 7 : 43 - 56
  • [2] Estimating the Information Gap between Textual and Visual Representations
    Henning, Christian
    Ewerth, Ralph
    PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR'17), 2017, : 14 - 22
  • [3] Textual Representations for Crosslingual Information Retrieval
    Zhang, Bryan
    Tan, Liling
    ECNLP 4: THE FOURTH WORKSHOP ON E-COMMERCE AND NLP, 2021, : 116 - 122
  • [4] Synchronization of Textual and Visual Representations of Evolving Information in the Context of Model-Based Development
    Angyal, Laszlo
    Lengyel, Laszlo
    EUROCON 2009: INTERNATIONAL IEEE CONFERENCE DEVOTED TO THE 150 ANNIVERSARY OF ALEXANDER S. POPOV, VOLS 1-4, PROCEEDINGS, 2009, : 420 - 425
  • [5] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [6] Multipage Document Retrieval by Textual and Visual Representations
    Rusinol, Marcal
    Karatzas, Dimosthenis
    Bagdanov, Andrew D.
    Llados, Josep
    2012 21ST INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR 2012), 2012, : 521 - 524
  • [7] VirTex: Learning Visual Representations from Textual Annotations
    Desai, Karan
    Johnson, Justin
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 11157 - 11168
  • [8] Evaluating Multimodal Representations on Visual Semantic Textual Similarity
    de Lacalle, Oier Lopez
    Salaberria, Ander
    Soroa, Aitor
    Azkune, Gorka
    Agirre, Eneko
    ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1990 - 1997
  • [9] Textual and Visual Representations of Power and Justice in Medieval France
    Tarnowski, Andrea
    ENGLISH HISTORICAL REVIEW, 2017, 132 (558) : 1310 - 1312