ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

Cited by: 0
Authors
Wu, Guojun [1]
Affiliations
[1] Univ Zurich, Dept Computat Linguist, Zurich, Switzerland
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual captions for images has hindered this development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Since multilingual text data is of relatively higher abundance and quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
Pages: 14740-14746 (7 pages)
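The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal structural illustration, not the paper's implementation: `caption_image` and `classify_pair` are hypothetical placeholders standing in for, respectively, an English image-captioning V&L model and a multilingual language model (mLM) performing cross-lingual understanding (e.g. natural language inference) against the caption used as alt text.

```python
# Hedged sketch of the ICU (Image Caption Understanding) two-stage pipeline.
# Both model calls are stubs: a real system would invoke trained models here.

def caption_image(image) -> str:
    # Stage 1 (V&L model): describe the image in English only.
    # Placeholder caption standing in for a real captioning model's output.
    return "a dog running on the beach"

def classify_pair(caption: str, multilingual_text: str) -> str:
    # Stage 2 (mLM): treat the English caption as the image's alt text and
    # judge the multilingual text against it (cross-lingual understanding).
    # Toy keyword rule standing in for a real cross-lingual NLI model.
    if "dog" in caption and "Hund" in multilingual_text:
        return "entailment"
    return "neutral"

def icu_predict(image, multilingual_text: str) -> str:
    caption = caption_image(image)                    # multilingual burden lifted off V&L model
    return classify_pair(caption, multilingual_text)  # and placed on the mLM

# Example: a German hypothesis is checked against the English caption.
print(icu_predict(image=None, multilingual_text="Ein Hund läuft am Strand."))
```

The division of labor is the point: the V&L model never sees non-English text, so abundant multilingual text corpora (rather than scarce multilingual image captions) carry the cross-lingual capability.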