ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

Cited by: 0
Authors
Wu, Guojun [1]
Affiliations
[1] Univ Zurich, Dept Computat Linguist, Zurich, Switzerland
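Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract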
Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual captions for images has hindered this development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs cross-lingual language understanding. The burden of multilingual processing is lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
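The two-stage split described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: `icu_predict`, `stub_captioner`, `stub_mlm`, and `TOY_LEXICON` are hypothetical names, and both stages are toy stand-ins (a fixed caption and a word-overlap check over a hand-written lexicon) for a real English captioning model and a real multilingual language model. The entailment-style labels are likewise only an assumption for illustration.

```python
from typing import Callable

# Hypothetical interfaces for the two stages (illustrative names only).
Captioner = Callable[[bytes], str]          # image bytes -> English caption
TextClassifier = Callable[[str, str], str]  # (English caption, query) -> label


def icu_predict(image: bytes, query: str,
                captioner: Captioner, mlm: TextClassifier) -> str:
    """Stage 1: caption the image in English. Stage 2: let the
    multilingual text model judge the non-English query against it."""
    caption = captioner(image)   # the only step that looks at the image
    return mlm(caption, query)   # text-only, cross-lingual step


def stub_captioner(image: bytes) -> str:
    # Stand-in for a V&L captioning model: always returns the same caption.
    return "A dog runs on the beach."


# Toy German -> English lexicon standing in for real cross-lingual ability.
TOY_LEXICON = {"hund": "dog", "strand": "beach", "katze": "cat"}


def stub_mlm(caption: str, query: str) -> str:
    # Stand-in for an mLM: map known query words to English, then check
    # that every known content word also appears in the caption.
    caption_words = set(caption.lower().strip(".").split())
    query_words = {TOY_LEXICON.get(w, w)
                   for w in query.lower().strip(".").split()}
    content = query_words & set(TOY_LEXICON.values())
    if content and content <= caption_words:
        return "entailment"
    return "contradiction"


if __name__ == "__main__":
    print(icu_predict(b"", "Ein Hund am Strand.", stub_captioner, stub_mlm))
    print(icu_predict(b"", "Eine Katze am Strand.", stub_captioner, stub_mlm))
```

The point of the sketch is the interface: the second stage never sees the image, so any text-only multilingual model could be swapped in without retraining the captioner.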
Pages: 14740-14746
Page count: 7