ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

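Author
Wu, Guojun [1]
Affiliation
[1] Univ Zurich, Dept Computat Linguist, Zurich, Switzerland
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract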
Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual captions for images has hindered progress. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
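The two-stage split described in the abstract can be illustrated with a minimal sketch. Everything named here (`ICUPipeline`, `Captioner`, `Understander`, and the two toy stand-in functions) is hypothetical and is not taken from the paper; a real system would plug in an English image-captioning V&L model for stage 1 and a multilingual language model for stage 2.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces, assumed for illustration only.
Captioner = Callable[[bytes], str]        # image bytes -> English caption
Understander = Callable[[str, str], str]  # (English caption, target-language text) -> label


@dataclass
class ICUPipeline:
    """Two-stage sketch: English captioning, then cross-lingual understanding."""
    captioner: Captioner
    understander: Understander

    def predict(self, image: bytes, text: str) -> str:
        # Stage 1: the V&L model only ever produces English text.
        caption = self.captioner(image)
        # Stage 2: the mLM treats the caption as alt text for the image
        # and handles the target-language input on its own.
        return self.understander(caption, text)


def toy_captioner(image: bytes) -> str:
    # Stand-in for a captioning model; ignores the image entirely.
    return "a dog runs on the beach"


def toy_understander(caption: str, text: str) -> str:
    # Stand-in for an mLM; a keyword check, not real language understanding.
    return "entailment" if "dog" in caption and "Hund" in text else "neutral"


pipeline = ICUPipeline(captioner=toy_captioner, understander=toy_understander)
print(pipeline.predict(b"", "Ein Hund läuft am Strand."))
```

Because the two stages communicate only through an English caption string, either component could be swapped independently under this design.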
Pages: 14740-14746
Page count: 7