ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

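Authors
Wu, Guojun [1]
Affiliations
[1] Univ Zurich, Dept Computat Linguist, Zurich, Switzerland
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract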
Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within one model. However, the scarcity of multilingual captions for images has hindered this development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs cross-lingual language understanding. The burden of multilingual processing is lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
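
The two-stage split described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the names `icu_predict`, `toy_captioner`, and `toy_classifier` are hypothetical, and the two toy functions are hard-coded stand-ins for a real English captioning model and a real multilingual language model.

```python
from typing import Callable

# Hypothetical type aliases; neither name comes from the paper.
Captioner = Callable[[bytes], str]          # image bytes -> English caption
TextClassifier = Callable[[str, str], str]  # (English caption, target-language text) -> label


def icu_predict(image: bytes, text: str,
                captioner: Captioner, classifier: TextClassifier) -> str:
    """Run the two stages in sequence: caption first, then text-only understanding."""
    caption = captioner(image)        # stage 1: V&L model, English output only
    return classifier(caption, text)  # stage 2: multilingual LM, never sees the image


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_captioner(image: bytes) -> str:
    return "a dog runs on the beach"


def toy_classifier(caption: str, text: str) -> str:
    # Placeholder rule, not a real cross-lingual model.
    return "entailment" if "dog" in caption and "Hund" in text else "neutral"


label = icu_predict(b"\x00", "Ein Hund ist draußen.", toy_captioner, toy_classifier)
print(label)
```

The point of the design is that the second stage only ever receives text, so any multilingual text model could in principle be swapped in without retraining the captioner.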
Pages: 14740-14746
Page count: 7