ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding

Cited by: 0

Authors:

Wu, Guojun [1]

Affiliation:

[1] Univ Zurich, Dept Computat Linguist, Zurich, Switzerland

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within a single model. However, the scarcity of multilingual image captions has hindered progress. To overcome this obstacle, we propose ICU (Image Caption Understanding), which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) then takes the caption as alt text and performs cross-lingual language understanding. The burden of multilingual processing is thus lifted off the V&L model and placed on the mLM. Since multilingual text data is relatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU achieves new state-of-the-art results for five languages and comparable results for the rest.
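
The two-stage division described in the abstract can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the names `Captioner`, `TextClassifier`, `TwoStagePipeline`, and both stub functions are hypothetical, and the stubs stand in for a real English captioning model and a real multilingual language model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases (not taken from the paper).
Captioner = Callable[[bytes], str]          # image bytes -> English caption
TextClassifier = Callable[[str, str], str]  # (English caption, target-language text) -> label


@dataclass
class TwoStagePipeline:
    """Stage 1 captions the image in English; stage 2 is text-only."""
    captioner: Captioner
    classifier: TextClassifier

    def predict(self, image: bytes, text: str) -> str:
        # Stage 1: the V&L model only has to produce an English caption.
        caption = self.captioner(image)
        # Stage 2: the multilingual model handles the cross-lingual part,
        # treating the caption as alt text for the image.
        return self.classifier(caption, text)


def stub_captioner(image: bytes) -> str:
    # Assumption: a fixed caption standing in for a real captioning model.
    return "a dog runs on the beach"


def stub_classifier(caption: str, text: str) -> str:
    # Assumption: a toy rule standing in for a real multilingual model.
    if "dog" in caption and "Hund" in text:
        return "entailment"
    return "neutral"


pipeline = TwoStagePipeline(stub_captioner, stub_classifier)
```

Because the second stage sees only text, either component could in principle be swapped independently, which is the property the abstract relies on when it moves the multilingual burden onto the mLM.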


Pages: 14740-14746

Page count: 7
