Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning

Authors
Cai, Rui [1, 2]
Dong, Jianfeng [1, 3]
Liang, Tianxiang [1, 2]
Liang, Yonghui [1, 2]
Wang, Yabing [1, 2]
Yang, Xun [4]
Wang, Xun [1, 2]
Wang, Meng [5]
Affiliations
[1] Zhejiang Gongshang Univ, Coll Comp Sci & Technol, Hangzhou 310035, Zhejiang, Peoples R China
[2] Zhejiang Key Lab Big Data & Future Ecommerce Techn, Hangzhou 310035, Peoples R China
[3] Univ Sci & Technol China, Hefei 230026, Peoples R China
[4] Univ Sci & Technol China, Sch Informat Sci & Technol, Dept Elect Engn & Informat Sci, Hefei 230026, Peoples R China
[5] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Noise; Data models; Videos; Noise robustness; Noise measurement; Task analysis; Visualization; Cross-Lingual transfer; cross-modal retrieval; machine translation; noise-robust fine-tuning; EMBEDDINGS;
DOI
10.1109/TKDE.2024.3400060
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-lingual cross-modal retrieval aims to leverage human-labeled annotations in a source language to build cross-modal retrieval models for a new target language, because manually annotated datasets are scarce in low-resource (target) languages. In contrast to the rapid progress in monolingual cross-modal retrieval, cross-modal retrieval in the cross-lingual scenario has received less attention. A straightforward way to obtain target-language labeled data is to translate source-language datasets with machine translation (MT). However, MT is imperfect and tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we propose Noise-Robust Fine-tuning (NRF), which tries to extract clean textual information from a possibly noisy target-language input under the guidance of its source-language counterpart. In addition, contrastive learning across different modalities is performed to strengthen the noise robustness of our model. Unlike traditional cross-modal retrieval methods, which employ only image/video-text paired data for fine-tuning, NRF relies on selected parallel data, which plays a key role in improving the noise-filtering ability of our model. Extensive experiments are conducted on three video-text and image-text retrieval benchmarks across different target languages, and the results demonstrate that our method significantly improves overall performance without using any image/video-text paired data in the target languages.
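The abstract mentions contrastive learning across modalities but does not give the loss. As an illustration only, the sketch below shows a generic symmetric InfoNCE loss between text and visual embeddings, which is one common form of cross-modal contrastive learning. It is not the NRF objective from the paper. The function names, the temperature value, and the synthetic embeddings are all assumptions made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss (generic, not the paper's loss).

    Row i of `a` and row i of `b` are treated as a positive pair;
    all other rows in the batch act as negatives.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # pairwise cosine similarities, scaled

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a retrieval directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

# Synthetic stand-ins (assumed): "visual" embeddings, plus text embeddings
# with light or heavy perturbation to mimic clean vs. noisy translations.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
clean_text = visual + 0.01 * rng.normal(size=(8, 16))
noisy_text = visual + 2.0 * rng.normal(size=(8, 16))

loss_clean = info_nce(clean_text, visual)
loss_noisy = info_nce(noisy_text, visual)
print("loss with lightly perturbed text:", loss_clean)
print("loss with heavily perturbed text:", loss_noisy)
```

In this toy setting, heavier perturbation of the text embeddings weakens the alignment with the visual embeddings, which loosely mirrors the abstract's point that translation noise corrupts textual embeddings and hurts retrieval.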
Pages: 5860-5873
Number of pages: 14