VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Times Cited: 0
Authors
Yin, Ziyi [1 ]
Ye, Muchao [1 ]
Zhang, Tianrong [1 ]
Du, Tianyu [2 ]
Zhu, Jinguo [3 ]
Liu, Han [4 ]
Chen, Jinghui [1 ]
Wang, Ting [5 ]
Ma, Fenglong [1 ]
Affiliations
[1] Penn State Univ, University Pk, PA 16802 USA
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Xi An Jiao Tong Univ, Xian, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
Funding
U.S. National Science Foundation;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we investigate a new yet practical task: crafting image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. In addition, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting from the outputs of the single-modal level. We conduct extensive experiments to attack five widely used VL pre-trained models on six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a blind spot in the deployment of pre-trained VL models.
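To make the two-level attack flow described in the abstract more concrete, below is a minimal, hypothetical PyTorch-style sketch. The interfaces `surrogate_encoder` (returns per-block features of the pre-trained VL image encoder), `victim_model` (the black-box fine-tuned model, returning a predicted label), and `text_attack` (yields candidate adversarial texts) are assumptions introduced for illustration only; they are not the paper's released code, and the ICSA step is heavily simplified.

```python
# Hypothetical sketch of the VLATTACK pipeline outlined in the abstract.
# All interfaces below (surrogate_encoder, victim_model, text_attack) are assumed.
import torch
import torch.nn.functional as F


def blockwise_similarity_attack(image, surrogate_encoder, eps=8 / 255, alpha=2 / 255, steps=10):
    """Single-modal image attack (BSA-style sketch): push the per-block features of
    the perturbed image away from those of the clean image in the surrogate encoder."""
    clean_feats = [f.detach() for f in surrogate_encoder(image)]  # list of per-block features
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feats = surrogate_encoder(image + delta)
        # Maximize block-wise dissimilarity (1 - cosine similarity), summed over blocks.
        loss = sum(
            (1 - F.cosine_similarity(a.flatten(1), c.flatten(1), dim=-1)).mean()
            for a, c in zip(adv_feats, clean_feats)
        )
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient-ascent step on the perturbation
            delta.clamp_(-eps, eps)             # stay within the L-inf budget
        delta.grad.zero_()
    # Assumes pixel values in [0, 1].
    return (image + delta).clamp(0, 1).detach()


def vlattack(image, text, victim_model, surrogate_encoder, text_attack):
    """Single-modal attacks first, then an iterative cross-search over adversarial
    image-text pairs (a simplified stand-in for ICSA)."""
    y_clean = victim_model(image, text)

    # Single-modal level: image-only attack via BSA.
    adv_image = blockwise_similarity_attack(image, surrogate_encoder)
    if victim_model(adv_image, text) != y_clean:
        return adv_image, text

    # Single-modal level: text-only attack with an off-the-shelf text attack.
    candidates = list(text_attack(text))
    for adv_text in candidates:
        if victim_model(image, adv_text) != y_clean:
            return image, adv_text

    # Multimodal level: iterate over text candidates and periodically refresh the
    # image perturbation (the paper's ICSA update is more involved than this).
    for adv_text in candidates:
        adv_image = blockwise_similarity_attack(adv_image, surrogate_encoder)
        if victim_model(adv_image, adv_text) != y_clean:
            return adv_image, adv_text

    return adv_image, text  # attack failed; fall back to the single-modal result
```

The sketch only mirrors the ordering stated in the abstract (image attack, then text attack, then cross-search over image-text pairs starting from the single-modal outputs); the actual BSA loss, ICSA schedule, and query budget are defined in the paper itself.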
Pages: 21