VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

Times Cited: 0
Authors
Yin, Ziyi [1 ]
Ye, Muchao [1 ]
Zhang, Tianrong [1 ]
Du, Tianyu [2 ]
Zhu, Jinguo [3 ]
Liu, Han [4 ]
Chen, Jinghui [1 ]
Wang, Ting [5 ]
Ma, Fenglong [1 ]
Affiliations
[1] Penn State Univ, University Pk, PA 16802 USA
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Xi An Jiao Tong Univ, Xian, Peoples R China
[4] Dalian Univ Technol, Dalian, Peoples R China
[5] SUNY Stony Brook, Stony Brook, NY USA
Funding
U.S. National Science Foundation;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision-Language (VL) pre-trained models have shown their superiority on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly focus on the adversarial robustness under the white-box setting, which is unrealistic. In this paper, we investigate a new yet practical task: crafting image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks. Towards this end, we propose VLATTACK to generate adversarial samples by fusing perturbations of images and texts from both single-modal and multimodal levels. At the single-modal level, we propose a new block-wise similarity attack (BSA) strategy to learn image perturbations for disrupting universal representations. In addition, we adopt an existing text attack strategy to generate text perturbations independent of the image-modal attack. At the multimodal level, we design a novel iterative cross-search attack (ICSA) method to update adversarial image-text pairs periodically, starting from the outputs of the single-modal level. We conduct extensive experiments to attack five widely used VL pre-trained models on six tasks. Experimental results show that VLATTACK achieves the highest attack success rates on all tasks compared with state-of-the-art baselines, revealing a blind spot in the deployment of pre-trained VL models.
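To make the two-level attack flow described in the abstract more concrete, below is a minimal, hypothetical PyTorch-style sketch. The interfaces `surrogate_encoder` (returns per-block features of the pre-trained VL image encoder), `victim_model` (the black-box fine-tuned model, returning a predicted label), and `text_attack` (yields candidate adversarial texts) are assumptions introduced for illustration only; they are not the paper's released code, and the ICSA step is heavily simplified.

```python
# Hypothetical sketch of the VLATTACK pipeline outlined in the abstract.
# All interfaces below (surrogate_encoder, victim_model, text_attack) are assumed.
import torch
import torch.nn.functional as F


def blockwise_similarity_attack(image, surrogate_encoder, eps=8 / 255, alpha=2 / 255, steps=10):
    """Single-modal image attack (BSA-style sketch): push the per-block features of
    the perturbed image away from those of the clean image in the surrogate encoder."""
    clean_feats = [f.detach() for f in surrogate_encoder(image)]  # list of per-block features
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_feats = surrogate_encoder(image + delta)
        # Maximize block-wise dissimilarity (1 - cosine similarity), summed over blocks.
        loss = sum(
            (1 - F.cosine_similarity(a.flatten(1), c.flatten(1), dim=-1)).mean()
            for a, c in zip(adv_feats, clean_feats)
        )
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient-ascent step on the perturbation
            delta.clamp_(-eps, eps)             # stay within the L-inf budget
        delta.grad.zero_()
    # Assumes pixel values in [0, 1].
    return (image + delta).clamp(0, 1).detach()


def vlattack(image, text, victim_model, surrogate_encoder, text_attack):
    """Single-modal attacks first, then an iterative cross-search over adversarial
    image-text pairs (a simplified stand-in for ICSA)."""
    y_clean = victim_model(image, text)

    # Single-modal level: image-only attack via BSA.
    adv_image = blockwise_similarity_attack(image, surrogate_encoder)
    if victim_model(adv_image, text) != y_clean:
        return adv_image, text

    # Single-modal level: text-only attack with an off-the-shelf text attack.
    candidates = list(text_attack(text))
    for adv_text in candidates:
        if victim_model(image, adv_text) != y_clean:
            return image, adv_text

    # Multimodal level: iterate over text candidates and periodically refresh the
    # image perturbation (the paper's ICSA update is more involved than this).
    for adv_text in candidates:
        adv_image = blockwise_similarity_attack(adv_image, surrogate_encoder)
        if victim_model(adv_image, adv_text) != y_clean:
            return adv_image, adv_text

    return adv_image, text  # attack failed; fall back to the single-modal result
```

The sketch only mirrors the ordering stated in the abstract (image attack, then text attack, then cross-search over image-text pairs starting from the single-modal outputs); the actual BSA loss, ICSA schedule, and query budget are defined in the paper itself.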
Pages: 21