ModelShield: Adaptive and Robust Watermark Against Model Extraction Attack

Cited by: 0
Authors
Pang, Kaiyi [1 ]
Qi, Tao [2 ]
Wu, Chuhan [3 ]
Bai, Minhao [4 ]
Jiang, Minghu [1 ]
Huang, Yongfeng [4 ,5 ]
Affiliations
[1] Tsinghua Univ, Sch Humanities, Beijing 100084, Peoples R China
[2] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[3] Huawei Technol Co Ltd, Beijing 100077, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[5] Zhongguancun Lab, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; model extraction attack; text watermarking; model IP protection;
DOI
10.1109/TIFS.2025.3530691
CLC Number
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, which enhances the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still use model extraction attacks to steal the model intelligence encoded in the generated content. Watermarking offers a promising defense against such attacks by embedding unique identifiers into model-generated content. However, existing watermarking methods often compromise the quality of generated content through heuristic alterations and lack robust mechanisms to counteract adversarial strategies, limiting their practicality in real-world scenarios. In this paper, we introduce ModelShield, an adaptive and robust watermarking method for protecting the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content, avoiding degradation of content quality. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. Moreover, ModelShield is a plug-and-play method that requires no additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in defense effectiveness and robustness while significantly reducing the quality degradation that watermarking imposes on model-generated content.
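Since this record gives only a high-level summary, the following Python sketch is an illustrative reconstruction of the two ideas the abstract names, not the authors' implementation: a hidden instruction asks the LLM to embed agreed lexical signals itself (self-watermarking, with no extra training and no decoding-time edits), and detection aggregates signal statistics across many suspect outputs so that paraphrasing or filtering individual responses cannot erase the evidence. The instruction text, the word-frequency score, and the 0.3 threshold are all assumptions made for illustration.

```python
# Hypothetical sketch only: the abstract does not disclose ModelShield's
# actual prompts, signal design, or detection statistic, so every name
# and constant below is an assumption for illustration.
from collections import Counter

# Assumed instruction prepended to each query so the LLM embeds the
# watermark itself, rather than having a decoder alter its output.
WATERMARK_INSTRUCTION = (
    "While answering, naturally prefer words from this private list "
    "whenever they fit the context: {signal_words}."
)

def self_watermarked_prompt(user_query: str, signal_words: list[str]) -> str:
    """Wrap the user's query so the model inserts lexical signals on its own."""
    instruction = WATERMARK_INSTRUCTION.format(signal_words=", ".join(signal_words))
    return f"{instruction}\n\nUser: {user_query}"

def watermark_score(text: str, signal_words: list[str]) -> float:
    """Fraction of signal words that appear in one suspect output."""
    tokens = Counter(text.lower().split())
    hits = sum(1 for word in signal_words if tokens[word.lower()] > 0)
    return hits / len(signal_words)

def is_extracted(outputs: list[str], signal_words: list[str],
                 threshold: float = 0.3) -> bool:
    """Average the score over many outputs: aggregation keeps detection
    robust when an adversary paraphrases or filters single responses."""
    mean_score = sum(watermark_score(o, signal_words) for o in outputs) / len(outputs)
    return mean_score > threshold
```

Averaging over a corpus of suspect outputs, rather than testing each response in isolation, is what gives a detector of this kind tolerance to per-response adversarial edits; the abstract's robustness claim suggests ModelShield's detector works at a similar aggregate level, though its actual statistic is not specified here.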
Pages: 1767-1782
Number of pages: 16
Related Papers
50 records in total
  • [1] MEA-Defender: A Robust Watermark against Model Extraction Attack
    Lv, Peizhuo
    Ma, Hualong
    Chen, Kai
    Zhou, Jiachen
    Zhang, Shengzhi
    Liang, Ruigang
    Zhu, Shenchen
    Li, Pan
    Zhang, Yingjun
    45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024, 2024, : 2515 - 2533
  • [2] Robust Adversarial Watermark Defending Against GAN Synthesization Attack
    Xu, Shengwang
    Qiao, Tong
    Xu, Ming
    Wang, Wei
    Zheng, Ning
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 351 - 355
  • [3] Adversarial Attack for Robust Watermark Protection Against Inpainting-based and Blind Watermark Removers
    Lyu, Mingzhi
    Huang, Yi
    Kong, Adams Wai-Kin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8396 - 8405
  • [4] Adversarial watermark: A robust and reliable watermark against removal
    Wang, Jinwei
    Huang, Wanyun
    Zhang, Jiawei
    Luo, Xiangyang
    Ma, Bin
    JOURNAL OF INFORMATION SECURITY AND APPLICATIONS, 2024, 82
  • [5] A channel model for a watermark attack
    Su, JK
    Hartung, F
    Girod, B
    SECURITY AND WATERMARKING OF MULTIMEDIA CONTENTS, 1999, 3657 : 159 - 170
  • [6] Exposing Model Theft: A Robust and Transferable Watermark for Thwarting Model Extraction Attacks
    Tang, Ruixiang
    Jin, Hongye
    Du, Mengnan
    Wigington, Curtis
    Jain, Rajiv
    Hu, Xia
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 4315 - 4319
  • [7] LWT-DSR based new robust framework for watermark extraction under intentional attack conditions
    Verma, Vivek Singh
    Jha, Rajib Kumar
    JOURNAL OF THE FRANKLIN INSTITUTE-ENGINEERING AND APPLIED MATHEMATICS, 2017, 354 (14): 6422 - 6449
  • [8] SPY-Watermark: Robust Invisible Watermarking for Backdoor Attack
    Wang, Ruofei
    Wan, Renjie
    Guo, Zongyu
    Guo, Qing
    Huang, Rui
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 2700 - 2704
  • [9] Invisible DNN Watermarking Against Model Extraction Attack
    Xi, Zuping
    Qu, Zuomin
    Lu, Wei
    Luo, Xiangyang
    Cao, Xiaochun
    IEEE TRANSACTIONS ON CYBERNETICS, 2025, 55 (02) : 800 - 811