ModelShield: Adaptive and Robust Watermark Against Model Extraction Attack

Cited by: 0
Authors
Pang, Kaiyi [1 ]
Qi, Tao [2 ]
Wu, Chuhan [3 ]
Bai, Minhao [4 ]
Jiang, Minghu [1 ]
Huang, Yongfeng [4 ,5 ]
Affiliations
[1] Tsinghua Univ, Sch Humanities, Beijing 100084, Peoples R China
[2] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
[3] Huawei Technol Co Ltd, Beijing 100077, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[5] Zhongguancun Lab, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; model extraction attack; text watermarking; model IP protection;
DOI
10.1109/TIFS.2025.3530691
CLC Number
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, which enhances the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still use model extraction attacks to steal the model intelligence encoded in the generated content. Watermarking offers a promising defense against such attacks by embedding unique identifiers into model-generated content. However, existing watermarking methods often compromise the quality of generated content through heuristic alterations and lack robust mechanisms to counteract adversarial strategies, limiting their practicality in real-world scenarios. In this paper, we introduce ModelShield, an adaptive and robust watermarking method for protecting the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content, avoiding degradation of content quality. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. Moreover, ModelShield is a plug-and-play method that requires no additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in defense effectiveness and robustness while significantly reducing the quality degradation that watermarking imposes on model-generated content.
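Since this record gives only a high-level summary, the following Python sketch is an illustrative reconstruction of the two ideas the abstract names, not the authors' implementation: a hidden instruction asks the LLM to embed agreed lexical signals itself (self-watermarking, with no extra training and no decoding-time edits), and detection aggregates signal statistics across many suspect outputs so that paraphrasing or filtering individual responses cannot erase the evidence. The instruction text, the word-frequency score, and the 0.3 threshold are all assumptions made for illustration.

```python
# Hypothetical sketch only: the abstract does not disclose ModelShield's
# actual prompts, signal design, or detection statistic, so every name
# and constant below is an assumption for illustration.
from collections import Counter

# Assumed instruction prepended to each query so the LLM embeds the
# watermark itself, rather than having a decoder alter its output.
WATERMARK_INSTRUCTION = (
    "While answering, naturally prefer words from this private list "
    "whenever they fit the context: {signal_words}."
)

def self_watermarked_prompt(user_query: str, signal_words: list[str]) -> str:
    """Wrap the user's query so the model inserts lexical signals on its own."""
    instruction = WATERMARK_INSTRUCTION.format(signal_words=", ".join(signal_words))
    return f"{instruction}\n\nUser: {user_query}"

def watermark_score(text: str, signal_words: list[str]) -> float:
    """Fraction of signal words that appear in one suspect output."""
    tokens = Counter(text.lower().split())
    hits = sum(1 for word in signal_words if tokens[word.lower()] > 0)
    return hits / len(signal_words)

def is_extracted(outputs: list[str], signal_words: list[str],
                 threshold: float = 0.3) -> bool:
    """Average the score over many outputs: aggregation keeps detection
    robust when an adversary paraphrases or filters single responses."""
    mean_score = sum(watermark_score(o, signal_words) for o in outputs) / len(outputs)
    return mean_score > threshold
```

Averaging over a corpus of suspect outputs, rather than testing each response in isolation, is what gives a detector of this kind tolerance to per-response adversarial edits; the abstract's robustness claim suggests ModelShield's detector works at a similar aggregate level, though its actual statistic is not specified here.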
Pages: 1767-1782
Number of pages: 16
Related Papers
50 records in total
  • [1] MEA-Defender: A Robust Watermark against Model Extraction Attack
    Lv, Peizhuo
    Ma, Hualong
    Chen, Kai
    Zhou, Jiachen
    Zhang, Shengzhi
    Liang, Ruigang
    Zhu, Shenchen
    Li, Pan
    Zhang, Yingjun
    45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024, 2024, : 2515 - 2533
  • [2] Robust Adversarial Watermark Defending Against GAN Synthesization Attack
    Xu, Shengwang
    Qiao, Tong
    Xu, Ming
    Wang, Wei
    Zheng, Ning
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 351 - 355
  • [3] Adversarial Attack for Robust Watermark Protection Against Inpainting-based and Blind Watermark Removers
    Lyu, Mingzhi
    Huang, Yi
    Kong, Adams Wai-Kin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 8396 - 8405
  • [4] Adversarial watermark: A robust and reliable watermark against removal
    Wang, Jinwei
    Huang, Wanyun
    Zhang, Jiawei
    Luo, Xiangyang
    Ma, Bin
    JOURNAL OF INFORMATION SECURITY AND APPLICATIONS, 2024, 82
  • [5] A channel model for a watermark attack
    Su, JK
    Hartung, F
    Girod, B
    SECURITY AND WATERMARKING OF MULTIMEDIA CONTENTS, 1999, 3657 : 159 - 170
  • [6] Exposing Model Theft: A Robust and Transferable Watermark for Thwarting Model Extraction Attacks
    Tang, Ruixiang
    Jin, Hongye
    Du, Mengnan
    Wigington, Curtis
    Jain, Rajiv
    Hu, Xia
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 4315 - 4319
  • [7] LWT-DSR based new robust framework for watermark extraction under intentional attack conditions
    Verma, Vivek Singh
    Jha, Rajib Kumar
    JOURNAL OF THE FRANKLIN INSTITUTE-ENGINEERING AND APPLIED MATHEMATICS, 2017, 354 (14): 6422 - 6449
  • [8] SPY-Watermark: Robust Invisible Watermarking for Backdoor Attack
    Wang, Ruofei
    Wan, Renjie
    Guo, Zongyu
    Guo, Qing
    Huang, Rui
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 2700 - 2704
  • [9] Invisible DNN Watermarking Against Model Extraction Attack
    Xi, Zuping
    Qu, Zuomin
    Lu, Wei
    Luo, Xiangyang
    Cao, Xiaochun
    IEEE TRANSACTIONS ON CYBERNETICS, 2025, 55 (02) : 800 - 811