ModelShield: Adaptive and Robust Watermark Against Model Extraction Attack

IEEE Transactions on Information Forensics and Security · Impact Factor: 8.0 · CAS Region 1 (Computer Science) · JCR Q1 (Computer Science, Theory & Methods) · Published: 2025-01-16 · DOI: 10.1109/TIFS.2025.3530691
Kaiyi Pang;Tao Qi;Chuhan Wu;Minhao Bai;Minghu Jiang;Yongfeng Huang
Volume 20, pages 1767-1782 · https://ieeexplore.ieee.org/document/10843740/
Citations: 0

Abstract

Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, thereby enhancing the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still use model extraction attacks to steal the model intelligence encoded in the generated outputs. Watermarking technology offers a promising defense against such attacks by embedding unique identifiers into the model-generated content. However, existing watermarking methods often compromise the quality of generated content through heuristic alterations and lack robust mechanisms to counteract adversarial strategies, limiting their practicality in real-world scenarios. In this paper, we introduce an adaptive and robust watermarking method (named ModelShield) to protect the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content, avoiding degradation of the output. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. In addition, ModelShield is a plug-and-play method that requires no additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in defense effectiveness and robustness while significantly reducing the degradation that watermarking causes to the model-generated content.
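The abstract does not disclose ModelShield's implementation details. As a generic illustration of the detection idea only (statistical hypothesis testing over a keyed partition of tokens, in the style of green-list text watermarks — not ModelShield's actual mechanism), a minimal hypothetical sketch in Python; the key name and threshold are assumptions:

```python
import hashlib
import math

def is_green(token: str, key: str = "demo-key") -> bool:
    # A keyed hash deterministically assigns each token to the "green" half
    # of the vocabulary; without the key, the partition looks random.
    return hashlib.sha256((key + token).encode()).digest()[0] < 128

def detect_z(tokens: list[str], key: str = "demo-key", ratio: float = 0.5) -> float:
    # One-sided z-test: under no watermark, each token is green with
    # probability `ratio`; a watermarked generator over-samples green
    # tokens, pushing the z-score well above the ~2-3 significance range.
    n = len(tokens)
    greens = sum(is_green(t, key) for t in tokens)
    return (greens - ratio * n) / math.sqrt(n * ratio * (1 - ratio))
```

In such a scheme, outputs from a suspected extracted model are tokenized and scored; a large positive z-score indicates the watermark signal survived the extraction process.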
Source journal: IEEE Transactions on Information Forensics and Security (Engineering Technology — Engineering: Electronic & Electrical)
CiteScore: 14.40
Self-citation rate: 7.40%
Articles per year: 234
Review time: 6.5 months
Journal description: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.