A safety realignment framework via subspace-oriented model fusion for large language models

IF 7.2 | CAS Region 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Knowledge-Based Systems | Pub Date: 2024-11-09 | DOI: 10.1016/j.knosys.2024.112701
Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
{"title":"A safety realignment framework via subspace-oriented model fusion for large language models","authors":"Xin Yi ,&nbsp;Shunfan Zheng ,&nbsp;Linlin Wang ,&nbsp;Xiaoling Wang ,&nbsp;Liang He","doi":"10.1016/j.knosys.2024.112701","DOIUrl":null,"url":null,"abstract":"<div><div>To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise the safety of a task-specific model, making it susceptible to obeying malicious instructions and generating harmful content. Current methods against fine-tuning attack usually interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two major challenges, we propose reusing the initial aligned model and realigning task-specific model in the safety subspace. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to transfer the safeguard capabilities of an initially aligned model into the current task-specific model. Our approach begins by disentangling all task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors by subspace masking techniques. Finally, we fuse the initial safely aligned LLM with all task vectors based on the identified safety subspace to restore the model’s safety properties. Our experiments confirm that our safety realignment framework satisfies the safety requirements of an independent task-specific model as well as traditional multitask models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on specific tasks while exhibiting higher data efficiency. The code is publicly available at <span><span>https://github.com/xinykou/safety_realignment</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"306 ","pages":"Article 112701"},"PeriodicalIF":7.2000,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124013352","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise the safety of a task-specific model, making it susceptible to obeying malicious instructions and generating harmful content. Current methods against fine-tuning attacks usually interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two major challenges, we propose reusing the initially aligned model and realigning the task-specific model in the safety subspace. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to transfer the safeguard capabilities of an initially aligned model into the current task-specific model. Our approach begins by disentangling all task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors using subspace masking techniques. Finally, we fuse the initially safety-aligned LLM with all task vectors based on the identified safety subspace to restore the model's safety properties. Our experiments confirm that our safety realignment framework satisfies the safety requirements of an independent task-specific model as well as of traditional multitask models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on specific tasks, while exhibiting higher data efficiency. The code is publicly available at https://github.com/xinykou/safety_realignment.
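The abstract outlines three steps: extract a task vector (fine-tuned parameters minus initially aligned parameters), locate safety-critical coordinates with a subspace mask, and fuse the aligned model with the masked task vectors. The sketch below is a minimal PyTorch rendering of that pipeline under one loud assumption: the safety masks are supplied as per-parameter binary tensors (1 marking safety-critical coordinates), whereas the paper derives them with its subspace masking technique. `task_vector`, `somf_fuse`, and `scale` are illustrative names, not the authors' API.

```python
import torch

def task_vector(aligned, finetuned):
    # Task vector: the parameter delta introduced by task-specific fine-tuning.
    return {k: finetuned[k] - aligned[k] for k in aligned}

def somf_fuse(aligned, task_vectors, safety_masks, scale=1.0):
    # Start from the initially aligned weights and add each task vector only
    # outside its safety-critical region: mask == 1 marks safety-critical
    # coordinates, where the aligned weights are kept unchanged.
    fused = {k: v.clone() for k, v in aligned.items()}
    for tv, mask in zip(task_vectors, safety_masks):
        for k in fused:
            fused[k] += scale * (1.0 - mask[k]) * tv[k]
    return fused

if __name__ == "__main__":
    # Toy demo on random tensors standing in for model parameters.
    torch.manual_seed(0)
    aligned = {"w": torch.randn(4, 4)}
    finetuned = {"w": aligned["w"] + 0.1 * torch.randn(4, 4)}
    tv = task_vector(aligned, finetuned)
    mask = {"w": (torch.rand(4, 4) < 0.2).float()}  # placeholder safety mask
    fused = somf_fuse(aligned, [tv], [mask])
    # Inside the safety-critical region, fusion falls back to aligned weights.
    crit = mask["w"] == 1
    assert torch.allclose(fused["w"][crit], aligned["w"][crit])
```

Where a mask entry is 1, the fused weights revert to the initially aligned model, which is the mechanism the abstract credits with restoring safety; everywhere else the task-specific update is retained, which is why task performance is largely preserved.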
Source journal

Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 14.80
Self-citation rate: 12.50%
Publication volume: 1245 articles
Review turnaround: 7.8 months

Journal introduction: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based systems and other systems built on artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.
Latest articles in this journal

Progressive de-preference task-specific processing for generalizable person re-identification
GKA-GPT: Graphical knowledge aggregation for multiturn dialog generation
A novel spatio-temporal feature interleaved contrast learning neural network from a robustness perspective
PSNet: A non-uniform illumination correction method for underwater images based pseudo-siamese network
A novel domain-private-suppress meta-recognition network based universal domain generalization for machinery fault diagnosis