A safety realignment framework via subspace-oriented model fusion for large language models

IF 7.2 | CAS Region 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Knowledge-Based Systems | Pub Date: 2024-11-09 | DOI: 10.1016/j.knosys.2024.112701
Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
{"title":"A safety realignment framework via subspace-oriented model fusion for large language models","authors":"Xin Yi ,&nbsp;Shunfan Zheng ,&nbsp;Linlin Wang ,&nbsp;Xiaoling Wang ,&nbsp;Liang He","doi":"10.1016/j.knosys.2024.112701","DOIUrl":null,"url":null,"abstract":"<div><div>To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise the safety of a task-specific model, making it susceptible to obeying malicious instructions and generating harmful content. Current methods against fine-tuning attack usually interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two major challenges, we propose reusing the initial aligned model and realigning task-specific model in the safety subspace. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to transfer the safeguard capabilities of an initially aligned model into the current task-specific model. Our approach begins by disentangling all task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors by subspace masking techniques. Finally, we fuse the initial safely aligned LLM with all task vectors based on the identified safety subspace to restore the model’s safety properties. Our experiments confirm that our safety realignment framework satisfies the safety requirements of an independent task-specific model as well as traditional multitask models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on specific tasks while exhibiting higher data efficiency. The code is publicly available at <span><span>https://github.com/xinykou/safety_realignment</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"306 ","pages":"Article 112701"},"PeriodicalIF":7.2000,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124013352","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise the safety of a task-specific model, making it susceptible to obeying malicious instructions and generating harmful content. Current methods against fine-tuning attacks usually interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two major challenges, we propose reusing the initially aligned model and realigning the task-specific model in the safety subspace. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to transfer the safeguard capabilities of an initially aligned model into the current task-specific model. Our approach begins by disentangling all task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors using subspace masking techniques. Finally, we fuse the initially safety-aligned LLM with all task vectors based on the identified safety subspace to restore the model's safety properties. Our experiments confirm that our safety realignment framework satisfies the safety requirements of an independent task-specific model as well as of traditional multitask models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on specific tasks, while exhibiting higher data efficiency. The code is publicly available at https://github.com/xinykou/safety_realignment.
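The abstract outlines three steps: extract a task vector (fine-tuned parameters minus initially aligned parameters), locate safety-critical coordinates with a subspace mask, and fuse the aligned model with the masked task vectors. The sketch below is a minimal PyTorch rendering of that pipeline under one loud assumption: the safety masks are supplied as per-parameter binary tensors (1 marking safety-critical coordinates), whereas the paper derives them with its subspace masking technique. `task_vector`, `somf_fuse`, and `scale` are illustrative names, not the authors' API.

```python
import torch

def task_vector(aligned, finetuned):
    # Task vector: the parameter delta introduced by task-specific fine-tuning.
    return {k: finetuned[k] - aligned[k] for k in aligned}

def somf_fuse(aligned, task_vectors, safety_masks, scale=1.0):
    # Start from the initially aligned weights and add each task vector only
    # outside its safety-critical region: mask == 1 marks safety-critical
    # coordinates, where the aligned weights are kept unchanged.
    fused = {k: v.clone() for k, v in aligned.items()}
    for tv, mask in zip(task_vectors, safety_masks):
        for k in fused:
            fused[k] += scale * (1.0 - mask[k]) * tv[k]
    return fused

if __name__ == "__main__":
    # Toy demo on random tensors standing in for model parameters.
    torch.manual_seed(0)
    aligned = {"w": torch.randn(4, 4)}
    finetuned = {"w": aligned["w"] + 0.1 * torch.randn(4, 4)}
    tv = task_vector(aligned, finetuned)
    mask = {"w": (torch.rand(4, 4) < 0.2).float()}  # placeholder safety mask
    fused = somf_fuse(aligned, [tv], [mask])
    # Inside the safety-critical region, fusion falls back to aligned weights.
    crit = mask["w"] == 1
    assert torch.allclose(fused["w"][crit], aligned["w"][crit])
```

Where a mask entry is 1, the fused weights revert to the initially aligned model, which is the mechanism the abstract credits with restoring safety; everywhere else the task-specific update is retained, which is why task performance is largely preserved.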
Source journal

Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 14.80
Self-citation rate: 12.50%
Publication volume: 1245 articles
Review turnaround: 7.8 months

Journal introduction: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based systems and other systems built on artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.
Latest articles in this journal

Progressive de-preference task-specific processing for generalizable person re-identification
GKA-GPT: Graphical knowledge aggregation for multiturn dialog generation
A novel spatio-temporal feature interleaved contrast learning neural network from a robustness perspective
PSNet: A non-uniform illumination correction method for underwater images based pseudo-siamese network
A novel domain-private-suppress meta-recognition network based universal domain generalization for machinery fault diagnosis