CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration
Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
arXiv:2409.11365 · arXiv - CS - Computation and Language · 2024-09-17
Abstract
The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). These MLLMs are typically built upon LLMs, with an image encoder that maps images into the token embedding space of the LLM. However, the integration of the visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the underlying LLM has been trained on textual data to align with human values. In this paper, we first raise the question: "Do MLLMs possess safety awareness against malicious image inputs?" We find that after adding a principle specifying the safety requirement to the MLLM's input, the model's safety awareness is boosted. This phenomenon verifies that the MLLM retains safety awareness against image inputs; it is merely weakened by the modality gap. We then introduce a simple yet effective technique, termed CoCA, which amplifies the safety awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.
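
The abstract describes CoCA as calibrating the MLLM's output distribution using a safety principle. Below is a minimal sketch of one way such a logit-space calibration could look at decoding time, assuming a HuggingFace-style causal model interface; the helper name `calibrated_next_token_dist`, the `alpha` strength parameter, and the exact form of the adjustment are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def calibrated_next_token_dist(model, base_ids, principled_ids, alpha=1.0):
    """Hypothetical sketch of a constitutional-calibration-style decoding step.

    base_ids:       token ids for the plain prompt (image tokens + user query)
    principled_ids: token ids for the same prompt with a safety principle prepended
    alpha:          assumed calibration strength; alpha = 0 recovers the base model
    """
    with torch.no_grad():
        # Next-token logits without and with the safety principle in the input.
        base_logits = model(input_ids=base_ids).logits[:, -1, :]
        principled_logits = model(input_ids=principled_ids).logits[:, -1, :]

    # Amplify the shift that the safety principle induces in logit space,
    # then renormalize to obtain a calibrated next-token distribution.
    calibrated_logits = base_logits + alpha * (principled_logits - base_logits)
    return torch.softmax(calibrated_logits, dim=-1)
```

A training-free adjustment of this kind would only change how the model's existing logits are combined at inference time, which is consistent with the abstract's claim that the model regains safety awareness without retraining or losing its original capabilities.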