As Large Language Models (LLMs) have become integral to numerous practical applications, ensuring their robustness and safety is critical. Although advances in alignment techniques have significantly improved overall safety, LLMs remain susceptible to adversarial inputs designed to exploit vulnerabilities. Existing adversarial attack methods have notable limitations: discrete token-level methods are inefficient, whereas continuous optimization methods typically fail to produce valid tokens from the model's vocabulary, making them impractical for real-world use.
In this paper, we propose Regularized Relaxation, a novel adversarial attack technique that overcomes these limitations by leveraging regularized gradients: the continuous relaxation is optimized with a regularization term that encourages the optimized embeddings to stay close to valid token representations. This enables continuous optimization to produce discrete tokens drawn directly from the model's vocabulary while preserving attack effectiveness. Our approach achieves a two-order-of-magnitude speedup over the state-of-the-art greedy coordinate gradient-based method, and it outperforms other recent methods in runtime and efficiency while consistently achieving higher attack success rates on the majority of tested models and datasets. Crucially, unlike previous continuous optimization approaches, our method produces valid tokens directly from the model's vocabulary. We demonstrate the effectiveness of our attack through extensive experiments on five state-of-the-art LLMs across four diverse datasets. Our implementation is publicly available at: https://github.com/sj21j/Regularized_Relaxation.
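To make the general idea concrete, the following is a minimal, self-contained PyTorch sketch of this style of regularized relaxation, not the paper's exact procedure: the toy attack_loss, the squared-L2 nearest-token penalty, the Adam optimizer, the random initialization, and all hyperparameters (reg_weight, learning rate, step count) are illustrative assumptions standing in for the real attack objective, which would be computed with an actual LLM and tokenizer.

# Sketch: optimize continuous "soft" embeddings for an adversarial suffix with a
# penalty pulling each position toward its nearest vocabulary embedding, then
# snap the result to discrete tokens. All objects below are toy stand-ins.

import torch

torch.manual_seed(0)

vocab_size, embed_dim, suffix_len = 1000, 64, 8
embedding_matrix = torch.randn(vocab_size, embed_dim)   # stand-in for model.get_input_embeddings().weight
target_direction = torch.randn(embed_dim)                # stand-in for the attack objective's gradient signal

def attack_loss(suffix_embeds: torch.Tensor) -> torch.Tensor:
    # Placeholder for the real objective (e.g., negative log-likelihood of a
    # target completion given prompt + suffix); here we simply push the
    # embeddings toward a fixed direction so the example runs without a model.
    return -(suffix_embeds @ target_direction).mean()

def nearest_token_penalty(suffix_embeds: torch.Tensor) -> torch.Tensor:
    # Regularizer: squared L2 distance from each optimized embedding to its
    # closest row of the embedding matrix, keeping the relaxation near valid
    # token representations so discretization loses little.
    dists = torch.cdist(suffix_embeds, embedding_matrix)  # (suffix_len, vocab_size)
    return (dists.min(dim=-1).values ** 2).mean()

# Initialize the relaxed suffix from random valid tokens.
init_ids = torch.randint(vocab_size, (suffix_len,))
suffix_embeds = embedding_matrix[init_ids].clone().requires_grad_(True)
optimizer = torch.optim.Adam([suffix_embeds], lr=0.05)
reg_weight = 0.1  # assumed value; tuned in practice

for step in range(200):
    optimizer.zero_grad()
    loss = attack_loss(suffix_embeds) + reg_weight * nearest_token_penalty(suffix_embeds)
    loss.backward()
    optimizer.step()

# Discretize: map each optimized embedding to its nearest vocabulary token,
# yielding an adversarial suffix composed entirely of valid tokens.
adv_token_ids = torch.cdist(suffix_embeds.detach(), embedding_matrix).argmin(dim=-1)
print(adv_token_ids.tolist())

Because the regularizer keeps the relaxed embeddings near rows of the embedding matrix throughout optimization, the final nearest-neighbor projection changes the objective value only slightly, which is what allows the continuous search to end in usable discrete tokens.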
