{"title":"Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans","authors":"Jon Vadillo, Roberto Santana, Jose A. Lozano","doi":"10.1002/widm.1567","DOIUrl":null,"url":null,"abstract":"Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"54 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"WIREs Data Mining and Knowledge Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/widm.1567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Among the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks against explainable machine learning models. First, we extend the notion of adversarial examples to explainable machine learning scenarios, in which a human assesses not only the input and the output classification but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors, such as the type of problem, the user's expertise, or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions are intended to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
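As a concrete illustration of this extended notion of an adversarial example, the sketch below (not taken from the paper) applies an FGSM-style perturbation to a toy linear classifier so that the prediction flips, and then compares a simple input-gradient ("saliency") explanation before and after the attack; in the setting surveyed here, a successful attack would also have to keep this explanation convincing to the human assessor. The model, weights, and perturbation budget are hypothetical toy choices.

# Minimal sketch (not from the paper): an FGSM-style adversarial example for a
# toy linear classifier, plus an input-gradient ("saliency") explanation that
# a human assessor might inspect alongside the prediction.
# All weights and the perturbation budget eps are hypothetical toy values.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classifier: p(y=1 | x) = sigmoid(w . x + b)
w = rng.normal(size=10)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x):
    return sigmoid(w @ x + b)

def saliency(x):
    # Input gradient of the logit, a simple explanation of the decision.
    # For a linear model it is constant (= w); for deep models it varies with x.
    return w

def fgsm(x, y_true, eps):
    # Fast Gradient Sign Method step on the binary cross-entropy loss:
    # for this model dL/dx = (p - y) * w; move eps along its sign to raise the loss.
    p = predict_proba(x)
    grad = (p - y_true) * w
    return x + eps * np.sign(grad)

x = rng.normal(size=10)
y = 1.0 if predict_proba(x) >= 0.5 else 0.0  # use the clean prediction as the label

x_adv = fgsm(x, y, eps=0.5)

print("clean class:", int(predict_proba(x) >= 0.5))
print("adv.  class:", int(predict_proba(x_adv) >= 0.5))
# A human-aware attack, as framed in the survey, must also keep the explanation
# plausible; for this linear toy the saliency map is unchanged (the trivial case).
print("saliency change (L2 norm):", np.linalg.norm(saliency(x_adv) - saliency(x)))

For non-linear models the explanation generally does change under such perturbations, which is precisely the tension that the attack paradigms reviewed in the paper must either exploit or conceal from the human assessor.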