Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans

Jon Vadillo, Roberto Santana, Jose A. Lozano
{"title":"Adversarial Attacks in Explainable Machine Learning: A Survey of Threats Against Models and Humans","authors":"Jon Vadillo, Roberto Santana, Jose A. Lozano","doi":"10.1002/widm.1567","DOIUrl":null,"url":null,"abstract":"Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out‐of‐distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios where a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors such as the type of problem, the user expertise or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.","PeriodicalId":501013,"journal":{"name":"WIREs Data Mining and Knowledge Discovery","volume":"54 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"WIREs Data Mining and Knowledge Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/widm.1567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this paper, we comprehensively review the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit explainable machine learning scenarios in which a human assesses not only the input and the output classification, but also the explanation of the model's decision. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment. Based on this framework, we provide a structured review of the diverse attack paradigms existing in this domain, identify current gaps and future research directions, and illustrate the main attack paradigms discussed. Furthermore, our framework considers a wide range of relevant yet often ignored factors, such as the type of problem, the user's expertise, or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). The intention of these contributions is to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
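As a rough illustration of the extended notion described in the abstract (not the paper's formal framework), the sketch below attacks a classifier and its explanation jointly: the perturbation is optimized to change the predicted class while keeping a gradient-based saliency map close to the clean one, so that a human who inspects the input, the prediction, and the explanation is less likely to notice the manipulation. The PyTorch model interface, the use of input-gradient saliency as the explanation method, and all hyperparameters (`eps`, `alpha`, `lam`, `steps`) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def input_gradient_saliency(model, x, label):
    """Plain input-gradient saliency map, used here as a simple stand-in
    for the explanation the human assesses."""
    x = x if x.requires_grad else x.clone().requires_grad_(True)
    logits = model(x)
    grad = torch.autograd.grad(logits[0, label], x)[0]
    return grad.abs()


def explanation_preserving_attack(model, x, label, eps=0.03, alpha=0.005,
                                  steps=40, lam=10.0):
    """PGD-style sketch of the extended attack goal: flip the classifier's
    prediction while keeping the saliency map close to the clean one, so the
    explanation still looks plausible to a human assessor.

    Assumes `model` maps a batch of shape (1, ...) to logits of shape
    (1, num_classes) and that `label` is the clean predicted class.
    """
    target = torch.tensor([label])
    # Reference explanation for the clean input (kept fixed during the attack).
    ref_expl = input_gradient_saliency(model, x, label).detach()

    delta = torch.zeros_like(x)
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        adv = x + delta
        logits = model(adv)

        # (1) Misclassification term: push the prediction away from `label`.
        cls_loss = -F.cross_entropy(logits, target)
        # (2) Explanation term: keep the adversarial saliency map close to
        #     the clean one (needs second-order gradients, hence create_graph).
        adv_expl = torch.autograd.grad(logits[0, label], adv,
                                       create_graph=True)[0].abs()
        expl_loss = F.mse_loss(adv_expl, ref_expl)

        loss = cls_loss + lam * expl_loss
        grad = torch.autograd.grad(loss, delta)[0]
        # Signed descent step, projected onto the L-infinity ball of radius eps.
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps)

    return (x + delta).detach()
```

One practical caveat with this kind of joint objective: for purely ReLU-based networks the input-gradient explanation is piecewise constant in the input, so the explanation term is often optimized on a smoothed copy of the model (e.g., with softplus activations) in the literature; the sketch omits that detail for brevity.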