Effective data exploration through clustering of local attributive explanations
Elodie Escriva, Tom Lefrere, Manon Martin, Julien Aligon, Alexandre Chanson, Jean-Baptiste Excoffier, Nicolas Labroche, Chantal Soulé-Dupuy, Paul Monsarrat
Information Systems, Volume 127, Article 102464. Published 2024-09-28. DOI: 10.1016/j.is.2024.102464
https://www.sciencedirect.com/science/article/pii/S0306437924001224
Citations: 0
Abstract
Machine Learning (ML) has become an essential tool for modeling complex phenomena, offering robust predictions and comprehensive data analysis. Nevertheless, the lack of interpretability in these predictions often results in a closed-box effect, which the field of eXplainable Machine Learning (XML) aims to address. Local attributive XML methods, in particular, provide explanations by quantifying the contribution of each attribute to individual predictions; these contributions are referred to as influences. This type of explanation is the most fine-grained, as it focuses on each instance of the dataset and allows individual differences to be detected. Additionally, aggregating local explanations allows for a deeper analysis of the underlying data. In this context, influences can be considered a new data space in which to reveal and understand complex data patterns. We hypothesize that these influences, derived from ML explanations, are more informative than the original raw data, especially for identifying homogeneous groups within the data. To identify such groups effectively, we use a clustering approach. We compare clusters formed from the raw data against clusters formed from influences computed by various local attributive XML methods. Our findings reveal that clusters based on influences consistently outperform those based on raw data, even when the underlying models have low accuracy.
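As a rough illustration of the pipeline the abstract describes, the sketch below (hypothetical, not the authors' code) trains a classifier, derives a simple per-instance influence matrix via a mean-imputation occlusion heuristic standing in for the local attributive XML methods (e.g. SHAP or LIME) compared in the paper, and then contrasts k-means clusters built on the raw features with clusters built on the influences, scoring both against the known generating groups with the Adjusted Rand Index.

# Hypothetical sketch of clustering on influences vs. raw data; the
# occlusion-style influences are only a stand-in for the local attributive
# XML methods evaluated in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic data with known groups so that cluster quality can be scored.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_clusters_per_class=1, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-instance "influences": drop in the predicted probability of the positive
# class when each feature is replaced by its dataset mean (occlusion heuristic).
baseline = model.predict_proba(X)[:, 1]
influences = np.zeros_like(X, dtype=float)
means = X.mean(axis=0)
for j in range(X.shape[1]):
    X_occluded = X.copy()
    X_occluded[:, j] = means[j]
    influences[:, j] = baseline - model.predict_proba(X_occluded)[:, 1]

# Cluster in raw-data space and in influence space, then compare each
# partition with the ground-truth groups via the Adjusted Rand Index.
raw_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
infl_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(influences)

print("ARI, raw data:   ", adjusted_rand_score(y, raw_clusters))
print("ARI, influences: ", adjusted_rand_score(y, infl_clusters))

When no reference grouping is available, an internal criterion such as the silhouette score can be used instead to compare the two cluster sets.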
Journal overview:
Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems.
Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special-purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and the Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models or performance enhancements, and show how those innovations contribute to the goals of the application.