A graph theoretic approach to assess quality of data for classification task

IF 2.7 | Zone 3 (Computer Science) | Q3 Computer Science, Artificial Intelligence | Data & Knowledge Engineering | Pub Date: 2025-07-01 | Epub Date: 2025-03-03 | DOI: 10.1016/j.datak.2025.102421

Payel Sadhukhan, Samrat Gupta

Data & Knowledge Engineering, Volume 158, Article 102421 (journal article).

Citations: 0

Abstract

The correctness of the predictions rendered by an AI/ML model is key to its acceptability. To foster researchers’ and practitioners’ confidence in a model, it is necessary to provide an intuitive understanding of how it works. In this work, we attempt to explain a model’s behavior by providing insights into the quality of its data. Revealing the training data to users is not feasible for logistical and security reasons. However, sharing some interpretable parameters of the training data and correlating them with the model’s performance can help. To this end, we propose a new measure based on the Euclidean Minimum Spanning Tree (EMST) for quantifying the intrinsic separation (or overlap) between the data classes. For experiments, we use datasets from diverse domains such as finance, medicine, and marketing. We use a state-of-the-art measure, the Davies-Bouldin Index (DBI), to validate our approach on four different datasets from the aforementioned domains. The experimental results establish the viability of the proposed approach in explaining the working and efficiency of a classifier. First, the proposed class-overlap measure correlates better with classification performance than DBI scores do. Second, the results on multi-class datasets demonstrate that the proposed measure can be used to determine feature importance and thereby learn a better classification model.
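The abstract does not spell out how the EMST-based measure is computed, so the following is only a rough sketch under a stated assumption: it uses the fraction of EMST edges that join points of different classes as a stand-in for class overlap. This proxy, and the function names `emst_edges`, `cross_class_edge_fraction`, and `davies_bouldin_index`, are illustrative choices and not the authors' definition. The DBI function follows the standard textbook formula (mean over classes of the worst ratio of summed within-class scatter to centroid distance).

```python
import math


def emst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns the n-1 edges of a Euclidean minimum spanning tree
    as (i, j, distance) tuples. Runs in O(n^2) time.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from the tree to each node
    best_from = [-1] * n        # tree node that provides that link
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]),
                key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u, best_dist[u]))
        # Relax the links from the new tree node to the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def cross_class_edge_fraction(points, labels):
    """ASSUMED overlap proxy: share of EMST edges whose endpoints differ in label.

    Values near 0 suggest well-separated classes; values near 1 suggest
    heavily interleaved classes.
    """
    edges = emst_edges(points)
    if not edges:
        return 0.0
    cross = sum(1 for i, j, _ in edges if labels[i] != labels[j])
    return cross / len(edges)


def davies_bouldin_index(points, labels):
    """Standard DBI using class labels as clusters (lower means better separated).

    Needs at least two classes whose centroids do not coincide.
    """
    classes = sorted(set(labels))
    centroids, scatters = {}, {}
    for c in classes:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids[c] = centroid
        scatters[c] = sum(math.dist(p, centroid) for p in members) / len(members)
    worst_ratios = [
        max((scatters[a] + scatters[b]) / math.dist(centroids[a], centroids[b])
            for b in classes if b != a)
        for a in classes
    ]
    return sum(worst_ratios) / len(classes)


# Toy data (made up for illustration): two distant squares vs. interleaved points.
separated = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
separated_labels = [0, 0, 0, 0, 1, 1, 1, 1]
interleaved = [(0,), (1,), (2,), (3,), (4,), (5,)]
interleaved_labels = [0, 1, 0, 1, 0, 1]

for name, pts, labs in [("separated", separated, separated_labels),
                        ("interleaved", interleaved, interleaved_labels)]:
    print(name, cross_class_edge_fraction(pts, labs), davies_bouldin_index(pts, labs))
```

In the separated toy set, only one of the seven tree edges bridges the two classes, whereas every tree edge in the interleaved set does, and the DBI moves in the same direction. Running either function on a single feature or a subset of features would be one possible way to explore the feature-importance use mentioned in the abstract, though the abstract does not say how the authors do this.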


Source journal

Data & Knowledge Engineering (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 5.00

Self-citation rate: 0.00%

Review time: 6 months

About the journal: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between data engineering and knowledge engineering, its two related fields of interest. DKE reaches a worldwide audience of researchers, designers, managers, and users. The major aim of the journal is to identify, investigate, and analyze the underlying principles in the design and effective use of data and knowledge systems.