Adrian Derstroff, Simon Leistikow, Ali Nahardani, Katja Gruen, Marcus Franz, Verena Hoerr, Lars Linsen
{"title":"Interactive visual formula composition of multidimensional data classifiers","authors":"Adrian Derstroff, Simon Leistikow, Ali Nahardani, Katja Gruen, Marcus Franz, Verena Hoerr, Lars Linsen","doi":"10.1177/14738716241270288","DOIUrl":null,"url":null,"abstract":"Understanding how a classification result is generated and what role individual features play in the classification is crucial in many applications and, in particular, in medical contexts such as the translation of diagnosis biomarkers into clinical practice. The goal is to find (ideally simple) relationships between the features in multi-dimensional data and the classification for an explanation of the underlying phenomenon. Mathematical formulas allow for the expression of these relationships and can serve as classifiers. However, there are infinitely many mathematical formulas for the given features and they bear an inherent trade-off between complexity and accuracy. We present an interactive visual approach that supports domain experts to mitigate the trade-off issue. Core to our approach is a novel feature selection method, from which formulas are composed using symbolic regression and where state-of-the-art classifiers serve as a reference. To evaluate our approach and compare the achieved classification performance to the performance achieved by other state-of-the-art feature selection techniques, we test our methods with well-known machine learning data sets. Our evaluation shows that our feature selection method performs better than randomly selecting features for data sets with many features or when a low number of generations in the symbolic regression is required. Moreover, it consistently matches or outperforms state-of-the-art methods. Moreover, we apply our approach in a case study to a hemodynamic cohort data set, where we report our findings and domain expert feedback. Our approach was able to find formulas containing features that are in agreement with literature. Also, we could find formulas that performed better in the micro-averaged F1 score when compared to established histological indices.","PeriodicalId":50360,"journal":{"name":"Information Visualization","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Visualization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1177/14738716241270288","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Understanding how a classification result is generated and what role individual features play in the classification is crucial in many applications and, in particular, in medical contexts such as the translation of diagnosis biomarkers into clinical practice. The goal is to find (ideally simple) relationships between the features in multi-dimensional data and the classification for an explanation of the underlying phenomenon. Mathematical formulas allow for the expression of these relationships and can serve as classifiers. However, there are infinitely many mathematical formulas for the given features and they bear an inherent trade-off between complexity and accuracy. We present an interactive visual approach that supports domain experts to mitigate the trade-off issue. Core to our approach is a novel feature selection method, from which formulas are composed using symbolic regression and where state-of-the-art classifiers serve as a reference. To evaluate our approach and compare the achieved classification performance to the performance achieved by other state-of-the-art feature selection techniques, we test our methods with well-known machine learning data sets. Our evaluation shows that our feature selection method performs better than randomly selecting features for data sets with many features or when a low number of generations in the symbolic regression is required. Moreover, it consistently matches or outperforms state-of-the-art methods. Moreover, we apply our approach in a case study to a hemodynamic cohort data set, where we report our findings and domain expert feedback. Our approach was able to find formulas containing features that are in agreement with literature. Also, we could find formulas that performed better in the micro-averaged F1 score when compared to established histological indices.
了解分类结果是如何产生的,以及各个特征在分类过程中发挥了什么作用,这在许多应用中都至关重要,尤其是在医疗领域,例如将诊断生物标记物转化为临床实践。我们的目标是找到多维数据中的特征与分类之间的(理想情况下是简单的)关系,以解释潜在的现象。数学公式可以表达这些关系,并可作为分类器。然而,给定特征的数学公式无穷无尽,它们在复杂性和准确性之间存在固有的权衡。我们提出了一种支持领域专家的交互式可视化方法,以缓解权衡问题。我们的方法的核心是一种新颖的特征选择方法,使用符号回归法组成公式,并以最先进的分类器作为参考。为了评估我们的方法,并将其分类性能与其他最先进的特征选择技术进行比较,我们用著名的机器学习数据集测试了我们的方法。评估结果表明,对于特征较多的数据集或需要较少代数的符号回归时,我们的特征选择方法比随机选择特征的方法性能更好。而且,它的性能始终与最先进的方法相匹配或更胜一筹。此外,我们还在血液动力学队列数据集的案例研究中应用了我们的方法,并报告了我们的发现和领域专家的反馈意见。我们的方法能够找到包含与文献一致的特征的公式。此外,与已有的组织学指数相比,我们还能找到在微观平均 F1 分数上表现更好的公式。
期刊介绍:
Information Visualization is essential reading for researchers and practitioners of information visualization and is of interest to computer scientists and data analysts working on related specialisms. This journal is an international, peer-reviewed journal publishing articles on fundamental research and applications of information visualization. The journal acts as a dedicated forum for the theories, methodologies, techniques and evaluations of information visualization and its applications.
The journal is a core vehicle for developing a generic research agenda for the field by identifying and developing the unique and significant aspects of information visualization. Emphasis is placed on interdisciplinary material and on the close connection between theory and practice.
This journal is a member of the Committee on Publication Ethics (COPE).