Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces

Felix Lanfermann, Sebastian Schmitt, Patricia Wollstadt
Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), 2022-11-01
DOI: 10.1109/ICDMW58026.2022.00032 (https://doi.org/10.1109/ICDMW58026.2022.00032)
Citations: 1

Abstract

Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims to find non-overlapping groups of design instances that are similar in the joint space of all features and remain similar when only subsets of features are considered. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well-separated groups of data instances, but also provide non-overlapping groups of data that persist when predefined feature subsets are considered separately. In this work, we propose to view concept identification as a special form of clustering with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in the identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant feature subsets. To support this novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms, and are thus more suitable to support a decision maker.
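
The abstract does not spell out how the mutual information measure is computed, so the following is only a minimal sketch of one plausible reading: cluster the same design instances separately in two feature subsets, then compute the mutual information between the two label assignments. High values suggest that the groups persist across subsets, and a value near zero suggests they do not. The function name `mutual_information` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same samples")
    n = len(labels_a)
    count_a = Counter(labels_a)                  # cluster sizes in subset A
    count_b = Counter(labels_b)                  # cluster sizes in subset B
    count_ab = Counter(zip(labels_a, labels_b))  # joint cluster co-occurrences
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical labels for six designs, clustered separately in two feature subsets.
labels_parameters = [0, 0, 1, 1, 2, 2]
labels_performance_consistent = [1, 1, 2, 2, 0, 0]  # same groups, renamed
labels_performance_mixed = [0, 1, 0, 1, 0, 1]       # groups cut across each other

consistent = mutual_information(labels_parameters, labels_performance_consistent)
mixed = mutual_information(labels_parameters, labels_performance_mixed)
print(f"consistent: {consistent:.4f}, mixed: {mixed:.4f}")
```

Because mutual information is invariant to how clusters are named, the renamed but identical grouping scores as fully consistent, while the grouping that cuts across the first one scores zero. A normalized variant may be preferable when the subsets yield different numbers of clusters; whether the paper normalizes is not stated here.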