A practical approach to novel class discovery in tabular data

Impact Factor: 2.8 | Zone 3, Computer Science | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Data Mining and Knowledge Discovery | Pub Date: 2024-05-31 | DOI: 10.1007/s10618-024-01025-y
Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine
Citations: 0

Abstract

The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
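
The tuning idea described in the abstract, hiding some of the known classes in each cross-validation fold, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the known classes are split into disjoint groups with one group hidden per fold, which the abstract does not specify. The names `hidden_class_folds`, `select_hyperparameters`, and `score_fn` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence


def hidden_class_folds(
    labels: Sequence[int], n_folds: int, seed: int = 0
) -> List[Dict[str, List[int]]]:
    """Build folds in which a subset of the known classes is hidden.

    Assumption (not from the paper): the known classes are shuffled and
    split into `n_folds` disjoint groups, and each fold hides one group.
    """
    classes = sorted(set(labels))
    if n_folds < 2 or n_folds > len(classes):
        raise ValueError("n_folds must be between 2 and the number of classes")

    rng = random.Random(seed)
    rng.shuffle(classes)
    # Round-robin assignment of classes to groups.
    groups = [classes[i::n_folds] for i in range(n_folds)]

    folds = []
    for hidden in groups:
        hidden_set = set(hidden)
        folds.append(
            {
                "hidden_classes": sorted(hidden),
                # Samples whose labels stay visible to the NCD method.
                "labeled_idx": [i for i, y in enumerate(labels) if y not in hidden_set],
                # Samples treated as "novel": labels are withheld during
                # training and used only to score the resulting partition.
                "unlabeled_idx": [i for i, y in enumerate(labels) if y in hidden_set],
            }
        )
    return folds


def select_hyperparameters(
    folds: List[Dict[str, List[int]]],
    candidates: Sequence,
    score_fn: Callable[[object, Dict[str, List[int]]], float],
):
    """Return the candidate with the highest mean score across folds.

    `score_fn(candidate, fold)` is supplied by the caller. It is expected to
    train an NCD method with `candidate` on the fold and return a clustering
    score computed on the hidden classes (higher is better).
    """

    def mean_score(candidate) -> float:
        return sum(score_fn(candidate, fold) for fold in folds) / len(folds)

    return max(candidates, key=mean_score)
```

Because the hidden classes come from the labeled set, their ground-truth labels are available for scoring, so no label from the truly novel classes is needed to pick hyperparameters.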

Source journal: Data Mining and Knowledge Discovery
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 10.40
Self-citation rate: 4.20%
Articles published: 68
Review time: 10 months
About the journal: Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.