基于图拓扑的可扩展归纳式半监督分类器与样本加权法

IF 4 3区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS ACM Transactions on Knowledge Discovery from Data Pub Date : 2024-01-30 DOI:10.1145/3643645

Fadi Dornaika, Zoulfikar Ibrahim, Alirezah Bosaghzadeh

{"title":"基于图拓扑的可扩展归纳式半监督分类器与样本加权法","authors":"Fadi Dornaika, Zoulfikar Ibrahim, Alirezah Bosaghzadeh","doi":"10.1145/3643645","DOIUrl":null,"url":null,"abstract":"<p>Recently, graph-based semi-supervised learning (GSSL) has garnered significant interest in the realms of machine learning and pattern recognition. Although some of the proposed methods have made some progress, there are still some shortcomings that need to be overcome. There are three main limitations. First, the graphs used in these approaches are usually predefined regardless of the task at hand. Second, due to the use of graphs, almost all approaches are unable to process and consider data with a very large number of unlabeled samples. Thirdly, the imbalance of the topology of the samples is very often not taken into account. In particular, processing large datasets with GSSL might pose challenges in terms of computational resource feasibility. In this paper, we present a scalable and inductive GSSL method. We broaden the scope of the graph topology imbalance paradigm to extensive databases. Second, we employ the calculated weights of the labeled sample for the label-matching term in the global objective function. This leads to a unified, scalable, semi-supervised learning model that allows simultaneous labeling of unlabeled data, projection of the feature space onto the labeling space, along with the graph matrix of anchors. In the proposed scheme, the integration of labels and features from anchors is applied for the adaptive construction of the anchor graph. Experimental results were performed on four large databases: NORB, RCV1, Covtype, and MNIST. These experiments demonstrate that the proposed method exhibits superior performance when compared to existing scalable semi-supervised learning models.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"8 1","pages":""},"PeriodicalIF":4.0000,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scalable and Inductive Semi-supervised Classifier with Sample Weighting Based on Graph Topology\",\"authors\":\"Fadi Dornaika, Zoulfikar Ibrahim, Alirezah Bosaghzadeh\",\"doi\":\"10.1145/3643645\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Recently, graph-based semi-supervised learning (GSSL) has garnered significant interest in the realms of machine learning and pattern recognition. Although some of the proposed methods have made some progress, there are still some shortcomings that need to be overcome. There are three main limitations. First, the graphs used in these approaches are usually predefined regardless of the task at hand. Second, due to the use of graphs, almost all approaches are unable to process and consider data with a very large number of unlabeled samples. Thirdly, the imbalance of the topology of the samples is very often not taken into account. In particular, processing large datasets with GSSL might pose challenges in terms of computational resource feasibility. In this paper, we present a scalable and inductive GSSL method. We broaden the scope of the graph topology imbalance paradigm to extensive databases. Second, we employ the calculated weights of the labeled sample for the label-matching term in the global objective function. This leads to a unified, scalable, semi-supervised learning model that allows simultaneous labeling of unlabeled data, projection of the feature space onto the labeling space, along with the graph matrix of anchors. In the proposed scheme, the integration of labels and features from anchors is applied for the adaptive construction of the anchor graph. Experimental results were performed on four large databases: NORB, RCV1, Covtype, and MNIST. These experiments demonstrate that the proposed method exhibits superior performance when compared to existing scalable semi-supervised learning models.</p>\",\"PeriodicalId\":49249,\"journal\":{\"name\":\"ACM Transactions on Knowledge Discovery from Data\",\"volume\":\"8 1\",\"pages\":\"\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2024-01-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Knowledge Discovery from Data\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3643645\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Knowledge Discovery from Data","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3643645","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

最近，基于图的半监督学习（GSSL）在机器学习和模式识别领域引起了极大的兴趣。尽管一些提出的方法取得了一些进展，但仍有一些不足之处需要克服。主要有三个局限性。首先，这些方法中使用的图形通常是预定义的，与手头的任务无关。其次，由于使用图形，几乎所有方法都无法处理和考虑具有大量未标记样本的数据。第三，样本拓扑结构的不平衡往往没有被考虑在内。特别是，使用 GSSL 处理大型数据集可能会给计算资源的可行性带来挑战。在本文中，我们提出了一种可扩展的归纳 GSSL 方法。我们将图拓扑不平衡范例的范围扩大到了广泛的数据库。其次，我们在全局目标函数的标签匹配项中使用了计算得出的标签样本权重。这就产生了一种统一的、可扩展的半监督学习模型，它允许同时对未标记数据进行标记、将特征空间投影到标记空间以及锚点图矩阵。在所提出的方案中，锚点的标签和特征整合被应用于锚点图的自适应构建。实验结果在四个大型数据库中进行了验证：NORB、RCV1、Covtype 和 MNIST。这些实验表明，与现有的可扩展半监督学习模型相比，所提出的方法表现出更优越的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Scalable and Inductive Semi-supervised Classifier with Sample Weighting Based on Graph Topology

Recently, graph-based semi-supervised learning (GSSL) has garnered significant interest in the realms of machine learning and pattern recognition. Although some of the proposed methods have made some progress, there are still some shortcomings that need to be overcome. There are three main limitations. First, the graphs used in these approaches are usually predefined regardless of the task at hand. Second, due to the use of graphs, almost all approaches are unable to process and consider data with a very large number of unlabeled samples. Thirdly, the imbalance of the topology of the samples is very often not taken into account. In particular, processing large datasets with GSSL might pose challenges in terms of computational resource feasibility. In this paper, we present a scalable and inductive GSSL method. We broaden the scope of the graph topology imbalance paradigm to extensive databases. Second, we employ the calculated weights of the labeled sample for the label-matching term in the global objective function. This leads to a unified, scalable, semi-supervised learning model that allows simultaneous labeling of unlabeled data, projection of the feature space onto the labeling space, along with the graph matrix of anchors. In the proposed scheme, the integration of labels and features from anchors is applied for the adaptive construction of the anchor graph. Experimental results were performed on four large databases: NORB, RCV1, Covtype, and MNIST. These experiments demonstrate that the proposed method exhibits superior performance when compared to existing scalable semi-supervised learning models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Knowledge Discovery from Data COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

6.70

自引率

5.60%

发文量

172

审稿时长

3 months

期刊介绍： TKDD welcomes papers on a full range of research in the knowledge discovery and analysis of diverse forms of data. Such subjects include, but are not limited to: scalable and effective algorithms for data mining and big data analysis, mining brain networks, mining data streams, mining multi-media data, mining high-dimensional data, mining text, Web, and semi-structured data, mining spatial and temporal data, data mining for community generation, social network analysis, and graph structured data, security and privacy issues in data mining, visual, interactive and online data mining, pre-processing and post-processing for data mining, robust and scalable statistical methods, data mining languages, foundations of data mining, KDD framework and process, and novel applications and infrastructures exploiting data mining technology including massively parallel processing and cloud computing platforms. TKDD encourages papers that explore the above subjects in the context of large distributed networks of computers, parallel or multiprocessing computers, or new data devices. TKDD also encourages papers that describe emerging data mining applications that cannot be satisfied by the current data mining technology.