High-utility itemset (HUI) mining is an emerging area of data mining that discovers sets of items generating high profit in transactional datasets. In recent years, several algorithms have been proposed for this task. However, most of them consider neither the on-shelf time periods of items nor items with negative utility. High on-shelf utility itemset (HOUI) mining is more difficult than traditional HUI mining because it must handle on-shelf time periods and negative item utilities. Moreover, most algorithms require a minimum utility threshold (min_util) to find the itemsets of interest, and specifying an appropriate min_util is difficult for users: a threshold set too low may generate too many itemsets, while one set too high may generate too few, and either case can degrade performance. To address these issues, a novel top-k HOUI mining algorithm named TKOS (Top-K high On-Shelf utility itemsets miner) is proposed, which considers on-shelf time periods and negative utility. TKOS introduces a novel branch-and-bound-based strategy to raise the internal min_util threshold efficiently, along with two pruning strategies to speed up the mining process. To reduce the dataset scanning cost, we utilize transaction merging and dataset projection techniques. Extensive experiments have been conducted on real and synthetic datasets with various characteristics. The results show that the proposed algorithm outperforms state-of-the-art algorithms: it is up to 42 times faster and uses up to 19 times less memory than the state-of-the-art KOSHU. Moreover, the proposed algorithm has excellent scalability with respect to the number of time periods and the number of transactions.
{"title":"Mining top-k high on-shelf utility itemsets using novel threshold raising strategies","authors":"Kuldeep Singh, Bhaskar Biswas","doi":"10.1145/3645115","DOIUrl":"https://doi.org/10.1145/3645115","url":null,"abstract":"<p>High utility itemsets (HUIs) mining is an emerging area of data mining which discovers sets of items generating a high profit from transactional datasets. In recent years, several algorithms have been proposed for this task. However, most of them do not consider the on-shelf time period of items and negative utility of items. High on-shelf utility itemset (HOUIs) mining is more difficult than traditional HUIs mining because it deals with on-shelf based time period and negative utility of items. Moreover, most algorithms need minimum utility threshold ((min_util )) to find rules. However, specifying the appropriate (min_util ) threshold is a difficult problem for users. A smaller (min_util ) threshold may generate too many rules and a higher one may generate a few rules, which can degrade performance. To address these issues, a novel top-k HOUIs mining algorithm named TKOS (<b>T</b>op-<b>K</b> high <b>O</b>n-<b>S</b>helf utility itemsets miner) is proposed which considers on-shelf time period and negative utility. TKOS presents a novel branch and bound based strategy to raise the internal (min_util ) threshold efficiently. It also presents two pruning strategies to speed up the mining process. In order to reduce the dataset scanning cost, we utilize transaction merging and dataset projection techniques. Extensive experiments have been conducted on real and synthetic datasets having various characteristics. Experimental results show that the proposed algorithm outperforms the state-of-the-art algorithms. The proposed algorithm is up to 42 times faster and uses up-to 19 times less memory compared to the state-of-the-art KOSHU. Moreover, the proposed algorithm has excellent scalability in terms of time periods and the number of transactions.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"175 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139762912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, Graph Neural Networks (GNNs) have achieved unprecedented success in handling graph-structured data, thereby driving the development of numerous GNN-oriented techniques for inductive knowledge graph completion (KGC). A key limitation of existing methods, however, is their dependence on pre-defined aggregation functions, which lack the adaptability to diverse data, resulting in suboptimal performance on established benchmarks. Another challenge arises from the exponential increase in irrelevant entities as the reasoning path lengthens, introducing unwarranted noise and consequently diminishing the model’s generalization capabilities. To surmount these obstacles, we design an innovative framework that synergizes Multi-Level Sampling with an Adaptive Aggregation mechanism (MLSAA). Distinctively, our model couples GNNs with enhanced set transformers, enabling dynamic selection of the most appropriate aggregation function tailored to specific datasets and tasks. This adaptability significantly boosts both the model’s flexibility and its expressive capacity. Additionally, we unveil a unique sampling strategy designed to selectively filter irrelevant entities while retaining potentially beneficial targets throughout the reasoning process. We undertake an exhaustive evaluation of our novel inductive KGC method across three pivotal benchmark datasets, and the experimental results corroborate the efficacy of MLSAA.
{"title":"Incorporating Multi-Level Sampling with Adaptive Aggregation for Inductive Knowledge Graph Completion","authors":"Kai Sun, Huajie Jiang, Yongli Hu, Baocai Yin","doi":"10.1145/3644822","DOIUrl":"https://doi.org/10.1145/3644822","url":null,"abstract":"<p>In recent years, Graph Neural Networks (GNNs) have achieved unprecedented success in handling graph-structured data, thereby driving the development of numerous GNN-oriented techniques for inductive knowledge graph completion (KGC). A key limitation of existing methods, however, is their dependence on pre-defined aggregation functions, which lack the adaptability to diverse data, resulting in suboptimal performance on established benchmarks. Another challenge arises from the exponential increase in irrelated entities as the reasoning path lengthens, introducing unwarranted noise and consequently diminishing the model’s generalization capabilities. To surmount these obstacles, we design an innovative framework that synergizes <b>M</b>ulti-<b>L</b>evel <b>S</b>ampling with an <b>A</b>daptive <b>A</b>ggregation mechanism (MLSAA). Distinctively, our model couples GNNs with enhanced set transformers, enabling dynamic selection of the most appropriate aggregation function tailored to specific datasets and tasks. This adaptability significantly boosts both the model’s flexibility and its expressive capacity. Additionally, we unveil a unique sampling strategy designed to selectively filter irrelevant entities, while retaining potentially beneficial targets throughout the reasoning process. We undertake an exhaustive evaluation of our novel inductive KGC method across three pivotal benchmark datasets and the experimental results corroborate the efficacy of MLSAA.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"4 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139763230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Early classification of longitudinal data remains an active area of research today. The complexity of these datasets and the high rates of missing data caused by irregular sampling present data-level challenges for the Early Longitudinal Data Classification (ELDC) problem. Coupled with the algorithmic challenge of optimising the opposing objectives of early classification (i.e., earliness and accuracy), ELDC becomes a non-trivial task. Inspired by the generative power and utility of the Generative Adversarial Network (GAN), we propose a novel context-conditional, longitudinal early classifier GAN (LEC-GAN). This model utilises informative missingness, static features, and earlier observations to improve the ELDC objective. It achieves this by incorporating ELDC as an auxiliary task within an imputation optimization process. Our experiments on several datasets demonstrate that LEC-GAN outperforms all relevant baselines in terms of F1 scores while increasing the earliness of prediction.
{"title":"Conditional Generative Adversarial Network for Early Classification of Longitudinal Datasets using an Imputation Approach","authors":"Sharon Torao Pingi, Richi Nayak, Md Abul Bashar","doi":"10.1145/3644821","DOIUrl":"https://doi.org/10.1145/3644821","url":null,"abstract":"<p>Early classification of longitudinal data remains an active area of research today. The complexity of these datasets and the high rates of missing data caused by irregular sampling present data-level challenges for the Early Longitudinal Data Classification (ELDC) problem. Coupled with the algorithmic challenge of optimising the opposing objectives of early classification (i.e., earliness and accuracy), ELDC becomes a non-trivial task. Inspired by the generative power and utility of the Generative Adversarial Network (GAN), we propose a novel context-conditional, longitudinal early classifier GAN (LEC-GAN). This model utilises informative missingness, static features, and earlier observations to improve the ELDC objective. It achieves this by incorporating ELDC as an auxiliary task within an imputation optimization process. Our experiments on several datasets demonstrate that LEC-GAN outperforms all relevant baselines in terms of F1 scores while increasing the earliness of prediction.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"37 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139763107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-instance learning (MIL) is a popular learning paradigm arising from many real applications. It assigns a label to a set of instances, called a bag, and the bag’s label is determined by the instances within it. A bag is positive if and only if it contains at least one positive instance. Since labeling bags is more complicated than labeling each instance, mislabeling is a common problem in MIL. Furthermore, it is more common for a negative bag to be mislabeled as a positive one, since a single mislabeled instance changes the label of the whole bag. This is an important problem arising in real applications such as web mining and image classification, but, to the best of our knowledge, little research has concentrated on it. In this paper, we focus on the MIL problem with one-side label noise, in which negative bags are mislabeled as positive ones. To address this challenging problem, we propose a novel multi-instance learning method with One Side Label Noise (OSLN). We design a new double-weighting approach under the traditional framework to characterize the ’faithfulness’ of each instance and each bag in learning the classifier. Briefly, on the instance level, we employ a sparse weighting method to select the key instances, converting the MIL problem with one-side label noise into a mislabeled supervised learning scenario. On the bag level, the weights of bags, together with the selected key instances, are utilized to identify the real positive bags. In addition, we solve the proposed model with an alternating iteration method with proven convergence behavior. Empirical studies on various datasets have validated the effectiveness of our method.
{"title":"Multi-Instance Learning with One Side Label Noise","authors":"Tianxiang Luan, Shilin Gu, Xijia Tang, Wenzhang Zhuge, Chenping Hou","doi":"10.1145/3644076","DOIUrl":"https://doi.org/10.1145/3644076","url":null,"abstract":"<p>Multi-instance Learning (MIL) is a popular learning paradigm arising from many real applications. It assigns a label to a set of instances, named as a bag, and the bag’s label is determined by the instances within it. A bag is positive if and only if it has at least one positive instance. Since labeling bags is more complicated than labeling each instance, we will often face the mislabeling problem in MIL. Furthermore, it is more common that a negative bag has been mislabeled to a positive one since one mislabeled instance will lead to the change of the whole bag label. This is an important problem that originated from real applications, e.g., web mining and image classification, but little research has concentrated on it as far as we know. In this paper, we focus on this MIL problem with one side label noise that the negative bags are mislabeled as positive ones. To address this challenging problem, we propose a novel multi-instance learning method with One Side Label Noise (OSLN). We design a new double weighting approach under traditional framework to characterize the ’faithfulness’ of each instance and each bag in learning the classifier. Briefly, on the instance level, we employ a sparse weighting method to select the key instances, and the MIL problem with one size label noise is converted to a mislabeled supervised learning scenario. On the bag level, the weights of bags, together with the selected key instances, will be utilized to identify the real positive bags. In addition, we have solved our proposed model by an alternative iteration method with proved convergence behavior. Empirical studies on various datasets have validated the effectiveness of our method.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"125 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139763102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A formidable challenge in the multi-label text classification (MLTC) context is that the labels often exhibit a long-tailed distribution, which typically prevents deep MLTC models from obtaining satisfactory performance. To alleviate this problem, most existing solutions attempt to improve tail performance by means of sampling or introducing extra knowledge. Data-rich labels, though more trustworthy, have not received the attention they deserve. In this work, we propose a multiple-stage training framework to exploit both model- and feature-level knowledge from the head labels, to improve both the representation and generalization ability of MLTC models. Moreover, we theoretically prove the superiority of our framework design over other alternatives. Comprehensive experiments on widely-used MLTC datasets clearly demonstrate that the proposed framework achieves highly superior results to state-of-the-art methods, highlighting the value of head labels in MLTC.
{"title":"On the Value of Head Labels in Multi-Label Text Classification","authors":"Haobo Wang, Cheng Peng, Hede Dong, Lei Feng, Weiwei Liu, Tianlei Hu, Ke Chen, Gang Chen","doi":"10.1145/3643853","DOIUrl":"https://doi.org/10.1145/3643853","url":null,"abstract":"<p>A formidable challenge in the multi-label text classification (MLTC) context is that the labels often exhibit a long-tailed distribution, which typically prevents deep MLTC models from obtaining satisfactory performance. To alleviate this problem, most existing solutions attempt to improve tail performance by means of sampling or introducing extra knowledge. Data-rich labels, though more trustworthy, have not received the attention they deserve. In this work, we propose a multiple-stage training framework to exploit both model- and feature-level knowledge from the head labels, to improve both the representation and generalization ability of MLTC models. Moreover, we theoretically prove the superiority of our framework design over other alternatives. Comprehensive experiments on widely-used MLTC datasets clearly demonstrate that the proposed framework achieves highly superior results to state-of-the-art methods, highlighting the value of head labels in MLTC.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"254 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139688962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Neural Networks (GNNs) have achieved promising performance in semi-supervised node classification in recent years. However, the problem of insufficient supervision, together with representation collapse, largely limits the performance of GNNs in this field. To alleviate the collapse of node representations in the semi-supervised scenario, we propose a novel graph contrastive learning method, termed Mixed Graph Contrastive Network (MGCN). In our method, we improve the discriminative capability of the latent embeddings through an interpolation-based augmentation strategy and a correlation reduction mechanism. Specifically, we first conduct the interpolation-based augmentation in the latent space and then force the prediction model to change linearly between samples. Second, we enable the learned network to tell apart samples across two interpolation-perturbed views by forcing the cross-view correlation matrix to approximate an identity matrix. By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning. Extensive experimental results on six datasets demonstrate the effectiveness and generality of MGCN compared to existing state-of-the-art methods. The code of MGCN is available on GitHub at https://github.com/xihongyang1999/MGCN.
{"title":"Mixed Graph Contrastive Network for Semi-Supervised Node Classification","authors":"Xihong Yang, Yiqi Wang, Yue Liu, Yi Wen, Lingyuan Meng, Sihang Zhou, Xinwang Liu, En Zhu","doi":"10.1145/3641549","DOIUrl":"https://doi.org/10.1145/3641549","url":null,"abstract":"<p>Graph Neural Networks (GNNs) have achieved promising performance in semi-supervised node classification in recent years. However, the problem of insufficient supervision, together with representation collapse, largely limits the performance of the GNNs in this field. To alleviate the collapse of node representations in semi-supervised scenario, we propose a novel graph contrastive learning method, termed <b>M</b>ixed <b>G</b>raph <b>C</b>ontrastive <b>N</b>etwork (MGCN). In our method, we improve the discriminative capability of the latent embeddings by an interpolation-based augmentation strategy and a correlation reduction mechanism. Specifically, we first conduct the interpolation-based augmentation in the latent space and then force the prediction model to change linearly between samples. Second, we enable the learned network to tell apart samples across two interpolation-perturbed views through forcing the correlation matrix across views to approximate an identity matrix. By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning. Extensive experimental results on six datasets demonstrate the effectiveness and the generality of MGCN compared to the existing state-of-the-art methods. The code of MGCN is available at https://github.com/xihongyang1999/MGCN on Github.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"33 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139763104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A networked time series (NETS) is a family of time series on a given graph, one for each node. It has a wide range of applications, from intelligent transportation and environment monitoring to smart grid management. An important task in such applications is to predict the future values of a NETS based on its historical values and the underlying graph. Most existing methods require complete data for training. However, in real-world scenarios, it is not uncommon to have missing data due to sensor malfunction, incomplete sensing coverage, etc. In this paper, we study the problem of NETS prediction with incomplete data. We propose NETS-ImpGAN, a novel deep learning framework that can be trained on incomplete data with missing values in both history and future. Furthermore, we propose Graph Temporal Attention Networks, which incorporate the attention mechanism to capture both inter-time-series and temporal correlations. We conduct extensive experiments on four real-world datasets under different missing patterns and missing rates. The experimental results show that NETS-ImpGAN outperforms existing methods, reducing the MAE by up to 25%.
{"title":"Networked Time Series Prediction with Incomplete Data via Generative Adversarial Network","authors":"Yichen Zhu, Bo Jiang, Haiming Jin, Mengtian Zhang, Feng Gao, Jianqiang Huang, Tao Lin, Xinbing Wang","doi":"10.1145/3643822","DOIUrl":"https://doi.org/10.1145/3643822","url":null,"abstract":"<p>A <i>networked time series (NETS)</i> is a family of time series on a given graph, one for each node. It has a wide range of applications from intelligent transportation, environment monitoring to smart grid management. An important task in such applications is to predict the future values of a NETS based on its historical values and the underlying graph. Most existing methods require complete data for training. However, in real-world scenarios, it is not uncommon to have missing data due to sensor malfunction, incomplete sensing coverage, etc. In this paper, we study the problem of <i>NETS prediction with incomplete data</i>. We propose NETS-ImpGAN, a novel deep learning framework that can be trained on incomplete data with missing values in both history and future. Furthermore, we propose <i>Graph Temporal Attention Networks</i>, which incorporate the attention mechanism to capture both inter-time series and temporal correlations. We conduct extensive experiments on four real-world datasets under different missing patterns and missing rates. The experimental results show that NETS-ImpGAN outperforms existing methods, reducing the MAE by up to 25%.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"111 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139755477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In graph classification, attention- and pooling-based graph neural networks (GNNs) predominate; they extract salient features from the input graph to support the prediction. They mostly follow the paradigm of “learning to attend”, which maximizes the mutual information between the attended graph and the ground-truth label. However, this paradigm causes GNN classifiers to indiscriminately absorb all statistical correlations between input features and labels in the training data, without distinguishing the causal and noncausal effects of features. Rather than emphasizing causal features, the attended graphs tend to rely on noncausal features as shortcuts to predictions. These shortcut features may easily change outside the training distribution, thereby leading to poor generalization for GNN classifiers. In this paper, we take a causal view on GNN modeling. Under our causal assumption, the shortcut feature serves as a confounder between the causal feature and the prediction. It misleads the classifier into learning spurious correlations that facilitate prediction in in-distribution (ID) test evaluation while causing a significant performance drop on out-of-distribution (OOD) test data. To address this issue, we employ the backdoor adjustment from causal theory, combining each causal feature with various shortcut features to identify causal patterns and mitigate the confounding effect. Specifically, we employ attention modules to estimate the causal and shortcut features of the input graph. Then, a memory bank collects the estimated shortcut features, enhancing the diversity of shortcut features available for combination. Simultaneously, we apply a prototype strategy to improve the consistency of intra-class causal features. We term our method CAL+, which can promote stable relationships between causal estimation and prediction, regardless of distribution changes. Extensive experiments on synthetic and real-world OOD benchmarks demonstrate our method’s effectiveness in improving OOD generalization. Our codes are released at https://github.com/shuyao-wang/CAL-plus.
{"title":"Enhancing Out-of-distribution Generalization on Graphs via Causal Attention Learning","authors":"Yongduo Sui, Wenyu Mao, Shuyao Wang, Xiang Wang, Jiancan Wu, Xiangnan He, Tat-Seng Chua","doi":"10.1145/3644392","DOIUrl":"https://doi.org/10.1145/3644392","url":null,"abstract":"<p>In graph classification, attention- and pooling-based graph neural networks (GNNs) predominate to extract salient features from the input graph and support the prediction. They mostly follow the paradigm of “learning to attend”, which maximizes the mutual information between the attended graph and the ground-truth label. However, this paradigm causes GNN classifiers to indiscriminately absorb all statistical correlations between input features and labels in the training data, without distinguishing the causal and noncausal effects of features. Rather than emphasizing causal features, the attended graphs tend to rely on noncausal features as shortcuts to predictions. These shortcut features may easily change outside the training distribution, thereby leading to poor generalization for GNN classifiers. In this paper, we take a causal view on GNN modeling. Under our causal assumption, the shortcut feature serves as a confounder between the causal feature and prediction. It misleads the classifier into learning spurious correlations that facilitate prediction in in-distribution (ID) test evaluation, while causing significant performance drop in out-of-distribution (OOD) test data. To address this issue, we employ the backdoor adjustment from causal theory — combining each causal feature with various shortcut features, to identify causal patterns and mitigate the confounding effect. Specifically, we employ attention modules to estimate the causal and shortcut features of the input graph. Then, a memory bank collects the estimated shortcut features, enhancing the diversity of shortcut features for combination. Simultaneously, we apply the prototype strategy to improve the consistency of intra-class causal features. We term our method as CAL+, which can promote stable relationships between causal estimation and prediction, regardless of distribution changes. Extensive experiments on synthetic and real-world OOD benchmarks demonstrate our method’s effectiveness in improving OOD generalization. Our codes are released at https://github.com/shuyao-wang/CAL-plus.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"18 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139755547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fair feature selection for classification decision tasks has recently garnered significant attention from researchers. However, existing fair feature selection algorithms fall short of providing a full explanation of the causal relationship between features and sensitive attributes, potentially impacting the accuracy of fair feature identification. To address this issue, we propose a Fair Causal Feature Selection algorithm, called FairCFS. Specifically, FairCFS constructs a localized causal graph that identifies the Markov blankets of the class and sensitive variables, in order to block the transmission of sensitive information when selecting fair causal features. Extensive experiments on seven public real-world datasets validate that FairCFS achieves accuracy comparable to eight state-of-the-art feature selection algorithms while delivering superior fairness.
{"title":"Fair Feature Selection: A Causal Perspective","authors":"Zhaolong Ling, Enqi Xu, Peng Zhou, Liang Du, Kui Yu, Xindong Wu","doi":"10.1145/3643890","DOIUrl":"https://doi.org/10.1145/3643890","url":null,"abstract":"<p>Fair feature selection for classification decision tasks has recently garnered significant attention from researchers. However, existing fair feature selection algorithms fall short of providing a full explanation of the causal relationship between features and sensitive attributes, potentially impacting the accuracy of fair feature identification. To address this issue, we propose a Fair Causal Feature Selection algorithm, called FairCFS. Specifically, FairCFS constructs a localized causal graph that identifies the Markov blankets of class and sensitive variables, to block the transmission of sensitive information for selecting fair causal features. Extensive experiments on seven public real-world datasets validate that FairCFS has comparable accuracy compared to eight state-of-the-art feature selection algorithms, while presenting more superior fairness.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"36 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139677909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weighting strategies prevail in machine learning. For example, a common approach in robust machine learning is to assign low weights to samples that are likely to be noisy or quite hard. This study summarizes another, less-explored strategy, namely perturbation. Various incarnations of perturbation have been utilized, but the strategy itself has not been explicitly characterized. Learning with perturbation is called perturbation learning, and a systematic taxonomy is constructed for it in this study. In our taxonomy, learning with perturbation is divided on the basis of the perturbation targets, directions, inference manners, and granularity levels. Many existing learning algorithms, including some classical ones, can be understood within the constructed taxonomy; in other words, these algorithms share the same component, namely perturbation, in their procedures. Furthermore, a family of new learning algorithms can be obtained by varying existing learning algorithms along the dimensions of our taxonomy. Specifically, three concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on image classification and text sentiment analysis verify the effectiveness of the three new algorithms. Learning with perturbation can also be used in various other learning scenarios, such as imbalanced learning, clustering, and regression.
{"title":"A Taxonomy for Learning with Perturbation and Algorithms","authors":"Rujing Yao, Ou Wu","doi":"10.1145/3644391","DOIUrl":"https://doi.org/10.1145/3644391","url":null,"abstract":"<p>Weighting strategy prevails in machine learning. For example, a common approach in robust machine learning is to exert low weights on samples which are likely to be noisy or quite hard. This study summarizes another less-explored strategy, namely, perturbation. Various incarnations of perturbation have been utilized but it has not been explicitly revealed. Learning with perturbation is called perturbation learning and a systematic taxonomy is constructed for it in this study. In our taxonomy, learning with perturbation is divided on the basis of the perturbation targets, directions, inference manners, and granularity levels. Many existing learning algorithms including some classical ones can be understood with the constructed taxonomy. Alternatively, these algorithms share the same component, namely, perturbation in their procedures. Furthermore, a family of new learning algorithms can be obtained by varying existing learning algorithms with our taxonomy. Specifically, three concrete new learning algorithms are proposed for robust machine learning. Extensive experiments on image classification and text sentiment analysis verify the effectiveness of the three new algorithms. Learning with perturbation can also be used in other various learning scenarios, such as imbalanced learning, clustering, regression, and so on.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"218 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139678288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}