Nowadays, more and more machine learning methods are applied in the medical domain. Supervised learning methods adopted in classification, prediction, and segmentation tasks for medical images often suffer decreased performance when the training and testing datasets do not follow the i.i.d. (independent and identically distributed) assumption. Such distribution shifts seriously affect the robustness, fairness, and trustworthiness of machine learning applications in the medical domain. Hence, in this paper, building on our previous work [14], we adopt the CycleGAN (cycle-consistent generative adversarial network) method to cycle-train CT (computed tomography) data from different scanners/manufacturers, aiming to eliminate the distribution shift caused by diverse data terminals. However, due to the mode collapse problem and the generative mechanism of the GAN-based model, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wise. This innovative combination of two generative models enhances the quality of the data provided while maintaining significant features. Meanwhile, we use paired medical images from five patients for the evaluation experiments, using SSIM (structural similarity index measure) metrics and a comparison of segmentation model performance. We conclude that CycleGAN is better utilized as an efficient data augmentation technique than as a method for eliminating distribution shift, whereas the denoising diffusion model is more suitable for dealing with the distribution shift caused by different terminal modules. In addition, another limitation of generative methods applied to medical images is the difficulty of obtaining large and diverse datasets that accurately capture the complexity and variability of biological structures. In future work, we will evaluate the original and generated datasets by experimenting with a broader range of supervised methods, and we will implement the generative methods under a federated learning architecture, which can preserve their benefits and eliminate the distribution shift problem on a broader scale.
{"title":"Voxel-wise Medical Images Generalization for Eliminating Distribution Shift","authors":"Feifei Li, Yuanbin Wang, Oya Beyan, Mirjam Schöneck, Liliana Lourenco Caldeira","doi":"10.1145/3643034","DOIUrl":"https://doi.org/10.1145/3643034","url":null,"abstract":"<p>Nowadays, more and more machine learning methods are applied in the medical domain. Supervised Learning methods adopted in classification, prediction, and segmentation tasks for medical images always experience decreased performance when the training and testing datasets do not follow the i.i.d(independent and identically distributed) assumption. These distribution shift situations seriously influence machine learning applications’ robustness, fairness, and trustworthiness in the medical domain. Hence, in this paper, we adopt the CycleGAN(Generative Adversarial Networks) method to cycle train the CT(Computer Tomography) data from different scanners/manufacturers, which aims to eliminate the distribution shift from diverse data terminals, on the basis of our previous work[14]. However, due to the model collapse problem and generative mechanisms of the GAN-based model, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wisely. This innovative combination of two generative models enhances the quality of data providers while maintaining significant features. Meanwhile, we use five paired patients’ medical images to deal with the evaluation experiments with SSIM(structural similarity index measure) metrics and the segmentation model’s performance comparison. We conclude that CycleGAN can be utilized as an efficient data augmentation technique rather than a distribution-shift-eliminating method. While the denoising diffusion model is more suitable for dealing with the distribution shift problem aroused by the different terminal modules. In addition, another limitation of generative methods applied in medical images is the difficulty in obtaining large and diverse datasets that accurately capture the complexity of biological structure and variability. In future works, we will evaluate the original and generative datasets by experimenting with a broader range of supervised methods. We will implement the generative methods under the federated learning architecture, which can preserve their benefits and eliminate the distribution shift problem in a broader range.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chuan-Wei Kuo, Bo-Yu Chen, Wen-Chih Peng, Chih-Chieh Hung, Hsin-Ning Su
In recent years, there has been a significant surge in commercial demand for citation graph-based tasks, such as patent analysis, social network analysis, and recommendation systems. Graph neural networks (GNNs) are widely used for these tasks due to their remarkable performance in capturing topological graph information. However, GNNs' outputs are highly dependent on the composition of local neighbors within the topological structure. To address this issue, we identify two types of neighbors in a citation graph: explicit neighbors based on the topological structure, and implicit neighbors based on node features. Our primary motivation is to clearly define and visualize these neighbors, emphasizing their importance in enhancing graph neural network performance. We propose a Correlation-aware Network (CNet) to re-organize the citation graph and learn more valuable informative representations by leveraging these implicit and explicit neighbors. Our approach aims to improve graph data augmentation and classification performance, with a primary focus on demonstrating the importance of using these neighbors, while also introducing a new graph data augmentation method. We compare CNet with state-of-the-art (SOTA) GNNs and other graph data augmentation approaches acting on GNNs. Extensive experiments demonstrate that CNet effectively extracts more valuable informative representations from the citation graph, significantly outperforming the baselines. The code is publicly available on GitHub.¹
{"title":"Correlation-Aware Graph Data Augmentation with Implicit and Explicit Neighbors","authors":"Chuan-Wei Kuo, Bo-Yu Chen, Wen-Chih Peng, Chih-Chieh Hung, Hsin-Ning Su","doi":"10.1145/3638057","DOIUrl":"https://doi.org/10.1145/3638057","url":null,"abstract":"<p>In recent years, there has been a significant surge in commercial demand for citation graph-based tasks, such as patent analysis, social network analysis, and recommendation systems. Graph Neural Networks (GNNs) are widely used for these tasks due to their remarkable performance in capturing topological graph information. However, GNNs’ output results are highly dependent on the composition of local neighbors within the topological structure. To address this issue, we identify two types of neighbors in a citation graph: explicit neighbors based on the topological structure, and implicit neighbors based on node features. Our primary motivation is to clearly define and visualize these neighbors, emphasizing their importance in enhancing graph neural network performance. We propose a Correlation-aware Network (CNet) to re-organize the citation graph and learn more valuable informative representations by leveraging these implicit and explicit neighbors. Our approach aims to improve graph data augmentation and classification performance, with the majority of our focus on stating the importance of using these neighbors, while also introducing a new graph data augmentation method. We compare CNet with state-of-the-art (SOTA) GNNs and other graph data augmentation approaches acting on GNNs. Extensive experiments demonstrate that CNet effectively extracts more valuable informative representations from the citation graph, significantly outperforming baselines. The code is available on public GitHub\u0000<sup>1</sup>.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"27 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Zhang, Yong Liu, Zhiwei Zeng, Yiming Cao, Xingyu Wu, Yonghui Xu, Zhiqi Shen, Lizhen Cui
Accurately estimating packages’ arrival time in e-commerce can enhance users’ shopping experience and improve the placement rate of products. This problem is often formalized as an Origin-Destination (OD)-based ETA (i.e., estimated time of arrival) prediction task, where the delivery time is estimated mainly based on sender and receiver addresses and other context information. One inherent challenge of the OD-based ETA problem is that the delivery time highly depends on the actual delivery trajectory, which is unknown at the time of prediction. In this paper, we tackle this challenge by effectively exploiting historical delivery trajectories. We propose a novel Knowledge Distillation Graph neural network-based package ETA prediction (KDG-ETA) model, which uses knowledge distillation in the training phase to distill the knowledge of historical trajectories into OD pair embeddings. In KDG-ETA, a multi-level trajectory graph representation model is proposed to fully exploit trajectory information at the node level, edge level, and path level. Then, the OD representations embedded with trajectory knowledge are combined with context embeddings from the feature extraction module for delivery time prediction using an adaptive attention module. KDG-ETA consistently outperforms existing state-of-the-art OD-based ETA prediction methods on three real-world Alibaba datasets, reducing the Mean Absolute Error (MAE) by 3.0%-39.1%, as demonstrated in our extensive empirical evaluation.
{"title":"Package Arrival Time Prediction via Knowledge Distillation Graph Neural Network","authors":"Lei Zhang, Yong Liu, Zhiwei Zeng, Yiming Cao, Xingyu Wu, Yonghui Xu, Zhiqi Shen, Lizhen Cui","doi":"10.1145/3643033","DOIUrl":"https://doi.org/10.1145/3643033","url":null,"abstract":"<p>Accurately estimating packages’ arrival time in e-commerce can enhance users’ shopping experience and improve the placement rate of products. This problem is often formalized as an Origin-Destination (OD)-based ETA (<i>i.e.</i> estimated time of arrival) prediction task, where the delivery time is estimated mainly based on sender and receiver addresses and other context information. One inherent challenge of the OD-based ETA problem is that the delivery time highly depends on the actual delivery trajectory which is unknown at the time of prediction. In this paper, we tackle this challenge by effectively exploiting historical delivery trajectories. We propose a novel Knowledge Distillation Graph neural network-based package ETA prediction (KDG-ETA) model, which uses knowledge distillation in the training phase to distill the knowledge of historical trajectories into OD pair embeddings. In KDG-ETA, a multi-level trajectory graph representation model is proposed to fully exploit trajectory information at the node-level, edge-level, and path-level. Then, the OD representations embedded with trajectory knowledge are combined with context embeddings from feature extraction module for delivery time prediction using an adaptive attention module. KDG-ETA consistently outperforms existing state-of-the-art OD-based ETA prediction methods on three real-world Alibaba datasets, reducing the Mean Absolute Error (MAE) by 3.0%-39.1% as demonstrated in our extensive empirical evaluation.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"40 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential pattern mining (SPM) is an important technique in the field of pattern mining, with many real-world applications. Although many efficient SPM algorithms have been proposed, few studies focus on targeted tasks. Targeted querying of the sequential patterns of interest can not only reduce the number of patterns generated, but also increase the efficiency of users in performing related analyses. The current algorithms available for targeted sequence querying are based on specific scenarios and cannot be extended to other applications. In this paper, we formulate the problem of targeted sequential pattern mining and propose a generic algorithm, namely TaSPM. Moreover, to improve the efficiency of TaSPM on large-scale datasets and multiple-item-based sequence datasets, we propose several pruning strategies to reduce meaningless operations in the mining process. In total, four pruning strategies are designed in TaSPM, so it can terminate unnecessary pattern extensions quickly and achieve better performance. Finally, we conducted extensive experiments on different datasets to compare the baseline SPM algorithm with TaSPM. The experiments show that the novel targeted mining algorithm TaSPM achieves faster running times and lower memory consumption.
{"title":"TaSPM: Targeted Sequential Pattern Mining","authors":"Gengsen Huang, Wensheng Gan, Philip S. Yu","doi":"10.1145/3639827","DOIUrl":"https://doi.org/10.1145/3639827","url":null,"abstract":"<p>Sequential pattern mining (SPM) is an important technique in the field of pattern mining, which has many applications in reality. Although many efficient SPM algorithms have been proposed, there are few studies that can focus on targeted tasks. Targeted querying of the concerned sequential patterns can not only reduce the number of patterns generated, but also increase the efficiency of users in performing related analysis. The current algorithms available for targeted sequence querying are based on specific scenarios and can not be extended to other applications. In this paper, we formulate the problem of targeted sequential pattern mining and propose a generic algorithm, namely TaSPM. What is more, to improve the efficiency of TaSPM on large-scale datasets and multiple-item-based sequence datasets, we propose several pruning strategies to reduce meaningless operations in the mining process. Totally four pruning strategies are designed in TaSPM, and hence TaSPM can terminate unnecessary pattern extensions quickly and achieve better performance. Finally, we conducted extensive experiments on different datasets to compare the baseline SPM algorithm with TaSPM. Experiments show that the novel targeted mining algorithm TaSPM can achieve faster running time and less memory consumption.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"22 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139499120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep forest is a non-differentiable deep model that has achieved impressive empirical success across a wide variety of applications, especially on categorical/symbolic or mixed modeling tasks. Many of these application fields prefer explainable models, such as random forests with feature contributions that can provide a local explanation for each prediction, and Mean Decrease Impurity (MDI), which can provide global feature importance. However, deep forest, as a cascade of random forests, possesses interpretability only at the first layer. From the second layer on, many of the tree splits occur on the new features generated by the previous layer, which makes existing explanation tools for random forests inapplicable. To disclose the impact of the original features in the deep layers, we design a calculation method with an estimation step followed by a calibration step for each layer, and propose feature contribution and MDI feature importance calculation tools for deep forest. Experimental results on both simulated and real-world data verify the effectiveness of our methods.
{"title":"Interpreting Deep Forest through Feature Contribution and MDI Feature Importance","authors":"Yi-Xiao He, Shen-Huan Lyu, Yuan Jiang","doi":"10.1145/3641108","DOIUrl":"https://doi.org/10.1145/3641108","url":null,"abstract":"<p>Deep forest is a non-differentiable deep model that has achieved impressive empirical success across a wide variety of applications, especially on categorical/symbolic or mixed modeling tasks. Many of the application fields prefer explainable models, such as random forests with feature contributions that can provide a local explanation for each prediction, and Mean Decrease Impurity (MDI) that can provide global feature importance. However, deep forest, as a cascade of random forests, possesses interpretability only at the first layer. From the second layer on, many of the tree splits occur on the new features generated by the previous layer, which makes existing explaining tools for random forests inapplicable. To disclose the impact of the original features in the deep layers, we design a calculation method with an estimation step followed by a calibration step for each layer, and propose our feature contribution and MDI feature importance calculation tools for deep forest. Experimental results on both simulated data and real-world data verify the effectiveness of our methods.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139501100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang
Since the proposal of Granger causality, many researchers have followed the idea and developed extensions to the original algorithm. The classic Granger causality test aims to detect the existence of a static causal relationship. Notably, a fundamental assumption underlying most previous studies is the stationarity of causality, which requires the causal relationships between variables to remain stable. However, this study argues that this assumption is easily violated in real-world scenarios. Fortunately, our paper presents an essential observation: if we consider a sufficiently short window when discovering rapidly changing causalities, they remain approximately static and can thus be correctly detected using static methods. In light of this, we develop EffCause, bringing dynamics into classic Granger causality. Specifically, to efficiently examine causalities over different sliding window lengths, we design two optimization schemes in EffCause and demonstrate its advantages through extensive experiments on both simulated and real-world datasets. The results validate that EffCause achieves state-of-the-art accuracy in continuous causal discovery tasks while achieving faster computation. Case studies from cloud system failure analysis and traffic flow monitoring show that EffCause effectively helps us understand real-world time-series data and solve practical problems.
{"title":"EffCause: Discover Dynamic Causal Relationships Efficiently from Time-Series","authors":"Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang","doi":"10.1145/3640818","DOIUrl":"https://doi.org/10.1145/3640818","url":null,"abstract":"<p>Since the proposal of Granger causality, many researchers have followed the idea and developed extensions to the original algorithm. The classic Granger causality test aims to detect the existence of the static causal relationship. Notably, a fundamental assumption underlying most previous studies is the stationarity of causality, which requires the causality between variables to keep stable. However, this study argues that it is easy to break in real-world scenarios. Fortunately, our paper presents an essential observation: if we consider a sufficiently short window when discovering the rapidly changing causalities, they will keep approximately static and thus can be detected using the static way correctly. In light of this, we develop EffCause, bringing dynamics into classic Granger causality. Specifically, to efficiently examine the causalities on different sliding window lengths, we design two optimization schemes in EffCause and demonstrate the advantage of EffCause through extensive experiments on both simulated and real-world datasets. The results validate that EffCause achieves state-of-the-art accuracy in continuous causal discovery tasks while achieving faster computation. Case studies from cloud system failure analysis and traffic flow monitoring show that EffCause effectively helps us understand real-world time-series data and solve practical problems.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"7 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139477015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma
Multi-task learning (MTL) is widely used in online recommendation and financial services for multi-step conversion estimation, but current works often overlook the sequential dependence among tasks. In particular, sequential dependence multi-task learning (SDMTL) faces challenges in dealing with complex task correlations and extracting valuable information in real-world scenarios, leading to negative transfer and deteriorated performance. Herein, a systematic learning paradigm for the SDMTL problem is established for the first time, which applies to more general multi-step conversion scenarios with longer conversion paths or various task dependence relationships. Meanwhile, an SDMTL architecture, named Task Aware Feature Extraction (TAFE), is designed to enable dynamic task representation learning from a sample-wise view. TAFE selectively reconstructs the implicit shared information corresponding to each sample and performs explicit task-specific extraction under dependence constraints, which avoids negative transfer and results in more effective information sharing and joint representation learning. Extensive experimental results demonstrate the effectiveness and applicability of the proposed theoretical and implementation frameworks. Furthermore, online evaluations at MYbank showed that TAFE achieved average increases of 9.22% and 3.76% in various scenarios on the post-view click-through & conversion rate (CTCVR) estimation task. Currently, TAFE has been deployed on an online platform to provide various traffic services.
{"title":"Multi-Task Learning with Sequential Dependence Towards Industrial Applications: A Systematic Formulation","authors":"Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma","doi":"10.1145/3640468","DOIUrl":"https://doi.org/10.1145/3640468","url":null,"abstract":"<p>Multi-task learning (MTL) is widely used in the online recommendation and financial services for multi-step conversion estimation, but current works often overlook the sequential dependence among tasks. Particularly, sequential dependence multi-task learning (SDMTL) faces challenges in dealing with complex task correlations and extracting valuable information in real-world scenarios, leading to negative transfer and a deterioration in the performance. Herein, a systematic learning paradigm of the SDMTL problem is established for the first time, which applies to more general multi-step conversion scenarios with longer conversion paths or various task dependence relationships. Meanwhile, an SDMTL architecture, named <b>T</b>ask <b>A</b>ware <b>F</b>eature <b>E</b>xtraction (<b>TAFE</b>), is designed to enable the dynamic task representation learning from a sample-wise view. TAFE selectively reconstructs the implicit shared information corresponding to each sample case and performs the explicit task-specific extraction under dependence constraints, which can avoid the negative transfer, resulting in more effective information sharing and joint representation learning. Extensive experiment results demonstrate the effectiveness and applicability of the proposed theoretical and implementation frameworks. Furthermore, the online evaluations at MYbank showed that TAFE had an average increase of 9.22(% ) and 3.76(% ) in various scenarios on the post-view click-through (& ) conversion rate (CTCVR) estimation task. Currently, TAFE has been depolyed in an online platform to provide various traffic services.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"93 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Link prediction is a fundamental problem in many graph-based applications, such as protein-protein interaction prediction. Recently, graph neural networks (GNNs) have been widely used for link prediction. However, existing GNN-based link prediction (GNN-LP) methods suffer from a scalability problem when training on large-scale graphs, which has received little attention from researchers. In this paper, we first analyze the computational complexity of existing GNN-LP methods, revealing that one reason for the scalability problem stems from their symmetric learning strategy, which applies the same class of GNN models to learn representations for both head nodes and tail nodes. We then propose a novel method, called asymmetric learning (AML), for GNN-LP. More specifically, AML applies a GNN model to learn head node representations while applying a multi-layer perceptron (MLP) model to learn tail node representations. To the best of our knowledge, AML is the first GNN-LP method to adopt an asymmetric learning strategy for node representation learning. Furthermore, we design a novel model architecture and apply a row-wise mini-batch sampling strategy to ensure promising model accuracy and training efficiency for AML. Experiments on three real large-scale datasets show that AML is 1.7× to 7.3× faster in training than baselines with a symmetric learning strategy, while having almost no accuracy loss.
{"title":"Asymmetric Learning for Graph Neural Network based Link Prediction","authors":"Kai-Lang Yao, Wu-Jun Li","doi":"10.1145/3640347","DOIUrl":"https://doi.org/10.1145/3640347","url":null,"abstract":"<p>Link prediction is a fundamental problem in many graph-based applications, such as protein-protein interaction prediction. Recently, graph neural network (GNN) has been widely used for link prediction. However, existing GNN-based link prediction (GNN-LP) methods suffer from scalability problem during training for large-scale graphs, which has received little attention from researchers. In this paper, we first analyze the computation complexity of existing GNN-LP methods, revealing that one reason for the scalability problem stems from their symmetric learning strategy in applying the same class of GNN models to learn representation for both head nodes and tail nodes. We then propose a novel method, called <underline>a</underline>sym<underline>m</underline>etric <underline>l</underline>earning (AML), for GNN-LP. More specifically, AML applies a GNN model to learn head node representation while applying a multi-layer perceptron (MLP) model to learn tail node representation. To the best of our knowledge, AML is the first GNN-LP method to adopt an asymmetric learning strategy for node representation learning. Furthermore, we design a novel model architecture and apply a row-wise mini-batch sampling strategy to ensure promising model accuracy and training efficiency for AML. Experiments on three real large-scale datasets show that AML is 1.7 × ∼ 7.3 × faster in training than baselines with a symmetric learning strategy while having almost no accuracy loss.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"14 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139410440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent spectral graph sparsification research aims to construct ultra-sparse subgraphs for preserving the original graph spectral (structural) properties, such as the first few Laplacian eigenvalues and eigenvectors, which has led to the development of a variety of nearly-linear time numerical and graph algorithms. However, there is very limited progress for spectral sparsification of directed graphs. In this work, we prove the existence of nearly-linear-sized spectral sparsifiers for directed graphs under certain conditions. Furthermore, we introduce a practically-efficient spectral algorithm (diGRASS) for sparsifying real-world, large-scale directed graphs leveraging spectral matrix perturbation analysis. The proposed method has been evaluated using a variety of directed graphs obtained from real-world applications, showing promising results for solving directed graph Laplacians, spectral partitioning of directed graphs, and approximately computing (personalized) PageRank vectors.
{"title":"diGRASS: Directed Graph Spectral Sparsification via Spectrum-Preserving Symmetrization","authors":"Ying Zhang, Zhiqiang Zhao, Zhuo Feng","doi":"10.1145/3639568","DOIUrl":"https://doi.org/10.1145/3639568","url":null,"abstract":"<p>Recent <i>spectral graph sparsification</i> research aims to construct ultra-sparse subgraphs for preserving the original graph spectral (structural) properties, such as the first few Laplacian eigenvalues and eigenvectors, which has led to the development of a variety of nearly-linear time numerical and graph algorithms. However, there is very limited progress for spectral sparsification of directed graphs. In this work, we prove the existence of nearly-linear-sized spectral sparsifiers for directed graphs under certain conditions. Furthermore, we introduce a practically-efficient spectral algorithm (diGRASS) for sparsifying real-world, large-scale directed graphs leveraging spectral matrix perturbation analysis. The proposed method has been evaluated using a variety of directed graphs obtained from real-world applications, showing promising results for solving directed graph Laplacians, spectral partitioning of directed graphs, and approximately computing (personalized) PageRank vectors.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"209 1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of KGs, they are becoming inaccurate and incomplete. This problem can be addressed by knowledge graph completion methods, among which graph attention network (GAT)-based methods stand out due to their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods perform poorly in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel GAT-based method designed for Heterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate GATH's performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.
{"title":"Enhancing Heterogeneous Knowledge Graph Completion with a Novel GAT-based Approach","authors":"Wanxu Wei, Yitong Song, Bin Yao","doi":"10.1145/3639472","DOIUrl":"https://doi.org/10.1145/3639472","url":null,"abstract":"<p>Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of the KGs, they are becoming inaccuracy and incomplete. This problem can be solved by the knowledge graph completion methods, of which graph attention network (GAT)-based methods stand out since their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting issues when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods demonstrate poor performance in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel <underline><b>GAT</b></underline>-based method designed for <underline><b>H</b></underline>eterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling the robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate the GATH’s performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"72 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}