Nowadays, more and more machine learning methods are applied in the medical domain. Supervised learning methods adopted in classification, prediction, and segmentation tasks for medical images often suffer decreased performance when the training and testing datasets do not follow the i.i.d. (independent and identically distributed) assumption. Such distribution shifts seriously affect the robustness, fairness, and trustworthiness of machine learning applications in the medical domain. Hence, in this paper, building on our previous work [14], we adopt the CycleGAN (cycle-consistent generative adversarial network) method to cycle-train CT (computed tomography) data from different scanners/manufacturers, aiming to eliminate the distribution shift caused by diverse data terminals. However, due to the mode collapse problem and the generative mechanism of the GAN-based model, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wise. This innovative combination of two generative models enhances the quality of the data provided while maintaining significant features. Meanwhile, we use paired medical images from five patients for the evaluation experiments, using SSIM (structural similarity index measure) metrics and a comparison of segmentation model performance. We conclude that CycleGAN is better utilized as an efficient data augmentation technique than as a method for eliminating distribution shift, whereas the denoising diffusion model is more suitable for dealing with the distribution shift caused by different terminal modules. In addition, another limitation of generative methods applied to medical images is the difficulty of obtaining large and diverse datasets that accurately capture the complexity and variability of biological structures. In future work, we will evaluate the original and generated datasets by experimenting with a broader range of supervised methods, and we will implement the generative methods under a federated learning architecture, which can preserve their benefits and eliminate the distribution shift problem on a broader scale.
{"title":"Voxel-wise Medical Images Generalization for Eliminating Distribution Shift","authors":"Feifei Li, Yuanbin Wang, Oya Beyan, Mirjam Schöneck, Liliana Lourenco Caldeira","doi":"10.1145/3643034","DOIUrl":"https://doi.org/10.1145/3643034","url":null,"abstract":"<p>Nowadays, more and more machine learning methods are applied in the medical domain. Supervised Learning methods adopted in classification, prediction, and segmentation tasks for medical images always experience decreased performance when the training and testing datasets do not follow the i.i.d(independent and identically distributed) assumption. These distribution shift situations seriously influence machine learning applications’ robustness, fairness, and trustworthiness in the medical domain. Hence, in this paper, we adopt the CycleGAN(Generative Adversarial Networks) method to cycle train the CT(Computer Tomography) data from different scanners/manufacturers, which aims to eliminate the distribution shift from diverse data terminals, on the basis of our previous work[14]. However, due to the model collapse problem and generative mechanisms of the GAN-based model, the images we generated contained serious artifacts. To remove the boundary marks and artifacts, we adopt score-based diffusion generative models to refine the images voxel-wisely. This innovative combination of two generative models enhances the quality of data providers while maintaining significant features. Meanwhile, we use five paired patients’ medical images to deal with the evaluation experiments with SSIM(structural similarity index measure) metrics and the segmentation model’s performance comparison. We conclude that CycleGAN can be utilized as an efficient data augmentation technique rather than a distribution-shift-eliminating method. While the denoising diffusion model is more suitable for dealing with the distribution shift problem aroused by the different terminal modules. In addition, another limitation of generative methods applied in medical images is the difficulty in obtaining large and diverse datasets that accurately capture the complexity of biological structure and variability. In future works, we will evaluate the original and generative datasets by experimenting with a broader range of supervised methods. We will implement the generative methods under the federated learning architecture, which can preserve their benefits and eliminate the distribution shift problem in a broader range.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chuan-Wei Kuo, Bo-Yu Chen, Wen-Chih Peng, Chih-Chieh Hung, Hsin-Ning Su
In recent years, there has been a significant surge in commercial demand for citation graph-based tasks, such as patent analysis, social network analysis, and recommendation systems. Graph neural networks (GNNs) are widely used for these tasks due to their remarkable performance in capturing topological graph information. However, GNNs' outputs are highly dependent on the composition of local neighbors within the topological structure. To address this issue, we identify two types of neighbors in a citation graph: explicit neighbors based on the topological structure, and implicit neighbors based on node features. Our primary motivation is to clearly define and visualize these neighbors, emphasizing their importance in enhancing graph neural network performance. We propose a Correlation-aware Network (CNet) to re-organize the citation graph and learn more valuable informative representations by leveraging these implicit and explicit neighbors. Our approach aims to improve graph data augmentation and classification performance, with a primary focus on demonstrating the importance of using these neighbors, while also introducing a new graph data augmentation method. We compare CNet with state-of-the-art (SOTA) GNNs and other graph data augmentation approaches acting on GNNs. Extensive experiments demonstrate that CNet effectively extracts more valuable informative representations from the citation graph, significantly outperforming the baselines. The code is publicly available on GitHub.¹
{"title":"Correlation-Aware Graph Data Augmentation with Implicit and Explicit Neighbors","authors":"Chuan-Wei Kuo, Bo-Yu Chen, Wen-Chih Peng, Chih-Chieh Hung, Hsin-Ning Su","doi":"10.1145/3638057","DOIUrl":"https://doi.org/10.1145/3638057","url":null,"abstract":"<p>In recent years, there has been a significant surge in commercial demand for citation graph-based tasks, such as patent analysis, social network analysis, and recommendation systems. Graph Neural Networks (GNNs) are widely used for these tasks due to their remarkable performance in capturing topological graph information. However, GNNs’ output results are highly dependent on the composition of local neighbors within the topological structure. To address this issue, we identify two types of neighbors in a citation graph: explicit neighbors based on the topological structure, and implicit neighbors based on node features. Our primary motivation is to clearly define and visualize these neighbors, emphasizing their importance in enhancing graph neural network performance. We propose a Correlation-aware Network (CNet) to re-organize the citation graph and learn more valuable informative representations by leveraging these implicit and explicit neighbors. Our approach aims to improve graph data augmentation and classification performance, with the majority of our focus on stating the importance of using these neighbors, while also introducing a new graph data augmentation method. We compare CNet with state-of-the-art (SOTA) GNNs and other graph data augmentation approaches acting on GNNs. Extensive experiments demonstrate that CNet effectively extracts more valuable informative representations from the citation graph, significantly outperforming baselines. The code is available on public GitHub\u0000<sup>1</sup>.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"27 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lei Zhang, Yong Liu, Zhiwei Zeng, Yiming Cao, Xingyu Wu, Yonghui Xu, Zhiqi Shen, Lizhen Cui
Accurately estimating packages’ arrival time in e-commerce can enhance users’ shopping experience and improve the placement rate of products. This problem is often formalized as an Origin-Destination (OD)-based ETA (i.e., estimated time of arrival) prediction task, where the delivery time is estimated mainly based on sender and receiver addresses and other context information. One inherent challenge of the OD-based ETA problem is that the delivery time highly depends on the actual delivery trajectory, which is unknown at the time of prediction. In this paper, we tackle this challenge by effectively exploiting historical delivery trajectories. We propose a novel Knowledge Distillation Graph neural network-based package ETA prediction (KDG-ETA) model, which uses knowledge distillation in the training phase to distill the knowledge of historical trajectories into OD pair embeddings. In KDG-ETA, a multi-level trajectory graph representation model is proposed to fully exploit trajectory information at the node level, edge level, and path level. Then, the OD representations embedded with trajectory knowledge are combined with context embeddings from the feature extraction module for delivery time prediction using an adaptive attention module. KDG-ETA consistently outperforms existing state-of-the-art OD-based ETA prediction methods on three real-world Alibaba datasets, reducing the Mean Absolute Error (MAE) by 3.0%-39.1%, as demonstrated in our extensive empirical evaluation.
{"title":"Package Arrival Time Prediction via Knowledge Distillation Graph Neural Network","authors":"Lei Zhang, Yong Liu, Zhiwei Zeng, Yiming Cao, Xingyu Wu, Yonghui Xu, Zhiqi Shen, Lizhen Cui","doi":"10.1145/3643033","DOIUrl":"https://doi.org/10.1145/3643033","url":null,"abstract":"<p>Accurately estimating packages’ arrival time in e-commerce can enhance users’ shopping experience and improve the placement rate of products. This problem is often formalized as an Origin-Destination (OD)-based ETA (<i>i.e.</i> estimated time of arrival) prediction task, where the delivery time is estimated mainly based on sender and receiver addresses and other context information. One inherent challenge of the OD-based ETA problem is that the delivery time highly depends on the actual delivery trajectory which is unknown at the time of prediction. In this paper, we tackle this challenge by effectively exploiting historical delivery trajectories. We propose a novel Knowledge Distillation Graph neural network-based package ETA prediction (KDG-ETA) model, which uses knowledge distillation in the training phase to distill the knowledge of historical trajectories into OD pair embeddings. In KDG-ETA, a multi-level trajectory graph representation model is proposed to fully exploit trajectory information at the node-level, edge-level, and path-level. Then, the OD representations embedded with trajectory knowledge are combined with context embeddings from feature extraction module for delivery time prediction using an adaptive attention module. KDG-ETA consistently outperforms existing state-of-the-art OD-based ETA prediction methods on three real-world Alibaba datasets, reducing the Mean Absolute Error (MAE) by 3.0%-39.1% as demonstrated in our extensive empirical evaluation.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"40 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139552582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential pattern mining (SPM) is an important technique in the field of pattern mining, with many real-world applications. Although many efficient SPM algorithms have been proposed, few studies focus on targeted tasks. Targeted querying of the sequential patterns of interest can not only reduce the number of patterns generated, but also increase the efficiency of users in performing related analyses. The current algorithms available for targeted sequence querying are based on specific scenarios and cannot be extended to other applications. In this paper, we formulate the problem of targeted sequential pattern mining and propose a generic algorithm, namely TaSPM. Moreover, to improve the efficiency of TaSPM on large-scale datasets and multiple-item-based sequence datasets, we propose several pruning strategies to reduce meaningless operations in the mining process. In total, four pruning strategies are designed in TaSPM, so it can terminate unnecessary pattern extensions quickly and achieve better performance. Finally, we conducted extensive experiments on different datasets to compare the baseline SPM algorithm with TaSPM. The experiments show that the novel targeted mining algorithm TaSPM achieves faster running times and lower memory consumption.
{"title":"TaSPM: Targeted Sequential Pattern Mining","authors":"Gengsen Huang, Wensheng Gan, Philip S. Yu","doi":"10.1145/3639827","DOIUrl":"https://doi.org/10.1145/3639827","url":null,"abstract":"<p>Sequential pattern mining (SPM) is an important technique in the field of pattern mining, which has many applications in reality. Although many efficient SPM algorithms have been proposed, there are few studies that can focus on targeted tasks. Targeted querying of the concerned sequential patterns can not only reduce the number of patterns generated, but also increase the efficiency of users in performing related analysis. The current algorithms available for targeted sequence querying are based on specific scenarios and can not be extended to other applications. In this paper, we formulate the problem of targeted sequential pattern mining and propose a generic algorithm, namely TaSPM. What is more, to improve the efficiency of TaSPM on large-scale datasets and multiple-item-based sequence datasets, we propose several pruning strategies to reduce meaningless operations in the mining process. Totally four pruning strategies are designed in TaSPM, and hence TaSPM can terminate unnecessary pattern extensions quickly and achieve better performance. Finally, we conducted extensive experiments on different datasets to compare the baseline SPM algorithm with TaSPM. Experiments show that the novel targeted mining algorithm TaSPM can achieve faster running time and less memory consumption.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"22 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139499120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep forest is a non-differentiable deep model that has achieved impressive empirical success across a wide variety of applications, especially on categorical/symbolic or mixed modeling tasks. Many of these application fields prefer explainable models, such as random forests with feature contributions that can provide a local explanation for each prediction, and Mean Decrease Impurity (MDI), which can provide global feature importance. However, deep forest, as a cascade of random forests, possesses interpretability only at the first layer. From the second layer on, many of the tree splits occur on the new features generated by the previous layer, which makes existing explanation tools for random forests inapplicable. To disclose the impact of the original features in the deep layers, we design a calculation method with an estimation step followed by a calibration step for each layer, and propose feature contribution and MDI feature importance calculation tools for deep forest. Experimental results on both simulated and real-world data verify the effectiveness of our methods.
{"title":"Interpreting Deep Forest through Feature Contribution and MDI Feature Importance","authors":"Yi-Xiao He, Shen-Huan Lyu, Yuan Jiang","doi":"10.1145/3641108","DOIUrl":"https://doi.org/10.1145/3641108","url":null,"abstract":"<p>Deep forest is a non-differentiable deep model that has achieved impressive empirical success across a wide variety of applications, especially on categorical/symbolic or mixed modeling tasks. Many of the application fields prefer explainable models, such as random forests with feature contributions that can provide a local explanation for each prediction, and Mean Decrease Impurity (MDI) that can provide global feature importance. However, deep forest, as a cascade of random forests, possesses interpretability only at the first layer. From the second layer on, many of the tree splits occur on the new features generated by the previous layer, which makes existing explaining tools for random forests inapplicable. To disclose the impact of the original features in the deep layers, we design a calculation method with an estimation step followed by a calibration step for each layer, and propose our feature contribution and MDI feature importance calculation tools for deep forest. Experimental results on both simulated data and real-world data verify the effectiveness of our methods.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"23 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139501100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang
Since the proposal of Granger causality, many researchers have followed the idea and developed extensions to the original algorithm. The classic Granger causality test aims to detect the existence of a static causal relationship. Notably, a fundamental assumption underlying most previous studies is the stationarity of causality, which requires the causal relationships between variables to remain stable. However, this study argues that this assumption is easily violated in real-world scenarios. Fortunately, our paper presents an essential observation: if we consider a sufficiently short window when discovering rapidly changing causalities, they remain approximately static and can thus be correctly detected using static methods. In light of this, we develop EffCause, bringing dynamics into classic Granger causality. Specifically, to efficiently examine causalities over different sliding window lengths, we design two optimization schemes in EffCause and demonstrate its advantages through extensive experiments on both simulated and real-world datasets. The results validate that EffCause achieves state-of-the-art accuracy in continuous causal discovery tasks while achieving faster computation. Case studies from cloud system failure analysis and traffic flow monitoring show that EffCause effectively helps us understand real-world time-series data and solve practical problems.
{"title":"EffCause: Discover Dynamic Causal Relationships Efficiently from Time-Series","authors":"Yicheng Pan, Yifan Zhang, Xinrui Jiang, Meng Ma, Ping Wang","doi":"10.1145/3640818","DOIUrl":"https://doi.org/10.1145/3640818","url":null,"abstract":"<p>Since the proposal of Granger causality, many researchers have followed the idea and developed extensions to the original algorithm. The classic Granger causality test aims to detect the existence of the static causal relationship. Notably, a fundamental assumption underlying most previous studies is the stationarity of causality, which requires the causality between variables to keep stable. However, this study argues that it is easy to break in real-world scenarios. Fortunately, our paper presents an essential observation: if we consider a sufficiently short window when discovering the rapidly changing causalities, they will keep approximately static and thus can be detected using the static way correctly. In light of this, we develop EffCause, bringing dynamics into classic Granger causality. Specifically, to efficiently examine the causalities on different sliding window lengths, we design two optimization schemes in EffCause and demonstrate the advantage of EffCause through extensive experiments on both simulated and real-world datasets. The results validate that EffCause achieves state-of-the-art accuracy in continuous causal discovery tasks while achieving faster computation. Case studies from cloud system failure analysis and traffic flow monitoring show that EffCause effectively helps us understand real-world time-series data and solve practical problems.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"7 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139477015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma
Multi-task learning (MTL) is widely used in online recommendation and financial services for multi-step conversion estimation, but current works often overlook the sequential dependence among tasks. In particular, sequential dependence multi-task learning (SDMTL) faces challenges in dealing with complex task correlations and extracting valuable information in real-world scenarios, leading to negative transfer and deteriorated performance. Herein, a systematic learning paradigm for the SDMTL problem is established for the first time, which applies to more general multi-step conversion scenarios with longer conversion paths or various task dependence relationships. Meanwhile, an SDMTL architecture, named Task Aware Feature Extraction (TAFE), is designed to enable dynamic task representation learning from a sample-wise view. TAFE selectively reconstructs the implicit shared information corresponding to each sample and performs explicit task-specific extraction under dependence constraints, which avoids negative transfer and results in more effective information sharing and joint representation learning. Extensive experimental results demonstrate the effectiveness and applicability of the proposed theoretical and implementation frameworks. Furthermore, online evaluations at MYbank showed that TAFE achieved average increases of 9.22% and 3.76% in various scenarios on the post-view click-through & conversion rate (CTCVR) estimation task. Currently, TAFE has been deployed on an online platform to provide various traffic services.
{"title":"Multi-Task Learning with Sequential Dependence Towards Industrial Applications: A Systematic Formulation","authors":"Xiaobo Guo, Mingming Ha, Xuewen Tao, Shaoshuai Li, Youru Li, Zhenfeng Zhu, Zhiyong Shen, Li Ma","doi":"10.1145/3640468","DOIUrl":"https://doi.org/10.1145/3640468","url":null,"abstract":"<p>Multi-task learning (MTL) is widely used in the online recommendation and financial services for multi-step conversion estimation, but current works often overlook the sequential dependence among tasks. Particularly, sequential dependence multi-task learning (SDMTL) faces challenges in dealing with complex task correlations and extracting valuable information in real-world scenarios, leading to negative transfer and a deterioration in the performance. Herein, a systematic learning paradigm of the SDMTL problem is established for the first time, which applies to more general multi-step conversion scenarios with longer conversion paths or various task dependence relationships. Meanwhile, an SDMTL architecture, named <b>T</b>ask <b>A</b>ware <b>F</b>eature <b>E</b>xtraction (<b>TAFE</b>), is designed to enable the dynamic task representation learning from a sample-wise view. TAFE selectively reconstructs the implicit shared information corresponding to each sample case and performs the explicit task-specific extraction under dependence constraints, which can avoid the negative transfer, resulting in more effective information sharing and joint representation learning. Extensive experiment results demonstrate the effectiveness and applicability of the proposed theoretical and implementation frameworks. Furthermore, the online evaluations at MYbank showed that TAFE had an average increase of 9.22(% ) and 3.76(% ) in various scenarios on the post-view click-through (& ) conversion rate (CTCVR) estimation task. Currently, TAFE has been depolyed in an online platform to provide various traffic services.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"93 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139461558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Link prediction is a fundamental problem in many graph-based applications, such as protein-protein interaction prediction. Recently, graph neural networks (GNNs) have been widely used for link prediction. However, existing GNN-based link prediction (GNN-LP) methods suffer from a scalability problem when training on large-scale graphs, which has received little attention from researchers. In this paper, we first analyze the computational complexity of existing GNN-LP methods, revealing that one reason for the scalability problem stems from their symmetric learning strategy, which applies the same class of GNN models to learn representations for both head nodes and tail nodes. We then propose a novel method, called asymmetric learning (AML), for GNN-LP. More specifically, AML applies a GNN model to learn head node representations while applying a multi-layer perceptron (MLP) model to learn tail node representations. To the best of our knowledge, AML is the first GNN-LP method to adopt an asymmetric learning strategy for node representation learning. Furthermore, we design a novel model architecture and apply a row-wise mini-batch sampling strategy to ensure promising model accuracy and training efficiency for AML. Experiments on three real large-scale datasets show that AML is 1.7× to 7.3× faster in training than baselines with a symmetric learning strategy, while having almost no accuracy loss.
{"title":"Asymmetric Learning for Graph Neural Network based Link Prediction","authors":"Kai-Lang Yao, Wu-Jun Li","doi":"10.1145/3640347","DOIUrl":"https://doi.org/10.1145/3640347","url":null,"abstract":"<p>Link prediction is a fundamental problem in many graph-based applications, such as protein-protein interaction prediction. Recently, graph neural network (GNN) has been widely used for link prediction. However, existing GNN-based link prediction (GNN-LP) methods suffer from scalability problem during training for large-scale graphs, which has received little attention from researchers. In this paper, we first analyze the computation complexity of existing GNN-LP methods, revealing that one reason for the scalability problem stems from their symmetric learning strategy in applying the same class of GNN models to learn representation for both head nodes and tail nodes. We then propose a novel method, called <underline>a</underline>sym<underline>m</underline>etric <underline>l</underline>earning (AML), for GNN-LP. More specifically, AML applies a GNN model to learn head node representation while applying a multi-layer perceptron (MLP) model to learn tail node representation. To the best of our knowledge, AML is the first GNN-LP method to adopt an asymmetric learning strategy for node representation learning. Furthermore, we design a novel model architecture and apply a row-wise mini-batch sampling strategy to ensure promising model accuracy and training efficiency for AML. Experiments on three real large-scale datasets show that AML is 1.7 × ∼ 7.3 × faster in training than baselines with a symmetric learning strategy while having almost no accuracy loss.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"14 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139410440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent spectral graph sparsification research aims to construct ultra-sparse subgraphs for preserving the original graph spectral (structural) properties, such as the first few Laplacian eigenvalues and eigenvectors, which has led to the development of a variety of nearly-linear time numerical and graph algorithms. However, there is very limited progress for spectral sparsification of directed graphs. In this work, we prove the existence of nearly-linear-sized spectral sparsifiers for directed graphs under certain conditions. Furthermore, we introduce a practically-efficient spectral algorithm (diGRASS) for sparsifying real-world, large-scale directed graphs leveraging spectral matrix perturbation analysis. The proposed method has been evaluated using a variety of directed graphs obtained from real-world applications, showing promising results for solving directed graph Laplacians, spectral partitioning of directed graphs, and approximately computing (personalized) PageRank vectors.
{"title":"diGRASS: Directed Graph Spectral Sparsification via Spectrum-Preserving Symmetrization","authors":"Ying Zhang, Zhiqiang Zhao, Zhuo Feng","doi":"10.1145/3639568","DOIUrl":"https://doi.org/10.1145/3639568","url":null,"abstract":"<p>Recent <i>spectral graph sparsification</i> research aims to construct ultra-sparse subgraphs for preserving the original graph spectral (structural) properties, such as the first few Laplacian eigenvalues and eigenvectors, which has led to the development of a variety of nearly-linear time numerical and graph algorithms. However, there is very limited progress for spectral sparsification of directed graphs. In this work, we prove the existence of nearly-linear-sized spectral sparsifiers for directed graphs under certain conditions. Furthermore, we introduce a practically-efficient spectral algorithm (diGRASS) for sparsifying real-world, large-scale directed graphs leveraging spectral matrix perturbation analysis. The proposed method has been evaluated using a variety of directed graphs obtained from real-world applications, showing promising results for solving directed graph Laplacians, spectral partitioning of directed graphs, and approximately computing (personalized) PageRank vectors.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"209 1 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of KGs, they are becoming inaccurate and incomplete. This problem can be addressed by knowledge graph completion methods, among which graph attention network (GAT)-based methods stand out due to their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods perform poorly in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel GAT-based method designed for Heterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate GATH's performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.
{"title":"Enhancing Heterogeneous Knowledge Graph Completion with a Novel GAT-based Approach","authors":"Wanxu Wei, Yitong Song, Bin Yao","doi":"10.1145/3639472","DOIUrl":"https://doi.org/10.1145/3639472","url":null,"abstract":"<p>Knowledge graphs (KGs) play a vital role in enhancing search results and recommendation systems. With the rapid increase in the size of the KGs, they are becoming inaccuracy and incomplete. This problem can be solved by the knowledge graph completion methods, of which graph attention network (GAT)-based methods stand out since their superior performance. However, existing GAT-based knowledge graph completion methods often suffer from overfitting issues when dealing with heterogeneous knowledge graphs, primarily due to the unbalanced number of samples. Additionally, these methods demonstrate poor performance in predicting the tail (head) entity that shares the same relation and head (tail) entity with others. To solve these problems, we propose GATH, a novel <underline><b>GAT</b></underline>-based method designed for <underline><b>H</b></underline>eterogeneous KGs. GATH incorporates two separate attention network modules that work synergistically to predict the missing entities. We also introduce novel encoding and feature transformation approaches, enabling the robust performance of GATH in scenarios with imbalanced samples. Comprehensive experiments are conducted to evaluate the GATH’s performance. Compared with the existing SOTA GAT-based model on Hits@10 and MRR metrics, our model improves performance by 5.2% and 5.2% on the FB15K-237 dataset, and by 4.5% and 14.6% on the WN18RR dataset, respectively.</p>","PeriodicalId":49249,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data","volume":"72 1","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139105360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}