ACM Transactions on Knowledge Discovery from Data (TKDD): Latest Articles, Page 2

GrOD: Deep Learning with Gradients Orthogonal Decomposition for Knowledge Transfer, Distillation, and Adversarial Training

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-18 DOI: 10.1145/3530836

Haoyi Xiong, Ruosi Wan, Jian Zhao, Zeyu Chen, Xingjian Li, Zhanxing Zhu, Jun Huan

Regularization, in which the loss function is a linear combination of the empirical loss and explicit regularization terms, is frequently used in machine learning tasks. The explicit regularization term takes different forms depending on the application. While regularized learning often boosts performance with higher accuracy and faster convergence, regularization sometimes hurts empirical loss minimization and leads to poor performance. To deal with such issues, we propose a novel strategy, namely Gradients Orthogonal Decomposition (GrOD), that improves the training procedure of regularized deep learning. Instead of linearly combining the gradients of the two terms, GrOD uses orthogonal decomposition to re-estimate a new direction for iteration that does not hurt the empirical loss minimization while preserving the regularization effects. We have performed extensive experiments using GrOD to improve commonly used algorithms for transfer learning [2], knowledge distillation [3], and adversarial learning [4]. The experimental results on large datasets, including Caltech 256 [5], MIT indoor 67 [6], CIFAR-10 [7], and ImageNet [8], show significant improvement made by GrOD for all three algorithms in all cases.
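The abstract does not give GrOD's exact update rule, so the following is only a minimal sketch of the general idea of orthogonal gradient decomposition. It assumes that the regularizer gradient is split into a part parallel to the empirical-loss gradient and a part orthogonal to it, and that the parallel part is dropped when it points against the empirical-loss gradient. The function name `combine_gradients` and this conflict rule are illustrative assumptions, not the authors' method.

```python
import numpy as np

def combine_gradients(g_loss, g_reg, eps=1e-12):
    """Hypothetical update direction built by orthogonal decomposition.

    g_loss: gradient of the empirical loss (1-D array).
    g_reg:  gradient of the regularization term (1-D array).
    Assumption: this rule illustrates the idea only; it is not taken
    from the GrOD paper.
    """
    g_loss = np.asarray(g_loss, dtype=float)
    g_reg = np.asarray(g_reg, dtype=float)
    sq_norm = float(g_loss @ g_loss)
    if sq_norm < eps:
        # No usable empirical-loss direction: fall back to the plain sum.
        return g_loss + g_reg
    # Coefficient of the projection of g_reg onto g_loss.
    coeff = float(g_reg @ g_loss) / sq_norm
    if coeff >= 0.0:
        # The two gradients do not conflict: plain linear combination.
        return g_loss + g_reg
    # Conflict: keep only the component of g_reg orthogonal to g_loss.
    orthogonal = g_reg - coeff * g_loss
    return g_loss + orthogonal
```

With this rule the returned direction always has a non-negative inner product with `g_loss`, so a small gradient step along it does not increase the empirical loss to first order, while the orthogonal part of the regularizer gradient is still applied.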



Citations: 7

Spectral Ranking Regression

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-12 DOI: 10.1145/3530693

Ilkay Yildiz, Jennifer G. Dy, D. Erdoğmuş, S. Ostmo, J. Campbell, Michael F. Chiang, Stratis Ioannidis

We study the problem of ranking regression, in which a dataset of rankings is used to learn Plackett–Luce scores as functions of sample features. We propose a novel spectral algorithm to accelerate learning in ranking regression. Our main technical contribution is to show that the Plackett–Luce negative log-likelihood augmented with a proximal penalty has stationary points that satisfy the balance equations of a Markov chain. This allows us to tackle the ranking regression problem via an efficient spectral algorithm by using the Alternating Directions Method of Multipliers (ADMM). ADMM separates the learning of scores from the learning of model parameters and, in turn, enables us to devise fast spectral algorithms for ranking regression via both shallow and deep neural network (DNN) models. For shallow models, our algorithms are up to 579 times faster than Newton's method. For DNN models, we extend the standard ADMM via a Kullback–Leibler proximal penalty and show that this is still amenable to fast inference via a spectral approach. Compared to a state-of-the-art Siamese network, our resulting algorithms are up to 175 times faster and improve predictions by up to 26% in Top-1 accuracy and 6% in Kendall-Tau correlation over five real-life ranking datasets.
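For readers unfamiliar with the Plackett–Luce model, the negative log-likelihood mentioned above can be sketched as follows. This shows only the standard likelihood of a single full ranking given positive scores. It does not reproduce the proximal penalty, the ADMM splitting, or the spectral algorithm of the paper, and the function name `plackett_luce_nll` is an illustrative choice.

```python
import math

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of one ranking under Plackett-Luce.

    scores:  positive score per item, indexed by item id.
    ranking: item ids ordered from most to least preferred.
    The model picks each next item with probability proportional to
    its score among the items not yet ranked.
    """
    nll = 0.0
    remaining = list(ranking)
    for item in ranking:
        total = sum(scores[j] for j in remaining)
        nll -= math.log(scores[item] / total)
        remaining.remove(item)
    return nll
```

In ranking regression the scores would themselves be outputs of a model applied to sample features, and the loss would be summed over all observed rankings.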



Citations: 0

Who will Win the Data Science Competition? Insights from KDD Cup 2019 and Beyond

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-05 DOI: 10.1145/3511896

Hao Liu, Qingyu Guo, Hengshu Zhu, Fuzhen Zhuang, Shen Yang, D. Dou, Hui Xiong

Data science competitions are becoming increasingly popular as a way for enterprises to collect advanced, innovative solutions and for contestants to sharpen their data science skills. Most existing studies of data science competitions focus on improving task-specific data science techniques, such as algorithm design and parameter tuning. However, little effort has been made to understand the data science competition itself. To this end, in this article, we shed light on teams' competition performance and investigate how it evolves in the crowd-sourcing competitive innovation context. Specifically, we first acquire and construct multi-sourced datasets of various data science competitions, including the KDD Cup 2019 machine learning competition and beyond. Then, we conduct an empirical analysis to identify and quantify a rich set of features that are significantly correlated with teams' future performance. Using a team's rank as a proxy, we observe a "the stronger, the stronger" rule; that is, top-ranked teams tend to keep their advantage and dominate weaker teams for the rest of the competition. Our results also confirm that teams with diversified backgrounds tend to perform better. After that, we formulate the problem of predicting a team's future rank and propose the Multi-Task Representation Learning (MTRL) framework to model both static and dynamic features. Extensive experimental results on four real-world data science competitions demonstrate that a team's future performance can be predicted well by MTRL. Finally, we envision that our study will not only help competition organizers understand competitions better, but also provide strategic guidance to contestants, for example on team formation and submission strategy.



Citations: 0

On the Robustness of Metric Learning: An Adversarial Perspective

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-05 DOI: 10.1145/3502726

Mengdi Huai, T. Zheng, Chenglin Miao, Liuyi Yao, Aidong Zhang

Metric learning aims at automatically learning a distance metric from data so that the similarity between data instances is faithfully reflected, and its importance has long been recognized in many fields. An implicit assumption in existing metric learning work is that the learned models operate in a reliable and secure environment. However, the increasingly critical role of metric learning exposes it to the risk of malicious attacks. To better understand the performance of metric learning models in adversarial environments, in this article, we study the robustness of metric learning to adversarial perturbations, that is, imperceptible changes to the input data crafted by an attacker to fool a well-learned model. Unlike traditional classification models, metric learning models take instance pairs rather than individual instances as input, and a perturbation of one instance does not necessarily affect the prediction for an instance pair, which makes the robustness of metric learning more difficult to study. To address this challenge, we first provide a definition of pairwise robustness for metric learning, and then propose a novel projected gradient descent-based attack method (called AckMetric) to evaluate the robustness of metric learning models. To further explore the attacker's ability to change the prediction results, we also propose a theoretical framework to derive an upper bound on the pairwise adversarial loss. Finally, we incorporate the derived bound into the training process of metric learning and design a novel defense method to make the learned models more robust. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed methods.
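To make the notion of a projected-gradient attack on an instance pair concrete, here is a minimal sketch. It is not AckMetric. It assumes a fixed Mahalanobis metric with a symmetric positive semidefinite matrix `M`, perturbs only the first instance of the pair, constrains the perturbation to an L-infinity ball of radius `eps`, and tries to push a similar pair apart. All names and default values are hypothetical.

```python
import numpy as np

def mahalanobis_sq(a, b, M):
    """Squared Mahalanobis distance (a - b)^T M (a - b)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d @ M @ d)

def pgd_pair_attack(x, x_other, M, eps=0.1, step=0.02, n_steps=20):
    """Perturb x within an L-infinity ball to enlarge its distance to x_other.

    Assumptions: M is symmetric, only x is perturbed, and the attacker's
    goal is to make a similar pair look dissimilar.
    """
    x = np.asarray(x, dtype=float)
    x_other = np.asarray(x_other, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        diff = (x + delta) - x_other
        grad = 2.0 * (M @ diff)  # gradient of the squared distance w.r.t. x
        # Signed ascent step, then projection back onto the eps-ball.
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return x + delta
```

Comparing `mahalanobis_sq` before and after the attack against the model's decision threshold shows whether the pair's predicted label would flip. Coordinates in which the two instances already agree have a zero gradient and are left unperturbed by this simple variant.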



Citations: 4

Generalized Euclidean Measure to Estimate Distances on Multilayer Networks

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-01 DOI: 10.1145/3529396

M. Coscia

Estimating the distance covered by a spreading event on a network can lead to a better understanding of epidemics, economic growth, and human behavior. There are many methods for solving this problem, which has been called Node Vector Distance (NVD), on single-layer networks. However, many phenomena are better represented by multilayer networks: networks in which nodes can connect in qualitatively different ways. In this article, we extend the literature by proposing an algorithm that solves NVD for multilayer networks. We do so by adapting the Mahalanobis distance, incorporating the graph's topology via the pseudoinverse of its Laplacian. Since this is a proper generalization of the Euclidean distance in a complex space defined by the topology of the graph, and since it works on multilayer networks, we call our measure the Multi Layer Generalized Euclidean (MLGE). In our experiments, we show that MLGE is intuitive, is theoretically simpler than the alternatives, performs well in recovering infection parameters, and is useful in specific case studies. MLGE requires solving a special case of the effective resistance on the graph, which has a high time complexity. However, this needs to be done only once per network. In the experiments, we show that MLGE can cache its most computationally heavy parts, allowing it to solve hundreds of NVD problems on the same network with little to no additional runtime. MLGE is provided as a free open source tool, along with the data and the code necessary to replicate our results.
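The core computation described above, a Mahalanobis-style distance that uses the Laplacian pseudoinverse, can be sketched as follows. This is not the released MLGE tool. The sketch assumes the network is supplied as one symmetric adjacency matrix (how the layers of a multilayer network are combined into such a matrix is not specified in the abstract and is left out), and the function names are hypothetical.

```python
import numpy as np

def laplacian_pinv(adjacency):
    """Pseudoinverse of the graph Laplacian L = D - A.

    This is the expensive step; it depends only on the network, so it
    can be computed once and reused for every pair of node vectors.
    """
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(L)

def generalized_euclidean(p, q, L_pinv):
    """Distance sqrt((p - q)^T L^+ (p - q)) between two node vectors."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    # Guard against tiny negative values caused by round-off.
    return float(np.sqrt(max(float(d @ L_pinv @ d), 0.0)))
```

Caching `L_pinv` is what allows many distance queries on the same network at little extra cost: each additional query is a single quadratic form. On a two-node graph with one unit-weight edge, the distance between the two indicator vectors equals the square root of the effective resistance between the nodes.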



Citations: 2

Predicting a Person's Next Activity Region with a Dynamic Region-Relation-Aware Graph Neural Network

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-04-01 DOI: 10.1145/3529091

N. Zhu, Jiangxia Cao, Xinjiang Lu, Chuanren Liu, Hao Liu, Yanyan Li, Xiangfeng Luo, H. Xiong

Understanding people's inter-regional mobility behaviors, such as predicting the next activity region (AR) or uncovering the intentions behind regional mobility, is of great value to public administration and business interests. While there are numerous studies on human mobility, they mainly take a statistical view or study movement behaviors within a region. Work on individual-level inter-regional mobility behavior is limited. To this end, in this article, we propose a dynamic region-relation-aware graph neural network (DRRGNN) for exploring individual mobility behaviors over ARs. Specifically, we aim at developing models that can answer three questions: (1) Which regions are the ARs? (2) Which region will be the next AR? (3) Why do people make this regional move? To achieve these tasks, we first propose a method to identify people's ARs. Then, the designed model integrates a dynamic graph convolution network (DGCN) and a recurrent neural network (RNN) to depict the evolution of relations between ARs and to mine regional mobility patterns. In the learning process, the model further considers people's profiles and visited points of interest (POIs). Finally, extensive experiments on two real-world datasets show that the proposed model can significantly improve accuracy for both next-AR prediction and mobility intention prediction.



Citations: 2

Computer Science Diagram Understanding with Topology Parsing

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-18 DOI: 10.1145/3522689

Shaowei Wang, Lingling Zhang, Xuan Luo, Yi Yang, Xin Hu, Tao Qin, Jun Liu

A diagram is a special form of visual expression for representing complex concepts, logic, and knowledge, and it appears widely in educational settings such as textbooks, blogs, and encyclopedias. Current research on diagrams primarily focuses on natural disciplines such as Biology and Geography, whose diagrams are still similar to natural images. In this article, we construct the first dataset of geometric-type diagrams in the Computer Science field, which have more abstract expressions and complex logical relations. The dataset has exhaustive annotations of objects and relations for about 1,300 diagrams and 3,500 question-answer pairs. We introduce the tasks of diagram classification (DC) and diagram question answering (DQA) based on the new dataset, and propose the Diagram Parsing Net (DPN), which focuses on analyzing the topological structure and text information of diagrams. We use DPN-based models to solve the DC and DQA tasks, and compare their performance to that of well-known natural image classification models and visual question answering models. Our experiments show the effectiveness of the proposed DPN-based models on diagram understanding tasks, and also indicate that our dataset is more complex than previous natural image understanding datasets. The presented dataset opens new challenges for research in diagram understanding, and the DPN method provides a novel perspective for studying such data. Our dataset is available from https://github.com/WayneWong97/CSDia.



Citations: 2

Improving Bandit Learning Via Heterogeneous Information Networks: Algorithms and Applications

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-15 DOI: 10.1145/3522590

Xiaoying Zhang, Hong Xie, John C.S. Lui

Contextual bandits serve as an invaluable tool for balancing the exploration vs. exploitation tradeoff in applications such as online recommendation. In many applications, heterogeneous information networks (HINs) provide rich side information for contextual bandits, such as different types of attributes and relationships among users and items. In this article, we propose the first HIN-assisted contextual bandit framework, which utilizes a given HIN to assist contextual bandit learning. The proposed framework uses meta-paths in the HIN to extract rich relations among users and items for the contextual bandit. The main challenge is how to leverage these relations, since users’ preferences over items, the target of our online learning, are closely related to users’ preferences over meta-paths. However, it is unknown which meta-path a user prefers. Thus, both preferences need to be learned online while balancing the exploration vs. exploitation tradeoff. We propose the HIN-assisted upper confidence bound (HUCB) algorithm to address this challenge. For each meta-path, the HUCB algorithm employs an independent base bandit algorithm to handle online item recommendations by leveraging the relationships captured in that meta-path. A bandit master then learns users’ preferences over meta-paths in order to dynamically combine the base bandit algorithms, again balancing exploration against exploitation. We theoretically prove that the HUCB algorithm achieves performance similar to that of the optimal algorithm, in which each user is served according to their true preference over meta-paths (assuming the optimal algorithm knows this preference). Moreover, we prove that, by leveraging the HIN, the HUCB algorithm achieves a smaller regret upper bound than the baseline algorithm that does not use the HIN. Experimental results on a synthetic dataset, as well as on real datasets from LastFM and Yelp, demonstrate the fast learning speed of the HUCB algorithm.
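
The two-level structure described in this abstract (one base bandit per meta-path, plus a master that learns which meta-path to follow) can be illustrated with a small sketch. This is not the authors' HUCB algorithm: it assumes plain, non-contextual UCB1 at both levels, and the names `UCB1`, `run`, `candidates`, and `click_prob` are hypothetical, introduced only for illustration. Each hypothetical meta-path is reduced to the list of items it can reach.

```python
import math
import random


class UCB1:
    """Plain UCB1 over a fixed number of arms (a stand-in, not the paper's bandit)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select(self):
        # Play every arm once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm, reward):
        # Incremental update of the running mean reward of this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run(candidates, click_prob, rounds, seed=0):
    """candidates[p] lists the items reachable through hypothetical meta-path p."""
    rng = random.Random(seed)
    master = UCB1(len(candidates))                      # preference over meta-paths
    bases = [UCB1(len(items)) for items in candidates]  # one base bandit per meta-path
    total_reward = 0
    for _ in range(rounds):
        path = master.select()                # master chooses a meta-path
        slot = bases[path].select()           # that path's base bandit chooses an item
        item = candidates[path][slot]
        reward = 1 if rng.random() < click_prob[item] else 0  # simulated click
        bases[path].update(slot, reward)
        master.update(path, reward)
        total_reward += reward
    return master, bases, total_reward


# Toy setup (assumed values): meta-path 0 reaches items 0 and 1, meta-path 1 reaches items 2 and 3.
master, bases, total = run(
    candidates=[[0, 1], [2, 3]],
    click_prob=[0.1, 0.2, 0.3, 0.8],
    rounds=2000,
)
print("pulls per meta-path:", master.counts)
print("total reward:", total)
```

In this toy setup every item reachable through meta-path 1 has a higher click probability than any item reachable through meta-path 0, so the master should end up routing most rounds to meta-path 1. The sketch ignores user and item context entirely, which the framework in the abstract relies on.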



Citations: 0

Dual-MGAN: An Efficient Approach for Semi-supervised Outlier Detection with Few Identified Anomalies

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-15 DOI: 10.1145/3522690

Zhe Li, Chunhua Sun, Chunli Liu, Xiayu Chen, Meng Wang, Yezheng Liu

Outlier detection is an important task in data mining, and many techniques for it have been explored in various applications. However, owing to the default assumption that outliers are not concentrated, unsupervised outlier detection may not correctly identify group anomalies with higher levels of density. Although high detection rates and optimal parameters can usually be achieved by using supervised outlier detection, obtaining a sufficient number of correct labels is a time-consuming task. To solve these problems, we focus on semi-supervised outlier detection with few identified anomalies and a large amount of unlabeled data. The task of semi-supervised outlier detection is first decomposed into the detection of discrete anomalies and that of partially identified group anomalies, and a distribution construction sub-module and a data augmentation sub-module are then proposed to identify them, respectively. In this way, the dual multiple generative adversarial networks (Dual-MGAN) that combine the two sub-modules can identify discrete as well as partially identified group anomalies. In addition, because it is difficult to determine when training should stop, two evaluation indicators are introduced to evaluate the training status of the sub-GANs. Extensive experiments on synthetic and real-world data show that the proposed Dual-MGAN can significantly improve the accuracy of outlier detection, and that the proposed evaluation indicators can reflect the training status of the sub-GANs.



Citations: 2

Toward Quality of Information Aware Distributed Machine Learning

ACM Transactions on Knowledge Discovery from Data (TKDD)

Pub Date : 2022-03-15 DOI: 10.1145/3522591

Houping Xiao, Shiyu Wang

In the era of big data, data are usually distributed across numerous connected computing and storage units (i.e., nodes or workers). In such an environment, many machine learning problems can be reformulated as a consensus optimization problem, whose objective and constraint terms split into N parts (each corresponding to a node). Such a problem can be solved efficiently in a distributed manner via the Alternating Direction Method of Multipliers (ADMM). However, existing consensus optimization frameworks assume that every node has the same quality of information (QoI), i.e., that the data from all nodes are equally informative for estimating the global model parameters. As a consequence, they may produce inaccurate estimates in the presence of nodes with low QoI. To overcome this challenge, in this article, we propose a novel consensus optimization framework for distributed machine learning that incorporates the crucial metric, QoI. Theoretically, we prove that the convergence rate of the proposed framework is linear in the number of iterations, but has a tighter upper bound compared with ADMM. Experimentally, we show that the proposed framework is more efficient and effective than existing ADMM-based solutions on both synthetic and real-world datasets due to its faster convergence rate and higher accuracy.
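
To make the consensus setting concrete, the following sketch runs standard consensus ADMM on a toy scalar problem in which each node's local loss is scaled by a quality weight. It is not the framework proposed in the article: the weights are assumed to be fixed and known in advance, the local loss `0.5 * w_i * (x - a_i)**2` is chosen only because its update has a closed form, and the names `consensus_admm`, `local_values`, and `weights` are hypothetical.

```python
def consensus_admm(local_values, weights, rho=1.0, iters=500):
    """Minimize sum_i 0.5 * w_i * (x_i - a_i)^2 subject to x_i = z for all nodes i.

    local_values: a_i, the estimate held by node i (toy data).
    weights:      w_i > 0, an assumed fixed quality weight for node i.
    Returns the consensus value z.
    """
    n = len(local_values)
    x = list(local_values)   # local variables, one per node
    u = [0.0] * n            # scaled dual variables, one per node
    z = sum(x) / n           # global consensus variable
    for _ in range(iters):
        # Local step: closed-form minimizer of
        # 0.5 * w * (x - a)^2 + 0.5 * rho * (x - z + u)^2 for each node.
        x = [
            (w * a + rho * (z - ui)) / (w + rho)
            for a, w, ui in zip(local_values, weights, u)
        ]
        # Global step: average the local variables plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each node's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


# Two consistent nodes and one outlying node that is given a low assumed weight.
values = [1.0, 1.2, 5.0]
print("equal weights:  ", consensus_admm(values, [1.0, 1.0, 1.0]))
print("low-QoI weights:", consensus_admm(values, [1.0, 1.0, 0.1]))
```

With equal weights the consensus value approaches the plain mean of the local values, while down-weighting the third node pulls it toward the weighted mean, which is the effect a QoI-aware scheme aims for. How the article actually measures or learns QoI is not stated in the abstract and is not modeled here.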



Citations: 0