Chuxu Zhang, Dongjin Song, Chao Huang, A. Swami, N. Chawla
Representation learning in heterogeneous graphs aims to learn a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, node classification, etc. This task, however, is challenging not only because of the need to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also because of the need to consider heterogeneous attributes or contents (e.g., text or image) associated with each node. Although a substantial amount of effort has been devoted to homogeneous (or heterogeneous) graph embedding, attributed graph embedding, and graph neural networks, few of these methods can jointly and effectively consider the heterogeneous structural (graph) information and the heterogeneous content information of each node. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed-size set of strongly correlated heterogeneous neighbors for each node and group them by node type. Next, we design a neural network architecture with two modules to aggregate the feature information of those sampled neighboring nodes. The first module encodes "deep" feature interactions of heterogeneous contents and generates a content embedding for each node. The second module aggregates the content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impact of each group to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner. Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering, and inductive node classification & clustering.
{"title":"Heterogeneous Graph Neural Network","authors":"Chuxu Zhang, Dongjin Song, Chao Huang, A. Swami, N. Chawla","doi":"10.1145/3292500.3330961","DOIUrl":"https://doi.org/10.1145/3292500.3330961","url":null,"abstract":"Representation learning in heterogeneous graphs aims to pursue a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, node classification, etc. This task, however, is challenging not only because of the demand to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also due to the need for considering heterogeneous attributes or contents (e.g., text or image) associated with each node. Despite a substantial amount of effort has been made to homogeneous (or heterogeneous) graph embedding, attributed graph embedding as well as graph neural networks, few of them can jointly consider heterogeneous structural (graph) information as well as heterogeneous contents information of each node effectively. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed size of strongly correlated heterogeneous neighbors for each node and group them based upon node types. Next, we design a neural network architecture with two modules to aggregate feature information of those sampled neighboring nodes. The first module encodes \"deep\" feature interactions of heterogeneous contents and generates content embedding for each node. The second module aggregates content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impacts of different groups to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner. Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering and inductive node classification & clustering.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121909240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Donghua Liu, Jing Li, Bo Du, Junfei Chang, Rong Gao
Despite the great success of many matrix factorization based collaborative filtering approaches, there is still much room for improvement in the recommender systems field. One main obstacle is the cold-start and data sparsity problem, which calls for better solutions. Recent studies have attempted to integrate review information into rating prediction. However, there are two main problems: (1) most existing works use a static and independent method to extract the latent feature representations of user and item reviews, ignoring the correlation between the latent features, which may fail to capture user preferences comprehensively; (2) there is no effective framework that unifies ratings and reviews. Therefore, we propose DAML, a novel dual attention mutual learning model between ratings and reviews for item recommendation. Specifically, we utilize local and mutual attention of the convolutional neural network to jointly learn the features of reviews and enhance the interpretability of the proposed DAML model. Then the rating features and review features are integrated into a unified neural network model, and higher-order nonlinear interactions of features are realized by neural factorization machines to complete the final rating prediction. Experiments on five real-world datasets show that DAML achieves significantly better rating prediction accuracy than state-of-the-art methods. Furthermore, the attention mechanism can highlight the relevant information in reviews to increase the interpretability of rating prediction.
{"title":"DAML: Dual Attention Mutual Learning between Ratings and Reviews for Item Recommendation","authors":"Donghua Liu, Jing Li, Bo Du, Junfei Chang, Rong Gao","doi":"10.1145/3292500.3330906","DOIUrl":"https://doi.org/10.1145/3292500.3330906","url":null,"abstract":"Despite the great success of many matrix factorization based collaborative filtering approaches, there is still much space for improvement in recommender system field. One main obstacle is the cold-start and data sparseness problem, requiring better solutions. Recent studies have attempted to integrate review information into rating prediction. However, there are two main problems: (1) most of existing works utilize a static and independent method to extract the latent feature representation of user and item reviews ignoring the correlation between the latent features, which may fail to capture the preference of users comprehensively. (2) there is no effective framework that unifies ratings and reviews. Therefore, we propose a novel d ual a ttention m utual l earning between ratings and reviews for item recommendation, named DAML. Specifically, we utilize local and mutual attention of the convolutional neural network to jointly learn the features of reviews to enhance the interpretability of the proposed DAML model. Then the rating features and review features are integrated into a unified neural network model, and the higher-order nonlinear interaction of features are realized by the neural factorization machines to complete the final rating prediction. Experiments on the five real-world datasets show that DAML achieves significantly better rating prediction accuracy compared to the state-of-the-art methods. Furthermore, the attention mechanism can highlight the relevant information in reviews to increase the interpretability of rating prediction.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116861538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons. In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of "neighbor computing" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.
{"title":"Tutorial: Are You My Neighbor?: Bringing Order to Neighbor Computing Problems.","authors":"D. Anastasiu, H. Rangwala, Andrea Tagarelli","doi":"10.1145/3292500.3332292","DOIUrl":"https://doi.org/10.1145/3292500.3332292","url":null,"abstract":"Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons. In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of \"neighbor computing\" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128977099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present a deployed image recognition system used in a large-scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook, and understanding visual representations of product content is important for search and recommendation applications on Marketplace. In this paper, we present the techniques we used to develop efficient large-scale image classifiers using weakly supervised search log data. We perform an extensive evaluation of the presented techniques, explain our practical experience of developing large-scale classification systems, and discuss the challenges we faced. Our system, MSURU, outperformed the current state-of-the-art system developed at Facebook [23] by 16% in the e-commerce domain. MSURU is deployed to production with significant improvements in search success rate and active interactions on Facebook Marketplace.
{"title":"MSURU: Large Scale E-commerce Image Classification with Weakly Supervised Search Data","authors":"Yina Tang, Fedor Borisyuk, Siddarth Malreddy, Yixuan Li, Yiqun Liu, Sergey Kirshner","doi":"10.1145/3292500.3330696","DOIUrl":"https://doi.org/10.1145/3292500.3330696","url":null,"abstract":"In this paper we present a deployed image recognition system used in a large scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook and understanding visual representations of product content is important for search and recommendation applications on Marketplace. In this paper, we present techniques we used to develop efficient large-scale image classifiers using weakly supervised search log data. We perform extensive evaluation of presented techniques, explain practical experience of developing large-scale classification systems and discuss challenges we faced. Our system, MSURU out-performed current state of the art system developed at Facebook [23] by 16% in e-commerce domain. MSURU is deployed to production with significant improvements in search success rate and active interactions on Facebook Marketplace.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124590923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In contrast to the massive volume of data, it is often the rare categories that are of great importance in many high-impact domains, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, from spam image detection in social media to rare disease diagnosis in medical decision support systems. The unique challenges of rare category analysis include: (1) the highly skewed class-membership distribution; (2) the non-separability of the rare categories from the majority classes; (3) the data and task heterogeneity, e.g., the multi-modal representation of examples, and the analysis of similar rare categories across multiple related tasks. This tutorial aims to provide a concise review of state-of-the-art techniques on complex rare category analysis, where the majority classes have a smooth distribution while the minority classes exhibit a compactness property in the feature space or subspace. In particular, we start with the context, problem definition, and unique challenges of complex rare category analysis; we then present a comprehensive overview of recent advances designed for this problem setting, from rare category exploration without any label information to the exposition step that characterizes rare examples with a compact representation, and from representing rare patterns in a salient embedding space to interpreting the prediction results and providing relevant clues for end users' interpretation; finally, we discuss the potential challenges and shed light on the future directions of complex rare category analysis.
{"title":"Gold Panning from the Mess: Rare Category Exploration, Exposition, Representation, and Interpretation","authors":"Dawei Zhou, Jingrui He","doi":"10.1145/3292500.3332268","DOIUrl":"https://doi.org/10.1145/3292500.3332268","url":null,"abstract":"In contrast to the massive volume of data, it is often the rare categories that are of great importance in many high impact domains, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, from spam image detection in social media to rare disease diagnosis in the medical decision support system. The unique challenges of rare category analysis include: (1) the highly-skewed class-membership distribution; (2) the non-separability nature of the rare categories from the majority classes; (3) the data and task heterogeneity, e.g., the multi-modal representation of examples, and the analysis of similar rare categories across multiple related tasks. This tutorial aims to provide a concise review of state-of-the-art techniques on complex rare category analysis, where the majority classes have a smooth distribution, while the minority classes exhibit a compactness property in the feature space or subspace. In particular, we start with the context, problem definition and unique challenges of complex rare category analysis; then we present a comprehensive overview of recent advances that are designed for this problem setting, from rare category exploration without any label information to the exposition step that characterizes rare examples with a compact representation, from representing rare patterns in a salient embedding space to interpreting the prediction results and providing relevant clues for the end users' interpretation; at last, we will discuss the potential challenges and shed light on the future directions of complex rare category analysis.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124711532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graphs are a standard approach to modeling structured data. Although many machine learning methods depend on a metric over the input objects, defining an appropriate distance function on graphs is still a controversial issue. We propose a novel supervised metric learning method for a subgraph-based distance, called interpretable graph metric learning (IGML). IGML optimizes the distance function in such a way that a small number of important subgraphs can be adaptively selected. This optimization is computationally intractable with a naive application of existing optimization algorithms; we therefore construct an efficient graph-mining-based algorithm to deal with this computational difficulty. Important advantages of our method are (1) a guarantee of optimality from the convex formulation and (2) high interpretability of the results. To our knowledge, none of the existing studies provide an interpretable subgraph-based metric in a supervised manner. In our experiments, we empirically verify that the prediction performance of IGML is superior or comparable to that of existing graph classification methods, which do not have clear interpretability. Further, we demonstrate the usefulness of IGML through illustrative examples of extracted subgraphs and an example of data analysis in the learned metric space.
{"title":"Learning Interpretable Metric between Graphs: Convex Formulation and Computation with Graph Mining","authors":"Tomoki Yoshida, I. Takeuchi, Masayuki Karasuyama","doi":"10.1145/3292500.3330845","DOIUrl":"https://doi.org/10.1145/3292500.3330845","url":null,"abstract":"Graph is a standard approach to modeling structured data. Although many machine learning methods depend on the metric of the input objects, defining an appropriate distance function on graph is still a controversial issue. We propose a novel supervised metric learning method for a subgraph-based distance, called interpretable graph metric learning (IGML). IGML optimizes the distance function in such a way that a small number of important subgraphs can be adaptively selected. This optimization is computationally intractable with naive application of existing optimization algorithms. We construct a graph mining based efficient algorithm to deal with this computational difficulty. Important advantages of our method are 1) guarantee of the optimality from the convex formulation, and 2) high interpretability of results. To our knowledge, none of the existing studies provide an interpretable subgraph-based metric in a supervised manner. In our experiments, we empirically verify superior or comparable prediction performance of IGML to other existing graph classification methods which do not have clear interpretability. Further, we demonstrate usefulness of IGML through some illustrative examples of extracted subgraphs and an example of data analysis on the learned metric space.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129492339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This tutorial covers the state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods, to stimulate and facilitate future work. It serves the KDD mission and objectives of gaining insight from data. The topic is interdisciplinary, bridging scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast-growing area with significant applications and potential. In KDD, these studies first grew under the name of visual data mining. Their recent growth, under the names of deep visualization and visual knowledge discovery, is motivated considerably by the success of deep learning in prediction accuracy and by its failure to explain the produced models without special interpretation efforts. In the areas of Visual Analytics, Information Visualization, and HCI, the increasing trend toward machine learning tasks, including deep learning, is also apparent. This tutorial reviews progress in these areas with a comparative analysis of what each area brings to the joint table. The comparison covers approaches (1) to visualize Machine Learning (ML) models produced by analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, and (5) to discover analytical ML models assisted by visual means. The presenter will draw on multiple relevant publications, including his books "Visual and Spatial Analysis: Advances in Visual Data Mining, Reasoning, and Problem Solving" (Springer, 2005) and "Visual Knowledge Discovery and Machine Learning" (Springer, 2018). The target audience of this tutorial consists of KDD researchers, graduate students, and practitioners with a basic knowledge of machine learning.
{"title":"Interpretable Knowledge Discovery Reinforced by Visual Methods","authors":"Boris Kovalerchuk","doi":"10.1145/3292500.3332278","DOIUrl":"https://doi.org/10.1145/3292500.3332278","url":null,"abstract":"This tutorial covers the state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods to stimulate and facilitate future work. It serves the KDD mission and objectives of gaining insight from the data. The topic is interdisciplinary bridging of scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast growing area with significant applications, and potential. First, in KDD, these studies have grown under the name of visual data mining. The recent growth under the names of deep visualization, and visual knowledge discovery, is motivated considerably by deep learning success in accuracy of prediction and its failure in explanation of the produced models without special interpretation efforts. In the areas of Visual Analytics, Information Visualization, and HCI, the increasing trend toward machine learning tasks, including deep learning, is also apparent. This tutorial reviews progress in these areas with a comparative analysis of what each area brings to the joint table. The comparison includes the approaches: (1) to visualize Machine Learning (ML) models produced by the analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, (5) to discover analytical ML models assisted by visual means. The presenter will use multiple relevant publications including his books: \"Visual and Spatial Analysis: Advances in Visual Data Mining, Reasoning, and Problem Solving\" (Springer, 2005), and \"Visual Knowledge Discovery and Machine Learning\" (Springer, 2018). The target audience of this tutorial consists of KDD researchers, graduate students, and practitioners with the basic knowledge of machine learning.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130154219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random walks are widely adopted in various network analysis tasks, ranging from network embedding to label propagation. They can capture and convert geometric structures into structured sequences while alleviating the issues of sparsity and the curse of dimensionality. Although random walks on plain networks have been intensively studied, in real-world systems nodes are often not pure vertices but have different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network and brings opportunities to random-walk-based analysis. However, it is unclear how random walks could be developed for attributed networks towards effective joint information extraction. Node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures. To bridge the gap, we explore performing joint random walks on attributed networks and utilize them to boost deep node representation learning. The proposed framework GraphRNA consists of two major components, i.e., a collaborative walking mechanism - AttriWalk, and a tailored deep embedding architecture for random walks, named graph recurrent networks (GRN). AttriWalk views node attributes as a bipartite network and uses it to make the walks more diverse and to mitigate the tendency of converging to nodes with high centralities. AttriWalk enables us to advance the prominent deep network embedding model, graph convolutional networks, towards a more effective architecture - GRN. GRN empowers node representations to interact in the same way as nodes interact in the original attributed network. Experimental results on real-world datasets demonstrate the effectiveness of GraphRNA compared with state-of-the-art embedding algorithms.
{"title":"Graph Recurrent Networks With Attributed Random Walks","authors":"Xiao Huang, Qingquan Song, Yuening Li, Xia Hu","doi":"10.1145/3292500.3330941","DOIUrl":"https://doi.org/10.1145/3292500.3330941","url":null,"abstract":"Random walks are widely adopted in various network analysis tasks ranging from network embedding to label propagation. It could capture and convert geometric structures into structured sequences while alleviating the issues of sparsity and curse of dimensionality. Though random walks on plain networks have been intensively studied, in real-world systems, nodes are often not pure vertices, but own different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network, and bring opportunities to the random-walk-based analysis. However, it is unclear how random walks could be developed for attributed networks towards an effective joint information extraction. Node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures. To bridge the gap, we explore to perform joint random walks on attributed networks, and utilize them to boost the deep node representation learning. The proposed framework GraphRNA consists of two major components, i.e., a collaborative walking mechanism - AttriWalk, and a tailored deep embedding architecture for random walks, named graph recurrent networks (GRN). AttriWalk considers node attributes as a bipartite network and uses it to propel the walking more diverse and mitigate the tendency of converging to nodes with high centralities. AttriWalk enables us to advance the prominent deep network embedding model, graph convolutional networks, towards a more effective architecture - GRN. GRN empowers node representations to interact in the same way as nodes interact in the original attributed network. Experimental results on real-world datasets demonstrate the effectiveness of GraphRNA compared with the state-of-the-art embedding algorithms.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121288868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haipeng Chen, R. Liu, Noseong Park, V. S. Subrahmanian
When a new cyber-vulnerability is detected, a Common Vulnerability and Exposure (CVE) number is attached to it. Malicious "exploits" may use these vulnerabilities to carry out attacks. Unlike works which study if a CVE will be used in an exploit, we study the problem of predicting when an exploit is first seen. This is an important question for system administrators, as they need to devote scarce resources to take corrective action when a new vulnerability emerges. Moreover, past works assume that CVSS scores (released by NIST) are available for predictions, but we show that, on average, 49% of real-world exploits occur before CVSS scores are published. This means that past works, which use CVSS scores, miss almost half of the exploits. In this paper, we propose a novel framework to predict when a vulnerability will be exploited via Twitter discussion, without using CVSS score information. We introduce the unique concept of a family of CVE-Author-Tweet (CAT) graphs and build a novel set of features based on such graphs. We define recurrence relations capturing the "hotness" of tweets, the "expertise" of Twitter users on CVEs, and the "availability" of information about CVEs, and prove that we can solve these recurrences via a fixed-point algorithm. Our second innovation adopts Hawkes processes to estimate the number of tweets/retweets related to the CVEs. Using the above two sets of novel features, we propose two ensemble forecast models, FEEU (for classification) and FRET (for regression), to predict when a CVE will be exploited. Compared with natural adaptations of past works (which predict if an exploit will be used), FEEU increases F1 score by 25.1%, while FRET decreases MAE by 37.2%.
{"title":"Using Twitter to Predict When Vulnerabilities will be Exploited","authors":"Haipeng Chen, R. Liu, Noseong Park, V. S. Subrahmanian","doi":"10.1145/3292500.3330742","DOIUrl":"https://doi.org/10.1145/3292500.3330742","url":null,"abstract":"When a new cyber-vulnerability is detected, a Common Vulnerability and Exposure (CVE) number is attached to it. Malicious \"exploits'' may use these vulnerabilities to carry out attacks. Unlike works which study if a CVE will be used in an exploit, we study the problem of predicting when an exploit is first seen. This is an important question for system administrators as they need to devote scarce resources to take corrective action when a new vulnerability emerges. Moreover, past works assume that CVSS scores (released by NIST) are available for predictions, but we show on average that 49% of real world exploits occur before CVSS scores are published. This means that past works, which use CVSS scores, miss almost half of the exploits. In this paper, we propose a novel framework to predict when a vulnerability will be exploited via Twitter discussion, without using CVSS score information. We introduce the unique concept of a family of CVE-Author-Tweet (CAT) graphs and build a novel set of features based on such graphs. We define recurrence relations capturing \"hotness\" of tweets, \"expertise\" of Twitter users on CVEs, and \"availability\" of information about CVEs, and prove that we can solve these recurrences via a fix point algorithm. Our second innovation adopts Hawkes processes to estimate the number of tweets/retweets related to the CVEs. Using the above two sets of novel features, we propose two ensemble forecast models FEEU (for classification) and FRET (for regression) to predict when a CVE will be exploited. Compared with natural adaptations of past works (which predict if an exploit will be used), FEEU increases F1 score by 25.1%, while FRET decreases MAE by 37.2%.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121588063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly being utilized by organizations across the healthcare spectrum. Studies on EHR data have enabled real-world applications like understanding of disease progression, outcomes analysis, and comparative effectiveness research. However, every study is often independently commissioned, and data is gathered through surveys or purchased specifically per study via a long and often painful process. This is followed by an arduous, repetitive cycle of analysis, model building, and generation of insights. This process can take anywhere between 1 and 3 years. In this paper, we present a robust end-to-end machine learning based SaaS system to perform analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in the USA and over ~20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and suitable for further downstream clinical analysis. Specifically, it consists of a ridge-regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizable over outcomes, and also found to contribute to better out-of-bound prediction over known expert features. Importantly, the ML methodologies used are interpretable, which is critical for acceptance of our system by the targeted user base. With the system operational, all of these studies were completed within a time frame of 3-4 weeks, compared to the industry standard of 12-36 months. As such, our system can accelerate analysis and discovery, and result in better ROI due to reduced investments as well as quicker turnaround of studies.
{"title":"A Robust Framework for Accelerated Outcome-driven Risk Factor Identification from EHR","authors":"Prithwish Chakraborty, Faisal Farooq","doi":"10.1145/3292500.3330718","DOIUrl":"https://doi.org/10.1145/3292500.3330718","url":null,"abstract":"Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly being utilized by organizations across the healthcare spectrum. Studies on EHR data have enabled real world applications like understanding of disease progression, outcomes analysis, and comparative effectiveness research. However, often every study is independently commissioned, data is gathered by surveys or specifically purchased per study by a long and often painful process. This is followed by an arduous repetitive cycle of analysis, model building, and generation of insights. This process can take anywhere between 1 - 3 years. In this paper, we present a robust end-to-end machine learning based SaaS system to perform analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in USA and over ~20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and suitable for further downstream clinical analysis. Specifically, it consists of a ridge regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with Chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizable over outcomes, and also found to contribute to better out-of-bound prediction over known expert features. Importantly, the ML methodologies used are interpretable which is critical for acceptance of our system in the targeted user base. With the system being operational, all of these studies were completed within a time frame of 3-4 weeks compared to the industry standard 12-36 months. As such our system can accelerate analysis and discovery, result in better ROI due to reduced investments as well as quicker turn around of studies.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121598871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}