Chuxu Zhang, Dongjin Song, Chao Huang, A. Swami, N. Chawla
Representation learning in heterogeneous graphs aims to learn a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, node classification, etc. This task, however, is challenging not only because of the need to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also because of the need to consider heterogeneous attributes or contents (e.g., text or image) associated with each node. Although a substantial amount of effort has been devoted to homogeneous (or heterogeneous) graph embedding, attributed graph embedding, and graph neural networks, few of these methods can jointly and effectively consider the heterogeneous structural (graph) information and the heterogeneous content information of each node. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed-size set of strongly correlated heterogeneous neighbors for each node and group them by node type. Next, we design a neural network architecture with two modules to aggregate the feature information of those sampled neighboring nodes. The first module encodes "deep" feature interactions of heterogeneous contents and generates a content embedding for each node. The second module aggregates the content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impact of each group to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner. Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering, and inductive node classification & clustering.
{"title":"Heterogeneous Graph Neural Network","authors":"Chuxu Zhang, Dongjin Song, Chao Huang, A. Swami, N. Chawla","doi":"10.1145/3292500.3330961","DOIUrl":"https://doi.org/10.1145/3292500.3330961","url":null,"abstract":"Representation learning in heterogeneous graphs aims to pursue a meaningful vector representation for each node so as to facilitate downstream applications such as link prediction, personalized recommendation, node classification, etc. This task, however, is challenging not only because of the demand to incorporate heterogeneous structural (graph) information consisting of multiple types of nodes and edges, but also due to the need for considering heterogeneous attributes or contents (e.g., text or image) associated with each node. Despite a substantial amount of effort has been made to homogeneous (or heterogeneous) graph embedding, attributed graph embedding as well as graph neural networks, few of them can jointly consider heterogeneous structural (graph) information as well as heterogeneous contents information of each node effectively. In this paper, we propose HetGNN, a heterogeneous graph neural network model, to resolve this issue. Specifically, we first introduce a random walk with restart strategy to sample a fixed size of strongly correlated heterogeneous neighbors for each node and group them based upon node types. Next, we design a neural network architecture with two modules to aggregate feature information of those sampled neighboring nodes. The first module encodes \"deep\" feature interactions of heterogeneous contents and generates content embedding for each node. The second module aggregates content (attribute) embeddings of different neighboring groups (types) and further combines them by considering the impacts of different groups to obtain the ultimate node embedding. Finally, we leverage a graph context loss and a mini-batch gradient descent procedure to train the model in an end-to-end manner. Extensive experiments on several datasets demonstrate that HetGNN can outperform state-of-the-art baselines in various graph mining tasks, i.e., link prediction, recommendation, node classification & clustering and inductive node classification & clustering.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121909240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Donghua Liu, Jing Li, Bo Du, Junfei Chang, Rong Gao
Despite the great success of many matrix factorization based collaborative filtering approaches, there is still much room for improvement in the recommender systems field. One main obstacle is the cold-start and data sparsity problem, which calls for better solutions. Recent studies have attempted to integrate review information into rating prediction. However, there are two main problems: (1) most existing works use a static and independent method to extract the latent feature representations of user and item reviews, ignoring the correlation between the latent features, which may fail to capture user preferences comprehensively; (2) there is no effective framework that unifies ratings and reviews. Therefore, we propose DAML, a novel dual attention mutual learning model between ratings and reviews for item recommendation. Specifically, we utilize local and mutual attention of the convolutional neural network to jointly learn the features of reviews and enhance the interpretability of the proposed DAML model. Then the rating features and review features are integrated into a unified neural network model, and higher-order nonlinear interactions of features are realized by neural factorization machines to complete the final rating prediction. Experiments on five real-world datasets show that DAML achieves significantly better rating prediction accuracy than state-of-the-art methods. Furthermore, the attention mechanism can highlight the relevant information in reviews to increase the interpretability of rating prediction.
{"title":"DAML: Dual Attention Mutual Learning between Ratings and Reviews for Item Recommendation","authors":"Donghua Liu, Jing Li, Bo Du, Junfei Chang, Rong Gao","doi":"10.1145/3292500.3330906","DOIUrl":"https://doi.org/10.1145/3292500.3330906","url":null,"abstract":"Despite the great success of many matrix factorization based collaborative filtering approaches, there is still much space for improvement in recommender system field. One main obstacle is the cold-start and data sparseness problem, requiring better solutions. Recent studies have attempted to integrate review information into rating prediction. However, there are two main problems: (1) most of existing works utilize a static and independent method to extract the latent feature representation of user and item reviews ignoring the correlation between the latent features, which may fail to capture the preference of users comprehensively. (2) there is no effective framework that unifies ratings and reviews. Therefore, we propose a novel d ual a ttention m utual l earning between ratings and reviews for item recommendation, named DAML. Specifically, we utilize local and mutual attention of the convolutional neural network to jointly learn the features of reviews to enhance the interpretability of the proposed DAML model. Then the rating features and review features are integrated into a unified neural network model, and the higher-order nonlinear interaction of features are realized by the neural factorization machines to complete the final rating prediction. Experiments on the five real-world datasets show that DAML achieves significantly better rating prediction accuracy compared to the state-of-the-art methods. Furthermore, the attention mechanism can highlight the relevant information in reviews to increase the interpretability of rating prediction.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116861538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons. In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of "neighbor computing" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.
{"title":"Tutorial: Are You My Neighbor?: Bringing Order to Neighbor Computing Problems.","authors":"D. Anastasiu, H. Rangwala, Andrea Tagarelli","doi":"10.1145/3292500.3332292","DOIUrl":"https://doi.org/10.1145/3292500.3332292","url":null,"abstract":"Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons. In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of \"neighbor computing\" problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128977099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present a deployed image recognition system used in a large-scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook, and understanding visual representations of product content is important for search and recommendation applications on Marketplace. In this paper, we present the techniques we used to develop efficient large-scale image classifiers using weakly supervised search log data. We perform an extensive evaluation of the presented techniques, explain our practical experience of developing large-scale classification systems, and discuss the challenges we faced. Our system, MSURU, outperformed the current state-of-the-art system developed at Facebook [23] by 16% in the e-commerce domain. MSURU is deployed to production with significant improvements in search success rate and active interactions on Facebook Marketplace.
{"title":"MSURU: Large Scale E-commerce Image Classification with Weakly Supervised Search Data","authors":"Yina Tang, Fedor Borisyuk, Siddarth Malreddy, Yixuan Li, Yiqun Liu, Sergey Kirshner","doi":"10.1145/3292500.3330696","DOIUrl":"https://doi.org/10.1145/3292500.3330696","url":null,"abstract":"In this paper we present a deployed image recognition system used in a large scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook and understanding visual representations of product content is important for search and recommendation applications on Marketplace. In this paper, we present techniques we used to develop efficient large-scale image classifiers using weakly supervised search log data. We perform extensive evaluation of presented techniques, explain practical experience of developing large-scale classification systems and discuss challenges we faced. Our system, MSURU out-performed current state of the art system developed at Facebook [23] by 16% in e-commerce domain. MSURU is deployed to production with significant improvements in search success rate and active interactions on Facebook Marketplace.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124590923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In contrast to the massive volume of data, it is often the rare categories that are of great importance in many high-impact domains, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, from spam image detection in social media to rare disease diagnosis in medical decision support systems. The unique challenges of rare category analysis include: (1) the highly skewed class-membership distribution; (2) the non-separability of the rare categories from the majority classes; (3) the data and task heterogeneity, e.g., the multi-modal representation of examples, and the analysis of similar rare categories across multiple related tasks. This tutorial aims to provide a concise review of state-of-the-art techniques on complex rare category analysis, where the majority classes have a smooth distribution while the minority classes exhibit a compactness property in the feature space or subspace. In particular, we start with the context, problem definition, and unique challenges of complex rare category analysis; we then present a comprehensive overview of recent advances designed for this problem setting, from rare category exploration without any label information to the exposition step that characterizes rare examples with a compact representation, and from representing rare patterns in a salient embedding space to interpreting the prediction results and providing relevant clues for end users' interpretation; finally, we discuss the potential challenges and shed light on the future directions of complex rare category analysis.
{"title":"Gold Panning from the Mess: Rare Category Exploration, Exposition, Representation, and Interpretation","authors":"Dawei Zhou, Jingrui He","doi":"10.1145/3292500.3332268","DOIUrl":"https://doi.org/10.1145/3292500.3332268","url":null,"abstract":"In contrast to the massive volume of data, it is often the rare categories that are of great importance in many high impact domains, ranging from financial fraud detection in online transaction networks to emerging trend detection in social networks, from spam image detection in social media to rare disease diagnosis in the medical decision support system. The unique challenges of rare category analysis include: (1) the highly-skewed class-membership distribution; (2) the non-separability nature of the rare categories from the majority classes; (3) the data and task heterogeneity, e.g., the multi-modal representation of examples, and the analysis of similar rare categories across multiple related tasks. This tutorial aims to provide a concise review of state-of-the-art techniques on complex rare category analysis, where the majority classes have a smooth distribution, while the minority classes exhibit a compactness property in the feature space or subspace. In particular, we start with the context, problem definition and unique challenges of complex rare category analysis; then we present a comprehensive overview of recent advances that are designed for this problem setting, from rare category exploration without any label information to the exposition step that characterizes rare examples with a compact representation, from representing rare patterns in a salient embedding space to interpreting the prediction results and providing relevant clues for the end users' interpretation; at last, we will discuss the potential challenges and shed light on the future directions of complex rare category analysis.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124711532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graphs are a standard approach to modeling structured data. Although many machine learning methods depend on a metric over the input objects, defining an appropriate distance function on graphs is still a controversial issue. We propose a novel supervised metric learning method for a subgraph-based distance, called interpretable graph metric learning (IGML). IGML optimizes the distance function in such a way that a small number of important subgraphs can be adaptively selected. This optimization is computationally intractable with a naive application of existing optimization algorithms; we therefore construct an efficient graph-mining-based algorithm to deal with this computational difficulty. Important advantages of our method are (1) a guarantee of optimality from the convex formulation and (2) high interpretability of the results. To our knowledge, none of the existing studies provide an interpretable subgraph-based metric in a supervised manner. In our experiments, we empirically verify that the prediction performance of IGML is superior or comparable to that of existing graph classification methods, which do not have clear interpretability. Further, we demonstrate the usefulness of IGML through illustrative examples of extracted subgraphs and an example of data analysis in the learned metric space.
{"title":"Learning Interpretable Metric between Graphs: Convex Formulation and Computation with Graph Mining","authors":"Tomoki Yoshida, I. Takeuchi, Masayuki Karasuyama","doi":"10.1145/3292500.3330845","DOIUrl":"https://doi.org/10.1145/3292500.3330845","url":null,"abstract":"Graph is a standard approach to modeling structured data. Although many machine learning methods depend on the metric of the input objects, defining an appropriate distance function on graph is still a controversial issue. We propose a novel supervised metric learning method for a subgraph-based distance, called interpretable graph metric learning (IGML). IGML optimizes the distance function in such a way that a small number of important subgraphs can be adaptively selected. This optimization is computationally intractable with naive application of existing optimization algorithms. We construct a graph mining based efficient algorithm to deal with this computational difficulty. Important advantages of our method are 1) guarantee of the optimality from the convex formulation, and 2) high interpretability of results. To our knowledge, none of the existing studies provide an interpretable subgraph-based metric in a supervised manner. In our experiments, we empirically verify superior or comparable prediction performance of IGML to other existing graph classification methods which do not have clear interpretability. Further, we demonstrate usefulness of IGML through some illustrative examples of extracted subgraphs and an example of data analysis on the learned metric space.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129492339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This tutorial covers the state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods, to stimulate and facilitate future work. It serves the KDD mission and objectives of gaining insight from data. The topic is interdisciplinary, bridging scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast-growing area with significant applications and potential. In KDD, these studies first grew under the name of visual data mining. Their recent growth, under the names of deep visualization and visual knowledge discovery, is motivated considerably by the success of deep learning in prediction accuracy and by its failure to explain the produced models without special interpretation efforts. In the areas of Visual Analytics, Information Visualization, and HCI, the increasing trend toward machine learning tasks, including deep learning, is also apparent. This tutorial reviews progress in these areas with a comparative analysis of what each area brings to the joint table. The comparison covers approaches (1) to visualize Machine Learning (ML) models produced by analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, and (5) to discover analytical ML models assisted by visual means. The presenter will draw on multiple relevant publications, including his books "Visual and Spatial Analysis: Advances in Visual Data Mining, Reasoning, and Problem Solving" (Springer, 2005) and "Visual Knowledge Discovery and Machine Learning" (Springer, 2018). The target audience of this tutorial consists of KDD researchers, graduate students, and practitioners with a basic knowledge of machine learning.
{"title":"Interpretable Knowledge Discovery Reinforced by Visual Methods","authors":"Boris Kovalerchuk","doi":"10.1145/3292500.3332278","DOIUrl":"https://doi.org/10.1145/3292500.3332278","url":null,"abstract":"This tutorial covers the state-of-the-art research, development, and applications in the KDD area of interpretable knowledge discovery reinforced by visual methods to stimulate and facilitate future work. It serves the KDD mission and objectives of gaining insight from the data. The topic is interdisciplinary bridging of scientific research and applied communities in KDD, Visual Analytics, Information Visualization, and HCI. This is a novel and fast growing area with significant applications, and potential. First, in KDD, these studies have grown under the name of visual data mining. The recent growth under the names of deep visualization, and visual knowledge discovery, is motivated considerably by deep learning success in accuracy of prediction and its failure in explanation of the produced models without special interpretation efforts. In the areas of Visual Analytics, Information Visualization, and HCI, the increasing trend toward machine learning tasks, including deep learning, is also apparent. This tutorial reviews progress in these areas with a comparative analysis of what each area brings to the joint table. The comparison includes the approaches: (1) to visualize Machine Learning (ML) models produced by the analytical ML methods, (2) to discover ML models by visual means, (3) to explain deep and other ML models by visual means, (4) to discover visual ML models assisted by analytical ML algorithms, (5) to discover analytical ML models assisted by visual means. The presenter will use multiple relevant publications including his books: \"Visual and Spatial Analysis: Advances in Visual Data Mining, Reasoning, and Problem Solving\" (Springer, 2005), and \"Visual Knowledge Discovery and Machine Learning\" (Springer, 2018). The target audience of this tutorial consists of KDD researchers, graduate students, and practitioners with the basic knowledge of machine learning.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130154219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random walks are widely adopted in various network analysis tasks, ranging from network embedding to label propagation. They can capture and convert geometric structures into structured sequences while alleviating the issues of sparsity and the curse of dimensionality. Although random walks on plain networks have been intensively studied, in real-world systems nodes are often not pure vertices but have different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network and brings opportunities to random-walk-based analysis. However, it is unclear how random walks could be developed for attributed networks towards effective joint information extraction. Node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures. To bridge the gap, we explore performing joint random walks on attributed networks and utilize them to boost deep node representation learning. The proposed framework GraphRNA consists of two major components, i.e., a collaborative walking mechanism - AttriWalk, and a tailored deep embedding architecture for random walks, named graph recurrent networks (GRN). AttriWalk views node attributes as a bipartite network and uses it to make the walks more diverse and to mitigate the tendency of converging to nodes with high centralities. AttriWalk enables us to advance the prominent deep network embedding model, graph convolutional networks, towards a more effective architecture - GRN. GRN empowers node representations to interact in the same way as nodes interact in the original attributed network. Experimental results on real-world datasets demonstrate the effectiveness of GraphRNA compared with state-of-the-art embedding algorithms.
{"title":"Graph Recurrent Networks With Attributed Random Walks","authors":"Xiao Huang, Qingquan Song, Yuening Li, Xia Hu","doi":"10.1145/3292500.3330941","DOIUrl":"https://doi.org/10.1145/3292500.3330941","url":null,"abstract":"Random walks are widely adopted in various network analysis tasks ranging from network embedding to label propagation. It could capture and convert geometric structures into structured sequences while alleviating the issues of sparsity and curse of dimensionality. Though random walks on plain networks have been intensively studied, in real-world systems, nodes are often not pure vertices, but own different characteristics, described by the rich set of data associated with them. These node attributes contain plentiful information that often complements the network, and bring opportunities to the random-walk-based analysis. However, it is unclear how random walks could be developed for attributed networks towards an effective joint information extraction. Node attributes make the node interactions more complicated and are heterogeneous with respect to topological structures. To bridge the gap, we explore to perform joint random walks on attributed networks, and utilize them to boost the deep node representation learning. The proposed framework GraphRNA consists of two major components, i.e., a collaborative walking mechanism - AttriWalk, and a tailored deep embedding architecture for random walks, named graph recurrent networks (GRN). AttriWalk considers node attributes as a bipartite network and uses it to propel the walking more diverse and mitigate the tendency of converging to nodes with high centralities. AttriWalk enables us to advance the prominent deep network embedding model, graph convolutional networks, towards a more effective architecture - GRN. GRN empowers node representations to interact in the same way as nodes interact in the original attributed network. Experimental results on real-world datasets demonstrate the effectiveness of GraphRNA compared with the state-of-the-art embedding algorithms.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121288868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haipeng Chen, R. Liu, Noseong Park, V. S. Subrahmanian
When a new cyber-vulnerability is detected, a Common Vulnerability and Exposure (CVE) number is attached to it. Malicious "exploits" may use these vulnerabilities to carry out attacks. Unlike works which study if a CVE will be used in an exploit, we study the problem of predicting when an exploit is first seen. This is an important question for system administrators, as they need to devote scarce resources to take corrective action when a new vulnerability emerges. Moreover, past works assume that CVSS scores (released by NIST) are available for predictions, but we show that, on average, 49% of real-world exploits occur before CVSS scores are published. This means that past works, which use CVSS scores, miss almost half of the exploits. In this paper, we propose a novel framework to predict when a vulnerability will be exploited via Twitter discussion, without using CVSS score information. We introduce the unique concept of a family of CVE-Author-Tweet (CAT) graphs and build a novel set of features based on such graphs. We define recurrence relations capturing the "hotness" of tweets, the "expertise" of Twitter users on CVEs, and the "availability" of information about CVEs, and prove that we can solve these recurrences via a fixed-point algorithm. Our second innovation adopts Hawkes processes to estimate the number of tweets/retweets related to the CVEs. Using the above two sets of novel features, we propose two ensemble forecast models, FEEU (for classification) and FRET (for regression), to predict when a CVE will be exploited. Compared with natural adaptations of past works (which predict if an exploit will be used), FEEU increases F1 score by 25.1%, while FRET decreases MAE by 37.2%.
{"title":"Using Twitter to Predict When Vulnerabilities will be Exploited","authors":"Haipeng Chen, R. Liu, Noseong Park, V. S. Subrahmanian","doi":"10.1145/3292500.3330742","DOIUrl":"https://doi.org/10.1145/3292500.3330742","url":null,"abstract":"When a new cyber-vulnerability is detected, a Common Vulnerability and Exposure (CVE) number is attached to it. Malicious \"exploits'' may use these vulnerabilities to carry out attacks. Unlike works which study if a CVE will be used in an exploit, we study the problem of predicting when an exploit is first seen. This is an important question for system administrators as they need to devote scarce resources to take corrective action when a new vulnerability emerges. Moreover, past works assume that CVSS scores (released by NIST) are available for predictions, but we show on average that 49% of real world exploits occur before CVSS scores are published. This means that past works, which use CVSS scores, miss almost half of the exploits. In this paper, we propose a novel framework to predict when a vulnerability will be exploited via Twitter discussion, without using CVSS score information. We introduce the unique concept of a family of CVE-Author-Tweet (CAT) graphs and build a novel set of features based on such graphs. We define recurrence relations capturing \"hotness\" of tweets, \"expertise\" of Twitter users on CVEs, and \"availability\" of information about CVEs, and prove that we can solve these recurrences via a fix point algorithm. Our second innovation adopts Hawkes processes to estimate the number of tweets/retweets related to the CVEs. Using the above two sets of novel features, we propose two ensemble forecast models FEEU (for classification) and FRET (for regression) to predict when a CVE will be exploited. Compared with natural adaptations of past works (which predict if an exploit will be used), FEEU increases F1 score by 25.1%, while FRET decreases MAE by 37.2%.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121588063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly being utilized by organizations across the healthcare spectrum. Studies on EHR data have enabled real-world applications like understanding of disease progression, outcomes analysis, and comparative effectiveness research. However, every study is often independently commissioned, and data is gathered through surveys or purchased specifically per study via a long and often painful process. This is followed by an arduous, repetitive cycle of analysis, model building, and generation of insights. This process can take anywhere between 1 and 3 years. In this paper, we present a robust end-to-end machine learning based SaaS system to perform analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in the USA and over ~20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and suitable for further downstream clinical analysis. Specifically, it consists of a ridge-regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizable over outcomes, and also found to contribute to better out-of-bound prediction over known expert features. Importantly, the ML methodologies used are interpretable, which is critical for acceptance of our system by the targeted user base. With the system operational, all of these studies were completed within a time frame of 3-4 weeks, compared to the industry standard of 12-36 months. As such, our system can accelerate analysis and discovery, and result in better ROI due to reduced investments as well as quicker turnaround of studies.
{"title":"A Robust Framework for Accelerated Outcome-driven Risk Factor Identification from EHR","authors":"Prithwish Chakraborty, Faisal Farooq","doi":"10.1145/3292500.3330718","DOIUrl":"https://doi.org/10.1145/3292500.3330718","url":null,"abstract":"Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly being utilized by organizations across the healthcare spectrum. Studies on EHR data have enabled real world applications like understanding of disease progression, outcomes analysis, and comparative effectiveness research. However, often every study is independently commissioned, data is gathered by surveys or specifically purchased per study by a long and often painful process. This is followed by an arduous repetitive cycle of analysis, model building, and generation of insights. This process can take anywhere between 1 - 3 years. In this paper, we present a robust end-to-end machine learning based SaaS system to perform analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in USA and over ~20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and suitable for further downstream clinical analysis. Specifically, it consists of a ridge regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with Chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizable over outcomes, and also found to contribute to better out-of-bound prediction over known expert features. Importantly, the ML methodologies used are interpretable which is critical for acceptance of our system in the targeted user base. With the system being operational, all of these studies were completed within a time frame of 3-4 weeks compared to the industry standard 12-36 months. As such our system can accelerate analysis and discovery, result in better ROI due to reduced investments as well as quicker turn around of studies.","PeriodicalId":186134,"journal":{"name":"Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121598871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}