Martin Junghanns, André Petermann, Niklas Teichmann, Kevin Gómez, E. Rahm
Graphs are an intuitive way to model complex relationships between real-world data objects. Thus, graph analytics plays an important role in research and industry. As graphs often reflect heterogeneous domain data, their representation requires an expressive data model including the abstraction of graph collections, for example, to analyze communities inside a social network. Furthermore, answering complex analytical questions about such graphs entails combining multiple analytical operations. To satisfy these requirements, we propose the Extended Property Graph Model, which is semantically rich, schema-free and supports multiple distinct graphs. Based on this representation, the model provides declarative and combinable operators to analyze both single graphs and graph collections. Our current implementation is based on the distributed dataflow framework Apache Flink. We present the results of a first experimental study showing the scalability of our implementation on social network data with up to 11 billion edges.
{"title":"Analyzing extended property graphs with Apache Flink","authors":"Martin Junghanns, André Petermann, Niklas Teichmann, Kevin Gómez, E. Rahm","doi":"10.1145/2980523.2980527","DOIUrl":"https://doi.org/10.1145/2980523.2980527","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"130296291","RegionNum":0,"Total":0}
Considerable effort has been devoted to establishing concepts and designing algorithms that are useful for graph data management. While most work so far has focused on static graphs, there are many networks with time information, i.e., temporal graphs, such as social network messages, phone calls, public transportation, and neural networks. Even the most fundamental problems for static graphs become non-trivial for temporal graphs. In this paper, we explore the minimum-weight spanning tree problem on temporal graphs, which was recently proposed by Huang et al. [SIGMOD 2015]. Even though this problem is proven to be NP-hard, we design practically efficient exact algorithms using integer programming. Experimental results confirm that the proposed algorithms can produce better solutions than a previously proposed approximation algorithm.
{"title":"Integer programming approach for directed minimum spanning tree problem on temporal graphs","authors":"Takuto Ikuta, Takuya Akiba","doi":"10.1145/2980523.2980528","DOIUrl":"https://doi.org/10.1145/2980523.2980528","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"132618949","RegionNum":0,"Total":0}
Networks are a fundamental tool for understanding and modeling complex systems in physics, biology, neuroscience, engineering, and social science. Many networks are known to exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. However, higher-order organization of complex networks -- at the level of small network subgraphs -- remains largely unknown. Here, we develop a generalized framework for clustering networks on the basis of higher-order connectivity patterns. This framework provides mathematical guarantees on the optimality of obtained clusters and scales to networks with billions of edges. The framework reveals higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structure in transportation networks. Results show that networks exhibit rich higher-order organizational structures that are exposed by clustering based on higher-order connectivity patterns.
Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
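The biased random walk mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the walk's details, so the return parameter `p` and in-out parameter `q` below follow the second-order walk as described in the published node2vec paper; the function name `biased_walk` and the adjacency-dict input are hypothetical choices for illustration, and edge weights are assumed to be uniform.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased walk of at most `length` nodes.

    adj maps each node to the set of its neighbors (unweighted graph).
    p and q are assumed names for the return and in-out parameters.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so pick uniformly.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move farther away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

A small `q` pushes the walk outward (closer to depth-first exploration), while a small `p` keeps it near its starting region. In a full pipeline, many such walks would then be fed to a skip-gram style model to learn the node features; that step is omitted here.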
{"title":"Beyond nodes and edges: multiresolution algorithms for network data","authors":"J. Leskovec","doi":"10.1145/2980523.2980525","DOIUrl":"https://doi.org/10.1145/2980523.2980525","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"116690418","RegionNum":0,"Total":0}
T. Johnson, Y. Kanza, L. Lakshmanan, Vladislav Shkapenyuk
Communication networks are typically large, dynamic and extremely complicated. To deploy, maintain, and troubleshoot such networks, it is essential to understand how network elements---such as servers, switches, virtual machines, and virtual network functions---are connected to one another, and to be able to discover communication paths between them. It is also essential to understand how connections change over time, and to be able to pose time-travel queries to retrieve information about past network states. This problem is becoming more acute with the advent of software-defined networks, where network functions are virtualized and managed in a cloud infrastructure. We represent a communication network inventory as a graph where the nodes are network entities and edges represent relationships between them, e.g., hosted-on or communicates-with. Querying such a graph, e.g., for troubleshooting, using existing graph query languages is too cumbersome for network analysts. Thus, in this paper we present Nepal---a network path query language, which is designed to effectively retrieve desired paths from a network graph. The main novelty of Nepal is to consider paths as first-class citizens of the language, which achieves closure under composition while maintaining simplicity. We demonstrate the capabilities of Nepal by examples and discuss query evaluation. We illustrate how path queries can simplify the extraction of information from a dynamic inventory of a multi-layer network and can be used for troubleshooting.
{"title":"Nepal: a path query language for communication networks","authors":"T. Johnson, Y. Kanza, L. Lakshmanan, Vladislav Shkapenyuk","doi":"10.1145/2980523.2980530","DOIUrl":"https://doi.org/10.1145/2980523.2980530","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"132250279","RegionNum":0,"Total":0}
In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks -- these systems espouse a "think-like-a-vertex" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding" [23]). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system [18], which was built on top of the Hadoop MapReduce system. We describe how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
{"title":"NScaleSpark: subgraph-centric graph analytics on Apache Spark","authors":"A. Quamar, A. Deshpande","doi":"10.1145/2980523.2980529","DOIUrl":"https://doi.org/10.1145/2980523.2980529","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124633900","RegionNum":0,"Total":0}
Networks are prevalent in today's electronic world in a wide variety of domains, ranging from Engineering to Social Sciences and from Life Sciences to Physical Sciences. Researchers and practitioners have studied networks in multiple ways, such as defining network metrics, providing theoretical results, and examining problems like pattern mining and link prediction. The NDA workshop is a forum for exchanging ideas and methods for mining, querying and learning with real-world networks, developing new common understandings of the problems at hand, sharing data sets where applicable, and leveraging existing knowledge from different disciplines. The purpose of this workshop is to bring together researchers from academia, industry, and government to create a forum for discussing recent advances in (large-scale) graph analysis, as well as to propose and discuss novel methods and techniques for addressing domain-specific challenges.
{"title":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","authors":"Akhil Arora, Shourya Roy, S. Mehta","doi":"10.1145/2980523","DOIUrl":"https://doi.org/10.1145/2980523","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"115585812","RegionNum":0,"Total":0}
Over the last decade, there has been considerable excitement and research on the study and exploitation of the spread of information and influence over networks. Tremendous advances have been made on the prototypical problem of selecting a small number of seed users to activate over a social network such that the expected number of activated nodes is maximized, under several standard information diffusion models. Scalable heuristics, but more notably scalable approximation algorithms, have been developed in recent years. Unfortunately, the state of the art has several shortcomings. Firstly, most of the research has focused on a simplistic setting where one marketing campaign is active at a time. While there has been some work on modeling and optimizing for competing diffusions, the key role played by the network owner in a campaign has been overlooked. Secondly, the relationship and contract needed between the network owner and the advertisers are not captured. Thirdly, in real life, relationships between multiple campaigns may be more complex than just pure competition. Finally, most of the studies assume that the seeds must be chosen all at once before the campaign starts, with no opportunity to observe the performance of seeds chosen earlier and course-correct as needed. We make a call to arms for opening up the framework of viral marketing to allow for more expressive business models and seed selection strategies, and present some recent research from our group that addresses the modeling and computational challenges.
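The prototypical seed selection problem described above can be sketched in Python as follows. The abstract names no specific diffusion model or algorithm, so this sketch assumes the independent cascade model with a single uniform activation probability and uses plain greedy selection with Monte Carlo spread estimates; all function names and parameters (`simulate_ic`, `greedy_seeds`, `prob`, `runs`) are hypothetical and chosen only for illustration. It also picks all seeds up front for a single campaign, which is exactly the simplistic setting the abstract criticizes.

```python
import random

def simulate_ic(graph, seeds, prob, rng):
    """Run one independent-cascade simulation; return the number of active nodes.

    graph maps a node to an iterable of its out-neighbors.
    Each newly active node gets one chance to activate each inactive neighbor.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and rng.random() < prob:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def expected_spread(graph, seeds, prob, runs, rng):
    """Monte Carlo estimate of the expected number of activated nodes."""
    return sum(simulate_ic(graph, seeds, prob, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, prob=0.1, runs=200, seed=0):
    """Pick k seeds, each time adding the node with the best estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(min(k, len(nodes))):
        best, best_spread = None, -1.0
        for cand in sorted(nodes - set(chosen)):
            spread = expected_spread(graph, chosen + [cand], prob, runs, rng)
            if spread > best_spread:
                best, best_spread = cand, spread
        chosen.append(best)
    return chosen
```

For example, `greedy_seeds({"a": ["b", "c", "d"], "e": ["f"]}, k=2, prob=1.0, runs=1)` selects `"a"` first and then `"e"`, since with certain activation those two nodes together reach all six nodes. This naive version re-simulates every candidate in every round, so it is only practical on small graphs.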
{"title":"Viral marketing 2.0","authors":"L. Lakshmanan","doi":"10.1145/2980523.2980526","DOIUrl":"https://doi.org/10.1145/2980523.2980526","PeriodicalId":246127,"journal":{"name":"Proceedings of the 1st ACM SIGMOD Workshop on Network Data Analytics","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"117142160","RegionNum":0,"Total":0}