Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498313
Lizi Liao, Qirong Ho, Jing Jiang, Ee-Peng Lim
Social networks are an important class of networks that span a wide variety of media, ranging from social websites such as Facebook and Google Plus, citation networks of academic papers and patents, caller networks in telecommunications, and hyperlinked document collections such as Wikipedia - to name a few. Many of these social networks now exceed millions of users or actors, each of which may be associated with rich attribute data such as user profiles in social websites and caller networks, or subject classifications in document collections and citation networks. Such attribute data is often incomplete for a number of reasons - for example, users may be unwilling to spend the effort to complete their profiles, while in the case of document collections, there may be insufficient human labor to accurately classify all documents. At the same time, the tie or link information in these networks may also be incomplete - in social websites, users may simply be unaware of potential acquaintances, while in citation networks, authors may be unaware of appropriate literature that should be referenced. Completing and predicting these missing attributes and ties is important to a spectrum of applications, such as recommendation, personalized search, and targeted advertising, yet large social networks can pose a scalability challenge to existing algorithms designed for this task. Towards this end, we propose an integrative probabilistic model, SLR, that captures both attribute and tie information simultaneously, and can be used for attribute completion and tie prediction, in order to enable the above mentioned applications. A key innovation in our model is the use of triangle motifs to represent ties in the network, in order to scale to networks with millions of nodes and beyond. Experiments on real world datasets show that SLR significantly improves the accuracy of attribute prediction and tie prediction compared to well-known methods, and our distributed, multi-machine implementation easily scales up to millions of users. In addition to fast and accurate attribute and tie prediction, we also demonstrate how SLR can identify the attributes most responsible for homophily within the network, thus revealing which attributes drive network tie formation.
{"title":"SLR: A scalable latent role model for attribute completion and tie prediction in social networks","authors":"Lizi Liao, Qirong Ho, Jing Jiang, Ee-Peng Lim","doi":"10.1109/ICDE.2016.7498313","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498313","url":null,"abstract":"Social networks are an important class of networks that span a wide variety of media, ranging from social websites such as Facebook and Google Plus, citation networks of academic papers and patents, caller networks in telecommunications, and hyperlinked document collections such as Wikipedia - to name a few. Many of these social networks now exceed millions of users or actors, each of which may be associated with rich attribute data such as user profiles in social websites and caller networks, or subject classifications in document collections and citation networks. Such attribute data is often incomplete for a number of reasons - for example, users may be unwilling to spend the effort to complete their profiles, while in the case of document collections, there may be insufficient human labor to accurately classify all documents. At the same time, the tie or link information in these networks may also be incomplete - in social websites, users may simply be unaware of potential acquaintances, while in citation networks, authors may be unaware of appropriate literature that should be referenced. Completing and predicting these missing attributes and ties is important to a spectrum of applications, such as recommendation, personalized search, and targeted advertising, yet large social networks can pose a scalability challenge to existing algorithms designed for this task. Towards this end, we propose an integrative probabilistic model, SLR, that captures both attribute and tie information simultaneously, and can be used for attribute completion and tie prediction, in order to enable the above mentioned applications. A key innovation in our model is the use of triangle motifs to represent ties in the network, in order to scale to networks with millions of nodes and beyond. Experiments on real world datasets show that SLR significantly improves the accuracy of attribute prediction and tie prediction compared to well-known methods, and our distributed, multi-machine implementation easily scales up to millions of users. In addition to fast and accurate attribute and tie prediction, we also demonstrate how SLR can identify the attributes most responsible for homophily within the network, thus revealing which attributes drive network tie formation.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"1062-1073"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78505295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Today, people can access various services with smart carry-on devices, e.g., surf the web with smart phones, make payments with credit cards, or ride a bus with commuting cards. In addition to the offered convenience, the access of such services can reveal their traveled trajectory to service providers. Very often, a user who has signed up for multiple services may expose her trajectory to more than one service providers. This state of affairs raises a privacy concern, but also an opportunity. On one hand, several colluding service providers, or a government agency that collects information from such service providers, may identify and reconstruct users' trajectories to an extent that can be threatening to personal privacy. On the other hand, the processing of such rich data may allow for the development of better services for the common good. In this paper, we take a neutral standpoint and investigate the potential for trajectories accumulated from different sources to be linked so as to reconstruct a larger trajectory of a single person. We develop a methodology, called fuzzy trajectory linking (FTL) that achieves this goal, and two instantiations thereof, one based on hypothesis testing and one on Naïve-Bayes. We provide a theoretical analysis for factors that affect FTL and use two real datasets to demonstrate that our algorithms effectively achieve their goals.
{"title":"Fuzzy trajectory linking","authors":"Huayu Wu, Mingqiang Xue, Jianneng Cao, Panagiotis Karras, W. Ng, Kee Kiat Koo","doi":"10.1109/ICDE.2016.7498296","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498296","url":null,"abstract":"Today, people can access various services with smart carry-on devices, e.g., surf the web with smart phones, make payments with credit cards, or ride a bus with commuting cards. In addition to the offered convenience, the access of such services can reveal their traveled trajectory to service providers. Very often, a user who has signed up for multiple services may expose her trajectory to more than one service providers. This state of affairs raises a privacy concern, but also an opportunity. On one hand, several colluding service providers, or a government agency that collects information from such service providers, may identify and reconstruct users' trajectories to an extent that can be threatening to personal privacy. On the other hand, the processing of such rich data may allow for the development of better services for the common good. In this paper, we take a neutral standpoint and investigate the potential for trajectories accumulated from different sources to be linked so as to reconstruct a larger trajectory of a single person. We develop a methodology, called fuzzy trajectory linking (FTL) that achieves this goal, and two instantiations thereof, one based on hypothesis testing and one on Naïve-Bayes. We provide a theoretical analysis for factors that affect FTL and use two real datasets to demonstrate that our algorithms effectively achieve their goals.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"20 1","pages":"859-870"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86978905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498246
K. Gouda, M. Hassaan
Graph similarity is a basic and essential operation in many applications. In this paper, we are interested in computing graph similarity based on edit distance. Existing graph edit distance computation methods adopt the best first search paradigm A*. These methods are time and space bound. In practice, they can compute the edit distance of graphs containing 12 vertices at most. To enable graph edit similarity computation on larger and distant graphs, we present CSI_GED, a novel edge-based mapping method for computing graph edit distance through common sub-structure isomorphisms enumeration. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing the edit distance on small as well as large and distant graphs. Furthermore, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.
{"title":"CSI_GED: An efficient approach for graph edit similarity computation","authors":"K. Gouda, M. Hassaan","doi":"10.1109/ICDE.2016.7498246","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498246","url":null,"abstract":"Graph similarity is a basic and essential operation in many applications. In this paper, we are interested in computing graph similarity based on edit distance. Existing graph edit distance computation methods adopt the best first search paradigm A*. These methods are time and space bound. In practice, they can compute the edit distance of graphs containing 12 vertices at most. To enable graph edit similarity computation on larger and distant graphs, we present CSI_GED, a novel edge-based mapping method for computing graph edit distance through common sub-structure isomorphisms enumeration. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing the edit distance on small as well as large and distant graphs. Furthermore, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"135 1","pages":"265-276"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86302099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498268
Huy Pham, C. Shahabi
Finding influential people in a society has been the focus of social studies for decades due to its numerous applications, such as viral marketing or spreading ideas and practices. A critical first step is to quantify the amount of influence an individual exerts on another, termed pairwise influence. Early social studies had to confine themselves to surveys and manual data collections for this purpose; more recent studies have exploited web data (e.g., blogs). In this paper, for the first time, we utilize people's movement in the real world (aka spatiotemporal data) to derive pairwise influence. We first define followship to capture the phenomenon of an individual visiting a real-world location (e.g., restaurant) due the influence of another individual who has visited that same location in the past. Subsequently, we coin the term spatial influence as the concept of inferring pairwise influence from spatiotemporal data by quantifying the amount of followship influence that an individual has on others. We then propose the Temporal and Locational Followship Model (TLFM) to estimate spatial influence, in which we study three factors that impact followship: the time delay between the visits, the popularity of the location, and the inherent coincidences in individuals' visiting behaviors. We conducted extensive experiments using various real-world datasets, which demonstrate the effectiveness of our TLFM model in quantifying spatial influence.
{"title":"Spatial influence - measuring followship in the real world","authors":"Huy Pham, C. Shahabi","doi":"10.1109/ICDE.2016.7498268","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498268","url":null,"abstract":"Finding influential people in a society has been the focus of social studies for decades due to its numerous applications, such as viral marketing or spreading ideas and practices. A critical first step is to quantify the amount of influence an individual exerts on another, termed pairwise influence. Early social studies had to confine themselves to surveys and manual data collections for this purpose; more recent studies have exploited web data (e.g., blogs). In this paper, for the first time, we utilize people's movement in the real world (aka spatiotemporal data) to derive pairwise influence. We first define followship to capture the phenomenon of an individual visiting a real-world location (e.g., restaurant) due the influence of another individual who has visited that same location in the past. Subsequently, we coin the term spatial influence as the concept of inferring pairwise influence from spatiotemporal data by quantifying the amount of followship influence that an individual has on others. We then propose the Temporal and Locational Followship Model (TLFM) to estimate spatial influence, in which we study three factors that impact followship: the time delay between the visits, the popularity of the location, and the inherent coincidences in individuals' visiting behaviors. We conducted extensive experiments using various real-world datasets, which demonstrate the effectiveness of our TLFM model in quantifying spatial influence.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"10 1","pages":"529-540"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87688324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498275
Chen Xu, M. Holzemer, Manohar Kaul, V. Markl
Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.
{"title":"Efficient fault-tolerance for iterative graph processing on distributed dataflow systems","authors":"Chen Xu, M. Holzemer, Manohar Kaul, V. Markl","doi":"10.1109/ICDE.2016.7498275","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498275","url":null,"abstract":"Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing usually is part of a larger analytics workflow consiting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed data-flow systems with the objective to reduce the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"36 1","pages":"613-624"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82825012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498348
Victor C. Liang, Richard T. B. Ma, W. Ng, Li Wang, M. Winslett, Huayu Wu, Shanshan Ying, Zhenjie Zhang
Telecommunication companies possess mobility information of their phone users, containing accurate locations and velocities of commuters travelling in public transportation system. Although the value of telecommunication data is well believed under the smart city vision, there is no existing solution to transform the data into actionable items for better transportation, mainly due to the lack of appropriate data utilization scheme and the limited processing capability on massive data. This paper presents the first ever system implementation of real-time public transportation crowd prediction based on telecommunication data, relying on the analytical power of advanced neural network models and the computation power of parallel streaming analytic engines. By analyzing the feeds of caller detail record (CDR) from mobile users in interested regions, our system is able to predict the number of metro passengers entering stations, the number of waiting passengers on the platforms and other important metrics on the crowd density. New techniques, including geographical-spatial data processing, weight-sharing recurrent neural network, and parallel streaming analytical programming, are employed in the system. These new techniques enable accurate and efficient prediction outputs, to meet the real-world business requirements from public transportation system.
{"title":"Mercury: Metro density prediction with recurrent neural network on streaming CDR data","authors":"Victor C. Liang, Richard T. B. Ma, W. Ng, Li Wang, M. Winslett, Huayu Wu, Shanshan Ying, Zhenjie Zhang","doi":"10.1109/ICDE.2016.7498348","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498348","url":null,"abstract":"Telecommunication companies possess mobility information of their phone users, containing accurate locations and velocities of commuters travelling in public transportation system. Although the value of telecommunication data is well believed under the smart city vision, there is no existing solution to transform the data into actionable items for better transportation, mainly due to the lack of appropriate data utilization scheme and the limited processing capability on massive data. This paper presents the first ever system implementation of real-time public transportation crowd prediction based on telecommunication data, relying on the analytical power of advanced neural network models and the computation power of parallel streaming analytic engines. By analyzing the feeds of caller detail record (CDR) from mobile users in interested regions, our system is able to predict the number of metro passengers entering stations, the number of waiting passengers on the platforms and other important metrics on the crowd density. New techniques, including geographical-spatial data processing, weight-sharing recurrent neural network, and parallel streaming analytical programming, are employed in the system. These new techniques enable accurate and efficient prediction outputs, to meet the real-world business requirements from public transportation system.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"30 1","pages":"1374-1377"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87834581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498317
Shenlu Wang, M. A. Cheema, Xuemin Lin, Ying Zhang, Dongxi Liu
Given a set of facilities F, a set of users U and a query facility q, a reverse k furthest neighbors (RkFN) query retrieves every user u ∈ U for which q is one of its k-furthest facilities. RkFN query is the natural complement of reverse k-nearest neighbors (RkNN) query that returns every user u for which q is one of its k-nearest facilities. While RkNN query returns the users that are highly influenced by a query q, RkFN query aims at finding the users that are least influenced by a query q. RkFN query has many applications in location-based services, marketing, facility location, clustering, and recommendation systems etc. While there exist several algorithms that answer RkFN query for k = 1, we are the first to propose a solution for arbitrary value of k. Based on several interesting observations, we present an efficient algorithm to process the RkFN queries. We also present a rigorous theoretical analysis to study various important aspects of the problem and our algorithm. An extensive experimental study is conducted using both real and synthetic data sets, demonstrating that our algorithm outperforms the state-of-the-art algorithm even for k = 1. The accuracy of our theoretical analysis is also verified by the experiments.
{"title":"Efficiently computing reverse k furthest neighbors","authors":"Shenlu Wang, M. A. Cheema, Xuemin Lin, Ying Zhang, Dongxi Liu","doi":"10.1109/ICDE.2016.7498317","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498317","url":null,"abstract":"Given a set of facilities F, a set of users U and a query facility q, a reverse k furthest neighbors (RkFN) query retrieves every user u ∈ U for which q is one of its k-furthest facilities. RkFN query is the natural complement of reverse k-nearest neighbors (RkNN) query that returns every user u for which q is one of its k-nearest facilities. While RkNN query returns the users that are highly influenced by a query q, RkFN query aims at finding the users that are least influenced by a query q. RkFN query has many applications in location-based services, marketing, facility location, clustering, and recommendation systems etc. While there exist several algorithms that answer RkFN query for k = 1, we are the first to propose a solution for arbitrary value of k. Based on several interesting observations, we present an efficient algorithm to process the RkFN queries. We also present a rigorous theoretical analysis to study various important aspects of the problem and our algorithm. An extensive experimental study is conducted using both real and synthetic data sets, demonstrating that our algorithm outperforms the state-of-the-art algorithm even for k = 1. The accuracy of our theoretical analysis is also verified by the experiments.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"36 1","pages":"1110-1121"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75085311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we study a new type of Geo-Social K-Cover Group (GSKCG) queries that, given a set of query points and a social network, retrieves a minimum user group in which each user is socially related to at least k other users and the users' associated regions (e.g., familiar regions or service regions) can jointly cover all the query points. Albeit its practical usefulness, the GSKCG query problem is NP-hard. We consequently explore a set of effective pruning strategies to derive an efficient algorithm for finding the optimal solution. Moreover, we design a novel index structure tailored to our problem to further accelerate query processing. Extensive experiments demonstrate that our algorithm achieves desirable performance on real-life datasets.
{"title":"Geo-Social K-Cover Group queries for collaborative spatial computing","authors":"Yafei Li, Rui Chen, Jianliang Xu, Qiao Huang, Haibo Hu, Byron Choi","doi":"10.1109/ICDE.2016.7498399","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498399","url":null,"abstract":"In this paper, we study a new type of Geo-Social K-Cover Group (GSKCG) queries that, given a set of query points and a social network, retrieves a minimum user group in which each user is socially related to at least k other users and the users' associated regions (e.g., familiar regions or service regions) can jointly cover all the query points. Albeit its practical usefulness, the GSKCG query problem is NP-hard. We consequently explore a set of effective pruning strategies to derive an efficient algorithm for finding the optimal solution. Moreover, we design a novel index structure tailored to our problem to further accelerate query processing. Extensive experiments demonstrate that our algorithm achieves desirable performance on real-life datasets.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"41 1","pages":"1510-1511"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78075115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498226
Sungsu Lim, Junghoon Kim, Jae-Gil Lee
With regard to social network analysis, we concentrate on two widely-accepted building blocks: community detection and graph drawing. Although community detection and graph drawing have been studied separately, they have a great commonality, which means that it is possible to advance one field using the techniques of the other. In this paper, we propose a novel community detection algorithm for undirected graphs, called BlackHole, by importing a geometric embedding technique from graph drawing. Our proposed algorithm transforms the vertices of a graph to a set of points on a low-dimensional space whose coordinates are determined by a variant of graph drawing algorithms, following the overall procedure of spectral clustering. The set of points are then clustered using a conventional clustering algorithm to form communities. Our primary contribution is to prove that a common idea in graph drawing, which is characterized by consideration of repulsive forces in addition to attractive forces, improves the clusterability of an embedding. As a result, our algorithm has the advantages of being robust especially when the community structure is not easily detectable. Through extensive experiments, we have shown that BlackHole achieves the accuracy higher than or comparable to the state-of-the-art algorithms.
{"title":"BlackHole: Robust community detection inspired by graph drawing","authors":"Sungsu Lim, Junghoon Kim, Jae-Gil Lee","doi":"10.1109/ICDE.2016.7498226","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498226","url":null,"abstract":"With regard to social network analysis, we concentrate on two widely-accepted building blocks: community detection and graph drawing. Although community detection and graph drawing have been studied separately, they have a great commonality, which means that it is possible to advance one field using the techniques of the other. In this paper, we propose a novel community detection algorithm for undirected graphs, called BlackHole, by importing a geometric embedding technique from graph drawing. Our proposed algorithm transforms the vertices of a graph to a set of points on a low-dimensional space whose coordinates are determined by a variant of graph drawing algorithms, following the overall procedure of spectral clustering. The set of points are then clustered using a conventional clustering algorithm to form communities. Our primary contribution is to prove that a common idea in graph drawing, which is characterized by consideration of repulsive forces in addition to attractive forces, improves the clusterability of an embedding. As a result, our algorithm has the advantages of being robust especially when the community structure is not easily detectable. Through extensive experiments, we have shown that BlackHole achieves the accuracy higher than or comparable to the state-of-the-art algorithms.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"43 1","pages":"25-36"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79431904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2016-05-16DOI: 10.1109/ICDE.2016.7498307
Shengqi Yang, Fangqiu Han, Yinghui Wu, Xifeng Yan
Given a graph query Q posed on a knowledge graph G, top-k graph querying is to find k matches in G with the highest ranking score according to a ranking function. Fast top-k search in knowledge graphs is challenging as both graph traversal and similarity search are expensive. Conventional top-k graph search is typically based on threshold algorithm (TA), which can no long fit the demand in the new setting. This work proposes STAR, a top-k knowledge graph search framework. It has two components: (a) a fast top-k algorithm for star queries, and (b) an assembling algorithm for general graph queries. The assembling algorithm uses star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound. For top-k star graph query where an edge can be matched to a path with bounded length d, we develop a message passing algorithm, achieving time complexity O(d2|E| + md) and space complexity linear to d|V| (assuming the size of Q and k is bounded by a constant), where m is the maximum node degree in G. STAR can further be leveraged to answer general graph queries by decomposing a query to multiple star queries and joining their results later. Learning-based techniques to optimize query decomposition are also developed. We experimentally verify that STAR is 5-10 times faster than the state-of-the-art TA-style graph search algorithm, and 10-100 times faster than a belief propagation approach.
{"title":"Fast top-k search in knowledge graphs","authors":"Shengqi Yang, Fangqiu Han, Yinghui Wu, Xifeng Yan","doi":"10.1109/ICDE.2016.7498307","DOIUrl":"https://doi.org/10.1109/ICDE.2016.7498307","url":null,"abstract":"Given a graph query Q posed on a knowledge graph G, top-k graph querying is to find k matches in G with the highest ranking score according to a ranking function. Fast top-k search in knowledge graphs is challenging as both graph traversal and similarity search are expensive. Conventional top-k graph search is typically based on threshold algorithm (TA), which can no long fit the demand in the new setting. This work proposes STAR, a top-k knowledge graph search framework. It has two components: (a) a fast top-k algorithm for star queries, and (b) an assembling algorithm for general graph queries. The assembling algorithm uses star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound. For top-k star graph query where an edge can be matched to a path with bounded length d, we develop a message passing algorithm, achieving time complexity O(d2|E| + md) and space complexity linear to d|V| (assuming the size of Q and k is bounded by a constant), where m is the maximum node degree in G. STAR can further be leveraged to answer general graph queries by decomposing a query to multiple star queries and joining their results later. Learning-based techniques to optimize query decomposition are also developed. We experimentally verify that STAR is 5-10 times faster than the state-of-the-art TA-style graph search algorithm, and 10-100 times faster than a belief propagation approach.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"23 1","pages":"990-1001"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81304316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}