Betweenness centrality measures the importance of a vertex by quantifying the number of times it acts as an intermediate vertex on the shortest paths between other vertices. This measure is widely used in network analysis. In many applications, we wish to choose the k vertices with the maximum adaptive betweenness centrality, i.e., the betweenness centrality computed without the shortest paths already accounted for by previously chosen vertices. All previous methods are designed to compute the betweenness centrality in a fixed graph, so to solve the above task we would have to run them k times. In this paper, we present a method that solves the task directly, with an almost linear runtime regardless of the value of k. Our method first constructs a hypergraph that encodes the betweenness centrality, and then computes the adaptive betweenness centrality by examining this hypergraph. Our technique can also be used to handle other centrality measures. We theoretically prove that our method is accurate, and experimentally confirm that it is three orders of magnitude faster than previous methods. Exploiting this scalability, we experimentally demonstrate that strategies based on adaptive betweenness centrality are effective in important applications studied in the network science and database communities.
{"title":"Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches","authors":"Yuichi Yoshida","doi":"10.1145/2623330.2623626","DOIUrl":"https://doi.org/10.1145/2623330.2623626","url":null,"abstract":"Betweenness centrality measures the importance of a vertex by quantifying the number of times it acts as a midpoint of the shortest paths between other vertices. This measure is widely used in network analysis. In many applications, we wish to choose the k vertices with the maximum adaptive betweenness centrality, which is the betweenness centrality without considering the shortest paths that have been taken into account by already-chosen vertices. All previous methods are designed to compute the betweenness centrality in a fixed graph. Thus, to solve the above task, we have to run these methods $k$ times. In this paper, we present a method that directly solves the task, with an almost linear runtime no matter how large the value of k. Our method first constructs a hypergraph that encodes the betweenness centrality, and then computes the adaptive betweenness centrality by examining this graph. Our technique can be utilized to handle other centrality measures. We theoretically prove that our method is very accurate, and experimentally confirm that it is three orders of magnitude faster than previous methods. Relying on the scalability of our method, we experimentally demonstrate that strategies based on adaptive betweenness centrality are effective in important applications studied in the network science and database communities.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"121 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78291398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sofus A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani
It is our great pleasure to welcome you to the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The annual ACM SIGKDD conference is the premier international forum for data science, data mining, knowledge discovery and big data. It brings together researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2014 features plenary presentations, paper presentations, poster sessions, workshops, tutorials, exhibits, and the KDD Cup competition. We are happy to announce that this year we are partnering with Bloomberg to emphasize our theme of Data Science for Social Good. To this end, part of our workshop and tutorial program will be held at the Bloomberg facilities together with Bloomberg-specific events, all focusing on issues pertaining to social good. Today, you hear a lot about data science, big data and data-intensive computing. The core of this work is extracting knowledge and useful information from data, which for science leads to beautiful insights, and for applications leads to actions, alerts and decisions. The KDD community has always been at the center of this activity, and it is clear from this conference that it will continue to drive this broader field of data science. This year we had a record number of submissions. There were 1036 submissions to the Research Track, of which 151 papers were accepted, and 197 submissions to the Industry and Government Track, of which 44 papers were accepted. KDD also has a history of invited talks that are of broad interest to the KDD community. This year we chose to have 4 plenary talks, and a program committee selected 8 talks to present at the Industry and Government Track. A strength of the KDD conference is the number of workshops and tutorials co-located with it. This year there were 9 full-day workshops, 16 half-day workshops, and 12 tutorials. As part of our partnership with Bloomberg on the theme of social good, Bloomberg will host 3 workshops jointly located with our workshops at their New York office. Our community is a unique blend of industry and academia, ranging from people starting their careers to leaders in their respective fields. This year, we are piloting programs to facilitate networking among these groups. Specifically, we have a networking lounge where industry representatives and job-seekers can meet, and we help find good matches. We also have a networking event focused on defining what a data science career looks like, where senior members meet junior attendees to help them understand the skills needed and what a job in this discipline might entail.
{"title":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","authors":"Sofus A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani","doi":"10.1145/2623330","DOIUrl":"https://doi.org/10.1145/2623330","url":null,"abstract":"It is our great pleasure to welcome you to the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The annual ACM SIGKDD conference is the premier international forum for data science, data mining, knowledge discovery and big data. It brings together researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2014 features plenary presentations, paper presentations, poster sessions, workshops, tutorials, exhibits, and the KDD Cup competition. We are happy to announce that this year we are partnering with Bloomberg to emphasize our theme of Data Science for Social Good. To this end, part of our workshop and tutorial program will be held at the Bloomberg facilities together with Bloomberg-specific events, all focusing on issues pertaining to social good. \u0000 \u0000Today, you hear a lot about data science, big data and data intensive computing. The core of this work is extracting knowledge and useful information from data, which for science leads to beautiful insights, and for applications leads to actions, alerts and decisions. The KDD community has always been at the center of this activity and it is clear from this conference that it will continue to drive this broader field of data science. \u0000 \u0000This year we had a record number of submissions. There were 1036 submissions to the Research Track, and 151 papers were accepted. There were 197 submissions to the Industry and Government Track, and 44 papers were accepted. \u0000 \u0000KDD also has a history of inviting talks that are of broad interest to the KDD community. This year we chose to have 4 plenary talks. A program committee also selected 8 talks to present at the Industry and Government track. \u0000 \u0000A strength of the KDD conference is the number of workshops and tutorials that are co-located with it. This year there were 9 full-day workshops, 16 half-day workshops, and 12 tutorials. As part of our partnership with Bloomberg on the theme of social good, Bloomberg will have 3 workshops jointly located with our workshops at their New York Office. \u0000 \u0000Our community is a unique blend of industry and academia, ranging from people starting their career to leaders in their respective fields. This year, we are piloting programs to facilitate networking amongst these groups. Specifically, we have a networking lounge for industry and job-seekers to meet and we helped find good matches. 
We also have a networking event focused on defining what a data science career looks like and have senior members meet young people to help them understand the skills needed and what a job in this discipline might entail.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85850253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy, so we need to sanitize network data before release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in the network. This bound implies that less noise needs to be injected than in existing approaches. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential structural properties of networks, such as the degree distribution, the shortest-path-length distribution, and influential nodes.
{"title":"Differentially private network data release via structural inference","authors":"Qian Xiao, Rui Chen, K. Tan","doi":"10.1145/2623330.2623642","DOIUrl":"https://doi.org/10.1145/2623330.2623642","url":null,"abstract":"Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies less noise to be injected than those of existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88217856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Feng, M. Ghassemi, Thomas Brennan, John Ellenberger, I. Hussain, R. Mark
Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional database management systems (DBMS) often limit the use of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analysis use separate software platforms. Exporting and importing large amounts of data across platforms requires a significant amount of computational and I/O resources, and potentially puts sensitive data at risk. In this tutorial, participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP's cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. MIMIC is an open-access critical care EHR archive (over 4TB in size) consisting of structured, unstructured and waveform data. Furthermore, this tutorial will educate participants on how a combination of dynamic querying and an in-memory DBMS may enhance the management and analysis of complex clinical data.
{"title":"Management and analytic of biomedical big data with cloud-based in-memory database and dynamic querying: a hands-on experience with real-world data","authors":"M. Feng, M. Ghassemi, Thomas Brennan, John Ellenberger, I. Hussain, R. Mark","doi":"10.1145/2623330.2630806","DOIUrl":"https://doi.org/10.1145/2623330.2630806","url":null,"abstract":"Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional database management systems (DBMS) often limit the usage of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analysis use separate software platforms. Exporting and importing large amounts of data across platforms require a significant amount of computational and I/O resources, as well as potentially putting sensitive data at a security risk. In this tutorial, the participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP's cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. MIMIC is an open-access critical care EHR archive (over 4TB in size) and consists of structured, unstructured and waveform data. Furthermore, this tutorial will seek to educate the participants on how a combination of dynamic querying, and in-memory DBMS may enhance the management and analysis of complex clinical data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87211686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online social networks offering various services have become ubiquitous in our daily life. Meanwhile, users nowadays are usually involved in multiple online social networks simultaneously to enjoy the specific services provided by different networks. Formally, social networks that share some common users are called partially aligned networks. In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. In multiple partially aligned social networks, users can be extensively correlated with each other by various connections. To categorize these diverse connections among users, 7 "intra-network social meta paths" and 4 categories of "inter-network social meta paths" are proposed in this paper. These "social meta paths" can cover a wide variety of connection information in the network, some of which is helpful for solving the multi-network link prediction problem and some of which is not. To utilize the useful connections, a subset of the most informative "social meta paths" is selected, a process formally defined as "social meta path selection" in this paper. An effective general link formation prediction framework, Mli (Multi-network Link Identifier), is proposed to solve the multi-network link (formation) prediction problem. Built with heterogeneous topological features extracted from the selected "social meta paths" in the multiple partially aligned social networks, Mli can refine and disambiguate the prediction results reciprocally across all aligned networks. Extensive experiments conducted on real-world partially aligned heterogeneous networks, Foursquare and Twitter, demonstrate that Mli solves the multi-network link prediction problem very well.
{"title":"Meta-path based multi-network collective link prediction","authors":"Jiawei Zhang, Philip S. Yu, Zhi-Hua Zhou","doi":"10.1145/2623330.2623645","DOIUrl":"https://doi.org/10.1145/2623330.2623645","url":null,"abstract":"Online social networks offering various services have become ubiquitous in our daily life. Meanwhile, users nowadays are usually involved in multiple online social networks simultaneously to enjoy specific services provided by different networks. Formally, social networks that share some common users are named as partially aligned networks. In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. In multiple partially aligned social networks, users can be extensively correlated with each other by various connections. To categorize these diverse connections among users, 7 \"intra-network social meta paths\" and 4 categories of \"inter-network social meta paths\" are proposed in this paper. These \"social meta paths\" can cover a wide variety of connection information in the network, some of which can be helpful for solving the multi-network link prediction problem but some can be not. To utilize useful connection, a subset of the most informative \"social meta paths\" are picked, the process of which is formally defined as \"social meta path selection\" in this paper. An effective general link formation prediction framework, Mli (Multi-network Link Identifier), is proposed in this paper to solve the multi-network link (formation) prediction problem. Built with heterogenous topological features extracted based on the selected \"social meta paths\" in the multiple partially aligned social networks, Mli can help refine and disambiguate the prediction results reciprocally in all aligned networks. Extensive experiments conducted on real-world partially aligned heterogeneous networks, Foursquare and Twitter, demonstrate that Mli can solve the multi-network link prediction problem very well.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"147 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91554326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many diverse settings, aggregated opinions of others play an increasingly dominant role in shaping individual decision making. One key prerequisite of harnessing "crowd wisdom" is the independence of individuals' opinions, yet in real settings collective opinions are rarely simple aggregations of independent minds. Recent experimental studies document that disclosing prior collective opinions distorts individuals' decision making as well as their perceptions of quality and value, highlighting a fundamental disconnect from current modeling efforts: how do we model social influence and its impact on systems that are constantly evolving? In this paper, we develop a mechanistic framework to model the social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. We find our method successfully captures the dynamics of rating growth, helping us separate social influence bias from inherent value. Using large-scale longitudinal customer rating datasets, we demonstrate that our model not only effectively assesses social influence bias, but also accurately predicts the long-term cumulative growth of ratings solely from early rating trajectories. We believe our framework will play an increasingly important role as our understanding of social processes deepens. It promotes strategies to untangle manipulations and social biases, and provides insights towards a more reliable and effective design of social platforms.
{"title":"Quantifying herding effects in crowd wisdom","authors":"Ting Wang, Dashun Wang, Fei Wang","doi":"10.1145/2623330.2623720","DOIUrl":"https://doi.org/10.1145/2623330.2623720","url":null,"abstract":"In many diverse settings, aggregated opinions of others play an increasingly dominant role in shaping individual decision making. One key prerequisite of harnessing the \"crowd wisdom\" is the independency of individuals' opinions, yet in real settings collective opinions are rarely simple aggregations of independent minds. Recent experimental studies document that disclosing prior collective opinions distorts individuals' decision making as well as their perceptions of quality and value, highlighting a fundamental disconnect from current modeling efforts: How to model social influence and its impact on systems that are constantly evolving? In this paper, we develop a mechanistic framework to model social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. We find our method successfully captures the dynamics of rating growth, helping us separate social influence bias from inherent values. Using large-scale longitudinal customer rating datasets, we demonstrate that our model not only effectively assesses social influence bias, but also accurately predicts long-term cumulative growth of ratings solely based on early rating trajectories. We believe our framework will play an increasingly important role as our understanding of social processes deepens. It promotes strategies to untangle manipulations and social biases and provides insights towards a more reliable and effective design of social platforms.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82018007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
User recommender systems are a key component of any online social networking platform: they help users grow their networks faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed, node-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible for the link creation. A social link is recommended between users who share a large social neighborhood: in this case the explanation is the set of neighbors most likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in link prediction and the quality of the associated explanations.
{"title":"Who to follow and why: link prediction with explanations","authors":"Nicola Barbieri, F. Bonchi, G. Manco","doi":"10.1145/2623330.2623733","DOIUrl":"https://doi.org/10.1145/2623330.2623733","url":null,"abstract":"User recommender systems are a key component in any on-line social networking platform: they help the users growing their network faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW (\"Who to Follow and Why\"), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a \"topical\" or a \"social\" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible of the link creation. A social link is recommended between users which share a large social neighborhood: in this case the explanation is the set of neighbors which are more likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in the link prediction and the quality of the associated explanations.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88935169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the past few years there has been an explosion of social networks in the online world. Users flock to these networks, creating profiles and linking themselves to other individuals. Connecting online has a small cost compared to the physical world, leading to a proliferation of connections, many of which carry little value or importance. Understanding the strength and nature of these relationships is paramount to anyone interested in making use of online social network data. In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. The Strong Triadic Closure principle stipulates that it is not possible for two individuals to have a strong relationship with a common friend and not know each other. We consider the problem of labeling the ties of a social network as strong or weak so as to enforce the Strong Triadic Closure property. We formulate this as a novel combinatorial optimization problem, and we study it theoretically. Although the problem is NP-hard, we identify cases where efficient algorithms with provable approximation guarantees exist. We perform experiments on real data, and we show that there is a correlation between the labeling we obtain and empirical metrics of tie strength, and that weak edges act as bridges between different communities in the network. Finally, we study extensions and variations of our problem both theoretically and experimentally.
{"title":"Using strong triadic closure to characterize ties in social networks","authors":"Stavros Sintos, Panayiotis Tsaparas","doi":"10.1145/2623330.2623664","DOIUrl":"https://doi.org/10.1145/2623330.2623664","url":null,"abstract":"In the past few years there has been an explosion of social networks in the online world. Users flock these networks, creating profiles and linking themselves to other individuals. Connecting online has a small cost compared to the physical world, leading to a proliferation of connections, many of which carry little value or importance. Understanding the strength and nature of these relationships is paramount to anyone interesting in making use of the online social network data. In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. The Strong Triadic Closure principle stipulates that it is not possible for two individuals to have a strong relationship with a common friend and not know each other. We consider the problem of labeling the ties of a social network as strong or weak so as to enforce the Strong Triadic Closure property. We formulate the problem as a novel combinatorial optimization problem, and we study it theoretically. Although the problem is NP-hard, we are able to identify cases where there exist efficient algorithms with provable approximation guarantees. We perform experiments on real data, and we show that there is a correlation between the labeling we obtain and empirical metrics of tie strength, and that weak edges act as bridges between different communities in the network. Finally, we study extensions and variations of our problem both theoretically and experimentally.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89129980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Location-based data is increasingly prevalent with the rapid growth and adoption of mobile devices. In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. Modeling and predicting a spatial distribution for an individual is a challenging problem given both (a) the typical sparsity of data at the individual level and (b) the heterogeneity of spatial mobility patterns across individuals. We investigate the application of kernel density estimation (KDE) to this problem using a mixture model approach that can interpolate between an individual's data and broader patterns in the population as a whole. The mixture-KDE approach is evaluated on two large geolocation/check-in data sets, from Twitter and Gowalla, with comparisons to non-KDE baselines, using both log-likelihood and detection of simulated identity theft as evaluation metrics. Our experimental results indicate that the mixture-KDE method provides a useful and accurate methodology for capturing and predicting individual-level spatial patterns in the presence of noisy and sparse data.
{"title":"Modeling human location data with mixtures of kernel densities","authors":"Moshe Lichman, Padhraic Smyth","doi":"10.1145/2623330.2623681","DOIUrl":"https://doi.org/10.1145/2623330.2623681","url":null,"abstract":"Location-based data is increasingly prevalent with the rapid increase and adoption of mobile devices. In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. Modeling and predicting a spatial distribution for an individual is a challenging problem given both (a) the typical sparsity of data at the individual level and (b) the heterogeneity of spatial mobility patterns across individuals. We investigate the application of kernel density estimation (KDE) to this problem using a mixture model approach that can interpolate between an individual's data and broader patterns in the population as a whole. The mixture-KDE approach is evaluated on two large geolocation/check-in data sets, from Twitter and Gowalla, with comparisons to non-KDE baselines, using both log-likelihood and detection of simulated identity theft as evaluation metrics. Our experimental results indicate that the mixture-KDE method provides a useful and accurate methodology for capturing and predicting individual-level spatial patterns in the presence of noisy and sparse data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87195669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data for this type of research is patient Electronic Medical Records (EMR). However, patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data-driven phenotyping framework called Pacifier (PAtient reCord densIFIER), in which we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in the original patient EMR, whose values evolve smoothly over time. We propose two formulations to achieve this goal. One is the Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is the Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm capable of solving both formulations. Finally, we validate Pacifier on two real-world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks is improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD, at the diagnosis-group granularity). We also illustrate some interesting phenotypes derived from our data.
{"title":"From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records","authors":"Jiayu Zhou, Fei Wang, Jianying Hu, Jieping Ye","doi":"10.1145/2623330.2623711","DOIUrl":"https://doi.org/10.1145/2623330.2623711","url":null,"abstract":"Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data on which to conduct this type of research is patient Electronic Medical Records (EMR). However, the patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in original patient EMR, whose value evolves smoothly over time. We propose two formulations to achieve such goal. One is Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm that is capable of resolving both problems efficiently. Finally we validate Pacifier on two real world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD respectively, on diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86547565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}