Betweenness centrality measures the importance of a vertex by quantifying the number of times it acts as an intermediate vertex on the shortest paths between other vertices. This measure is widely used in network analysis. In many applications, we wish to choose the k vertices with the maximum adaptive betweenness centrality, i.e., the betweenness centrality computed without the shortest paths already accounted for by previously chosen vertices. All previous methods are designed to compute the betweenness centrality in a fixed graph, so to solve the above task we would have to run them k times. In this paper, we present a method that solves the task directly, with an almost linear runtime regardless of the value of k. Our method first constructs a hypergraph that encodes the betweenness centrality, and then computes the adaptive betweenness centrality by examining this hypergraph. Our technique can also be used to handle other centrality measures. We theoretically prove that our method is accurate, and experimentally confirm that it is three orders of magnitude faster than previous methods. Exploiting this scalability, we experimentally demonstrate that strategies based on adaptive betweenness centrality are effective in important applications studied in the network science and database communities.
{"title":"Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches","authors":"Yuichi Yoshida","doi":"10.1145/2623330.2623626","DOIUrl":"https://doi.org/10.1145/2623330.2623626","url":null,"abstract":"Betweenness centrality measures the importance of a vertex by quantifying the number of times it acts as a midpoint of the shortest paths between other vertices. This measure is widely used in network analysis. In many applications, we wish to choose the k vertices with the maximum adaptive betweenness centrality, which is the betweenness centrality without considering the shortest paths that have been taken into account by already-chosen vertices. All previous methods are designed to compute the betweenness centrality in a fixed graph. Thus, to solve the above task, we have to run these methods $k$ times. In this paper, we present a method that directly solves the task, with an almost linear runtime no matter how large the value of k. Our method first constructs a hypergraph that encodes the betweenness centrality, and then computes the adaptive betweenness centrality by examining this graph. Our technique can be utilized to handle other centrality measures. We theoretically prove that our method is very accurate, and experimentally confirm that it is three orders of magnitude faster than previous methods. Relying on the scalability of our method, we experimentally demonstrate that strategies based on adaptive betweenness centrality are effective in important applications studied in the network science and database communities.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"121 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78291398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sofus A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani
It is our great pleasure to welcome you to the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The annual ACM SIGKDD conference is the premier international forum for data science, data mining, knowledge discovery and big data. It brings together researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2014 features plenary presentations, paper presentations, poster sessions, workshops, tutorials, exhibits, and the KDD Cup competition. We are happy to announce that this year we are partnering with Bloomberg to emphasize our theme of Data Science for Social Good. To this end, part of our workshop and tutorial program will be held at the Bloomberg facilities together with Bloomberg-specific events, all focusing on issues pertaining to social good. Today, you hear a lot about data science, big data and data-intensive computing. The core of this work is extracting knowledge and useful information from data, which for science leads to beautiful insights, and for applications leads to actions, alerts and decisions. The KDD community has always been at the center of this activity, and it is clear from this conference that it will continue to drive this broader field of data science. This year we had a record number of submissions. There were 1036 submissions to the Research Track, of which 151 papers were accepted, and 197 submissions to the Industry and Government Track, of which 44 papers were accepted. KDD also has a history of invited talks that are of broad interest to the KDD community. This year we chose to have 4 plenary talks, and a program committee selected 8 talks to present at the Industry and Government Track. A strength of the KDD conference is the number of workshops and tutorials co-located with it. This year there were 9 full-day workshops, 16 half-day workshops, and 12 tutorials. As part of our partnership with Bloomberg on the theme of social good, Bloomberg will host 3 workshops jointly located with our workshops at their New York office. Our community is a unique blend of industry and academia, ranging from people starting their careers to leaders in their respective fields. This year, we are piloting programs to facilitate networking among these groups. Specifically, we have a networking lounge where industry representatives and job-seekers can meet, and we help find good matches. We also have a networking event focused on defining what a data science career looks like, where senior members meet junior attendees to help them understand the skills needed and what a job in this discipline might entail.
{"title":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","authors":"Sofus A. Macskassy, C. Perlich, J. Leskovec, W. Wang, R. Ghani","doi":"10.1145/2623330","DOIUrl":"https://doi.org/10.1145/2623330","url":null,"abstract":"It is our great pleasure to welcome you to the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). The annual ACM SIGKDD conference is the premier international forum for data science, data mining, knowledge discovery and big data. It brings together researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2014 features plenary presentations, paper presentations, poster sessions, workshops, tutorials, exhibits, and the KDD Cup competition. We are happy to announce that this year we are partnering with Bloomberg to emphasize our theme of Data Science for Social Good. To this end, part of our workshop and tutorial program will be held at the Bloomberg facilities together with Bloomberg-specific events, all focusing on issues pertaining to social good. \u0000 \u0000Today, you hear a lot about data science, big data and data intensive computing. The core of this work is extracting knowledge and useful information from data, which for science leads to beautiful insights, and for applications leads to actions, alerts and decisions. The KDD community has always been at the center of this activity and it is clear from this conference that it will continue to drive this broader field of data science. \u0000 \u0000This year we had a record number of submissions. There were 1036 submissions to the Research Track, and 151 papers were accepted. There were 197 submissions to the Industry and Government Track, and 44 papers were accepted. \u0000 \u0000KDD also has a history of inviting talks that are of broad interest to the KDD community. This year we chose to have 4 plenary talks. A program committee also selected 8 talks to present at the Industry and Government track. \u0000 \u0000A strength of the KDD conference is the number of workshops and tutorials that are co-located with it. This year there were 9 full-day workshops, 16 half-day workshops, and 12 tutorials. As part of our partnership with Bloomberg on the theme of social good, Bloomberg will have 3 workshops jointly located with our workshops at their New York Office. \u0000 \u0000Our community is a unique blend of industry and academia, ranging from people starting their career to leaders in their respective fields. This year, we are piloting programs to facilitate networking amongst these groups. Specifically, we have a networking lounge for industry and job-seekers to meet and we helped find good matches. 
We also have a networking event focused on defining what a data science career looks like and have senior members meet young people to help them understand the skills needed and what a job in this discipline might entail.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85850253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy, so we need to sanitize network data before release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in the network. This bound implies that less noise needs to be injected than in existing approaches. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential structural properties of networks, such as the degree distribution, the shortest-path-length distribution, and influential nodes.
{"title":"Differentially private network data release via structural inference","authors":"Qian Xiao, Rui Chen, K. Tan","doi":"10.1145/2623330.2623642","DOIUrl":"https://doi.org/10.1145/2623330.2623642","url":null,"abstract":"Information networks, such as social media and email networks, often contain sensitive information. Releasing such network data could seriously jeopardize individual privacy. Therefore, we need to sanitize network data before the release. In this paper, we present a novel data sanitization solution that infers a network's structure in a differentially private manner. We observe that, by estimating the connection probabilities between vertices instead of considering the observed edges directly, the noise scale enforced by differential privacy can be greatly reduced. Our proposed method infers the network structure by using a statistical hierarchical random graph (HRG) model. The guarantee of differential privacy is achieved by sampling possible HRG structures in the model space via Markov chain Monte Carlo (MCMC). We theoretically prove that the sensitivity of such inference is only O(log n), where n is the number of vertices in a network. This bound implies less noise to be injected than those of existing works. We experimentally evaluate our approach on four real-life network datasets and show that our solution effectively preserves essential network structural properties like degree distribution, shortest path length distribution and influential nodes.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88217856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Feng, M. Ghassemi, Thomas Brennan, John Ellenberger, I. Hussain, R. Mark
Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional database management systems (DBMS) often limit the use of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analysis use separate software platforms. Exporting and importing large amounts of data across platforms requires a significant amount of computational and I/O resources, and potentially puts sensitive data at risk. In this tutorial, participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP's cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. MIMIC is an open-access critical care EHR archive (over 4TB in size) consisting of structured, unstructured and waveform data. Furthermore, this tutorial will educate participants on how a combination of dynamic querying and an in-memory DBMS may enhance the management and analysis of complex clinical data.
{"title":"Management and analytic of biomedical big data with cloud-based in-memory database and dynamic querying: a hands-on experience with real-world data","authors":"M. Feng, M. Ghassemi, Thomas Brennan, John Ellenberger, I. Hussain, R. Mark","doi":"10.1145/2623330.2630806","DOIUrl":"https://doi.org/10.1145/2623330.2630806","url":null,"abstract":"Analyzing Biomedical Big Data (BBD) is computationally expensive due to high dimensionality and large data volume. Performance and scalability issues of traditional database management systems (DBMS) often limit the usage of more sophisticated and complex data queries and analytic models. Moreover, in the conventional setting, data management and analysis use separate software platforms. Exporting and importing large amounts of data across platforms require a significant amount of computational and I/O resources, as well as potentially putting sensitive data at a security risk. In this tutorial, the participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP's cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. MIMIC is an open-access critical care EHR archive (over 4TB in size) and consists of structured, unstructured and waveform data. Furthermore, this tutorial will seek to educate the participants on how a combination of dynamic querying, and in-memory DBMS may enhance the management and analysis of complex clinical data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87211686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online social networks offering various services have become ubiquitous in our daily life. Meanwhile, users nowadays are usually involved in multiple online social networks simultaneously to enjoy the specific services provided by different networks. Formally, social networks that share some common users are called partially aligned networks. In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. In multiple partially aligned social networks, users can be extensively correlated with each other by various connections. To categorize these diverse connections among users, 7 "intra-network social meta paths" and 4 categories of "inter-network social meta paths" are proposed in this paper. These "social meta paths" can cover a wide variety of connection information in the network, some of which is helpful for solving the multi-network link prediction problem and some of which is not. To utilize the useful connections, a subset of the most informative "social meta paths" is selected, a process formally defined as "social meta path selection" in this paper. An effective general link formation prediction framework, Mli (Multi-network Link Identifier), is proposed to solve the multi-network link (formation) prediction problem. Built with heterogeneous topological features extracted from the selected "social meta paths" in the multiple partially aligned social networks, Mli can refine and disambiguate the prediction results reciprocally across all aligned networks. Extensive experiments conducted on real-world partially aligned heterogeneous networks, Foursquare and Twitter, demonstrate that Mli solves the multi-network link prediction problem very well.
{"title":"Meta-path based multi-network collective link prediction","authors":"Jiawei Zhang, Philip S. Yu, Zhi-Hua Zhou","doi":"10.1145/2623330.2623645","DOIUrl":"https://doi.org/10.1145/2623330.2623645","url":null,"abstract":"Online social networks offering various services have become ubiquitous in our daily life. Meanwhile, users nowadays are usually involved in multiple online social networks simultaneously to enjoy specific services provided by different networks. Formally, social networks that share some common users are named as partially aligned networks. In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. In multiple partially aligned social networks, users can be extensively correlated with each other by various connections. To categorize these diverse connections among users, 7 \"intra-network social meta paths\" and 4 categories of \"inter-network social meta paths\" are proposed in this paper. These \"social meta paths\" can cover a wide variety of connection information in the network, some of which can be helpful for solving the multi-network link prediction problem but some can be not. To utilize useful connection, a subset of the most informative \"social meta paths\" are picked, the process of which is formally defined as \"social meta path selection\" in this paper. An effective general link formation prediction framework, Mli (Multi-network Link Identifier), is proposed in this paper to solve the multi-network link (formation) prediction problem. Built with heterogenous topological features extracted based on the selected \"social meta paths\" in the multiple partially aligned social networks, Mli can help refine and disambiguate the prediction results reciprocally in all aligned networks. Extensive experiments conducted on real-world partially aligned heterogeneous networks, Foursquare and Twitter, demonstrate that Mli can solve the multi-network link prediction problem very well.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"147 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91554326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many diverse settings, aggregated opinions of others play an increasingly dominant role in shaping individual decision making. One key prerequisite of harnessing "crowd wisdom" is the independence of individuals' opinions, yet in real settings collective opinions are rarely simple aggregations of independent minds. Recent experimental studies document that disclosing prior collective opinions distorts individuals' decision making as well as their perceptions of quality and value, highlighting a fundamental disconnect from current modeling efforts: how do we model social influence and its impact on systems that are constantly evolving? In this paper, we develop a mechanistic framework to model the social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. We find our method successfully captures the dynamics of rating growth, helping us separate social influence bias from inherent value. Using large-scale longitudinal customer rating datasets, we demonstrate that our model not only effectively assesses social influence bias, but also accurately predicts the long-term cumulative growth of ratings solely from early rating trajectories. We believe our framework will play an increasingly important role as our understanding of social processes deepens. It promotes strategies to untangle manipulations and social biases, and provides insights towards a more reliable and effective design of social platforms.
{"title":"Quantifying herding effects in crowd wisdom","authors":"Ting Wang, Dashun Wang, Fei Wang","doi":"10.1145/2623330.2623720","DOIUrl":"https://doi.org/10.1145/2623330.2623720","url":null,"abstract":"In many diverse settings, aggregated opinions of others play an increasingly dominant role in shaping individual decision making. One key prerequisite of harnessing the \"crowd wisdom\" is the independency of individuals' opinions, yet in real settings collective opinions are rarely simple aggregations of independent minds. Recent experimental studies document that disclosing prior collective opinions distorts individuals' decision making as well as their perceptions of quality and value, highlighting a fundamental disconnect from current modeling efforts: How to model social influence and its impact on systems that are constantly evolving? In this paper, we develop a mechanistic framework to model social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. We find our method successfully captures the dynamics of rating growth, helping us separate social influence bias from inherent values. Using large-scale longitudinal customer rating datasets, we demonstrate that our model not only effectively assesses social influence bias, but also accurately predicts long-term cumulative growth of ratings solely based on early rating trajectories. We believe our framework will play an increasingly important role as our understanding of social processes deepens. It promotes strategies to untangle manipulations and social biases and provides insights towards a more reliable and effective design of social platforms.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82018007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
User recommender systems are a key component of any online social networking platform: they help users grow their networks faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW ("Who to Follow and Why"), a stochastic topic model for link prediction over directed, node-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a "topical" or a "social" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible for the link creation. A social link is recommended between users who share a large social neighborhood: in this case the explanation is the set of neighbors most likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in link prediction and the quality of the associated explanations.
{"title":"Who to follow and why: link prediction with explanations","authors":"Nicola Barbieri, F. Bonchi, G. Manco","doi":"10.1145/2623330.2623733","DOIUrl":"https://doi.org/10.1145/2623330.2623733","url":null,"abstract":"User recommender systems are a key component in any on-line social networking platform: they help the users growing their network faster, thus driving engagement and loyalty. In this paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW (\"Who to Follow and Why\"), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a \"topical\" or a \"social\" link, and depending on this decision it produces a different type of explanation. A topical link is recommended between a user interested in a topic and a user authoritative in that topic: the explanation in this case is a set of binary features describing the topic responsible of the link creation. A social link is recommended between users which share a large social neighborhood: in this case the explanation is the set of neighbors which are more likely to be responsible for the link creation. Our experimental assessment on real-world data confirms the accuracy of WTFW in the link prediction and the quality of the associated explanations.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88935169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the past few years there has been an explosion of social networks in the online world. Users flock to these networks, creating profiles and linking themselves to other individuals. Connecting online has a small cost compared to the physical world, leading to a proliferation of connections, many of which carry little value or importance. Understanding the strength and nature of these relationships is paramount to anyone interested in making use of online social network data. In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. The Strong Triadic Closure principle stipulates that it is not possible for two individuals to have a strong relationship with a common friend and not know each other. We consider the problem of labeling the ties of a social network as strong or weak so as to enforce the Strong Triadic Closure property. We formulate this as a novel combinatorial optimization problem, and we study it theoretically. Although the problem is NP-hard, we identify cases where efficient algorithms with provable approximation guarantees exist. We perform experiments on real data, and we show that there is a correlation between the labeling we obtain and empirical metrics of tie strength, and that weak edges act as bridges between different communities in the network. Finally, we study extensions and variations of our problem both theoretically and experimentally.
{"title":"Using strong triadic closure to characterize ties in social networks","authors":"Stavros Sintos, Panayiotis Tsaparas","doi":"10.1145/2623330.2623664","DOIUrl":"https://doi.org/10.1145/2623330.2623664","url":null,"abstract":"In the past few years there has been an explosion of social networks in the online world. Users flock these networks, creating profiles and linking themselves to other individuals. Connecting online has a small cost compared to the physical world, leading to a proliferation of connections, many of which carry little value or importance. Understanding the strength and nature of these relationships is paramount to anyone interesting in making use of the online social network data. In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. The Strong Triadic Closure principle stipulates that it is not possible for two individuals to have a strong relationship with a common friend and not know each other. We consider the problem of labeling the ties of a social network as strong or weak so as to enforce the Strong Triadic Closure property. We formulate the problem as a novel combinatorial optimization problem, and we study it theoretically. Although the problem is NP-hard, we are able to identify cases where there exist efficient algorithms with provable approximation guarantees. We perform experiments on real data, and we show that there is a correlation between the labeling we obtain and empirical metrics of tie strength, and that weak edges act as bridges between different communities in the network. Finally, we study extensions and variations of our problem both theoretically and experimentally.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89129980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Location-based data is increasingly prevalent with the rapid growth and adoption of mobile devices. In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. Modeling and predicting a spatial distribution for an individual is a challenging problem given both (a) the typical sparsity of data at the individual level and (b) the heterogeneity of spatial mobility patterns across individuals. We investigate the application of kernel density estimation (KDE) to this problem using a mixture model approach that can interpolate between an individual's data and broader patterns in the population as a whole. The mixture-KDE approach is evaluated on two large geolocation/check-in data sets, from Twitter and Gowalla, with comparisons to non-KDE baselines, using both log-likelihood and detection of simulated identity theft as evaluation metrics. Our experimental results indicate that the mixture-KDE method provides a useful and accurate methodology for capturing and predicting individual-level spatial patterns in the presence of noisy and sparse data.
{"title":"Modeling human location data with mixtures of kernel densities","authors":"Moshe Lichman, Padhraic Smyth","doi":"10.1145/2623330.2623681","DOIUrl":"https://doi.org/10.1145/2623330.2623681","url":null,"abstract":"Location-based data is increasingly prevalent with the rapid increase and adoption of mobile devices. In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. Modeling and predicting a spatial distribution for an individual is a challenging problem given both (a) the typical sparsity of data at the individual level and (b) the heterogeneity of spatial mobility patterns across individuals. We investigate the application of kernel density estimation (KDE) to this problem using a mixture model approach that can interpolate between an individual's data and broader patterns in the population as a whole. The mixture-KDE approach is evaluated on two large geolocation/check-in data sets, from Twitter and Gowalla, with comparisons to non-KDE baselines, using both log-likelihood and detection of simulated identity theft as evaluation metrics. Our experimental results indicate that the mixture-KDE method provides a useful and accurate methodology for capturing and predicting individual-level spatial patterns in the presence of noisy and sparse data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87195669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data for this type of research is patient Electronic Medical Records (EMR). However, patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data-driven phenotyping framework called Pacifier (PAtient reCord densIFIER), in which we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in the original patient EMR, whose values evolve smoothly over time. We propose two formulations to achieve this goal. One is the Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is the Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm capable of solving both formulations. Finally, we validate Pacifier on two real-world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks is improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD, at the diagnosis-group granularity). We also illustrate some interesting phenotypes derived from our data.
{"title":"From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records","authors":"Jiayu Zhou, Fei Wang, Jianying Hu, Jieping Ye","doi":"10.1145/2623330.2623711","DOIUrl":"https://doi.org/10.1145/2623330.2623711","url":null,"abstract":"Inferring phenotypic patterns from population-scale clinical data is a core computational task in the development of personalized medicine. One important source of data on which to conduct this type of research is patient Electronic Medical Records (EMR). However, the patient EMRs are typically sparse and noisy, which creates significant challenges if we use them directly to represent patient phenotypes. In this paper, we propose a data driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. Specifically, we assume that each derived phenotype is composed of a subset of the medical features contained in original patient EMR, whose value evolves smoothly over time. We propose two formulations to achieve such goal. One is Individual Basis Approach (IBA), which assumes the phenotypes are different for every patient. The other is Shared Basis Approach (SBA), which assumes the patient population shares a common set of phenotypes. We develop an efficient optimization algorithm that is capable of resolving both problems efficiently. Finally we validate Pacifier on two real world EMR cohorts for the tasks of early prediction of Congestive Heart Failure (CHF) and End Stage Renal Disease (ESRD). Our results show that the predictive performance in both tasks can be improved significantly by the proposed algorithms (average AUC score improved from 0.689 to 0.816 on CHF, and from 0.756 to 0.838 on ESRD respectively, on diagnosis group granularity). We also illustrate some interesting phenotypes derived from our data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86547565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}