
Latest publications: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

Adaptive collective routing using Gaussian process dynamic congestion models
Siyuan Liu, Yisong Yue, R. Krishnan
We consider the problem of adaptively routing a fleet of cooperative vehicles within a road network in the presence of uncertain and dynamic congestion conditions. To tackle this problem, we first propose a Gaussian Process Dynamic Congestion Model that can effectively characterize both the dynamics and the uncertainty of congestion conditions. Our model is efficient and thus facilitates real-time adaptive routing in the face of uncertainty. Using this congestion model, we develop an efficient algorithm for non-myopic adaptive routing to minimize the collective travel time of all vehicles in the system. A key property of our approach is the ability to efficiently reason about the long-term value of exploration, which enables collectively balancing the exploration/exploitation trade-off for entire fleets of vehicles. We validate our approach based on traffic data from two large Asian cities. We show that our congestion model is effective in modeling dynamic congestion conditions. We also show that our routing algorithm generates significantly faster routes compared to standard baselines, and achieves near-optimal performance compared to an omniscient routing algorithm. We also present the results from a preliminary field study, which showcases the efficacy of our approach.
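To make the modeling idea concrete, here is a minimal Gaussian-process regression sketch in NumPy that predicts travel time on a single road segment from the hour of day, with a posterior variance that quantifies uncertainty. The RBF kernel, the toy observations, and the noise level are all illustrative assumptions, not the paper's actual congestion model.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

# toy observations: travel time (minutes) on one segment vs. hour of day
hours = np.array([7.0, 8.0, 9.0, 17.0, 18.0])
times = np.array([12.0, 25.0, 18.0, 22.0, 28.0])
noise = 1e-2  # assumed observation noise variance

K = rbf_kernel(hours, hours) + noise * np.eye(len(hours))
K_inv = np.linalg.inv(K)

def predict(x):
    """GP posterior mean and variance of travel time at query hours x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k_star = rbf_kernel(x, hours)
    mean = k_star @ K_inv @ times
    cov = rbf_kernel(x, x) - k_star @ K_inv @ k_star.T
    return mean, np.diag(cov)
```

The posterior variance is what an exploration/exploitation routing strategy would reason about: it is low near observed hours and reverts to the prior far from the data.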
Citations: 47
A tool for collecting provenance data in social media
Pritam Gundecha, Suhas Ranganath, Zhuo Feng, Huan Liu
In recent years, social media sites have provided a large amount of information. Recipients of such information need mechanisms to know more about the received information, including its provenance. Previous research has shown that some attributes related to the received information provide additional context, so that a recipient can assess the amount of value, trust, and validity to be placed in the received information. Personal attributes of a user, including name, location, education, ethnicity, gender, and political and religious affiliations, can be found in social media sites. In this paper, we present a novel web-based tool for collecting the attributes of interest associated with a particular social media user related to the received information. This tool provides a way to combine different attributes available at different social media sites into a single user profile. Using different types of Twitter users, we also evaluate the performance of the tool in terms of the number of attribute values collected, the validity of these values, and the total retrieval time.
Citations: 30
Learning geographical preferences for point-of-interest recommendation
B. Liu, Yanjie Fu, Zijun Yao, Hui Xiong
The problem of point-of-interest (POI) recommendation is to provide personalized recommendations of places of interest, such as restaurants, for mobile users. Owing to its complexity and its connection to location-based social networks (LBSNs), a user's decision process for choosing a POI is complex and can be influenced by various factors, such as user preferences, geographical influences, and user mobility behaviors. While there are some studies on POI recommendation, they lack an integrated analysis of the joint effect of multiple factors. To this end, in this paper, we propose a novel geographical probabilistic factor analysis framework which strategically takes these various factors into consideration. Specifically, this framework allows us to capture the geographical influences on a user's check-in behavior. Also, user mobility behaviors can be effectively exploited in the recommendation model. Moreover, the recommendation model can effectively make use of user check-in count data as implicit user feedback for modeling user preferences. Finally, experimental results on real-world LBSNs data show that the proposed recommendation method outperforms state-of-the-art latent factor models by a significant margin.
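To illustrate how check-in counts can serve as implicit feedback in a latent factor model, here is a minimal weighted matrix-factorization sketch: counts define a binary preference and a confidence weight that grows with the count. The toy count matrix, learning rate, and plain gradient descent are assumptions for illustration; the paper's geographical probabilistic factor model is considerably richer.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy check-in counts: rows = users, columns = POIs (implicit feedback)
C = np.array([[3, 0, 1, 0, 0],
              [0, 2, 0, 4, 0],
              [1, 0, 0, 0, 2],
              [0, 1, 3, 0, 0]], dtype=float)
P = (C > 0).astype(float)   # binary preference: visited or not
W = 1.0 + 2.0 * C           # confidence grows with check-in count

k = 2  # latent dimension
U = 0.1 * rng.normal(size=(4, k))   # user factors
V = 0.1 * rng.normal(size=(5, k))   # POI factors

def loss():
    """Confidence-weighted squared reconstruction error."""
    return float((W * (P - U @ V.T) ** 2).sum())

lr, reg = 0.005, 0.01
start = loss()
for _ in range(500):
    E = W * (P - U @ V.T)           # weighted residual
    gU = E @ V - reg * U            # gradients w.r.t. both factor matrices
    gV = E.T @ U - reg * V
    U = U + lr * gU
    V = V + lr * gV
end = loss()
```

Predicted scores `U @ V.T` can then rank unvisited POIs per user; a geographic term (e.g., distance decay) would multiply into that score in a location-aware variant.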
Citations: 430
Information cascade at group scale
Milad Eftekhar, Y. Ganjali, Nick Koudas
Identifying the k most influential individuals in a social network is a well-studied problem. The objective is to detect k individuals in a (social) network who will influence the maximum number of people if they are independently convinced to adopt a new strategy (product, idea, etc.). There are cases in real life, however, where we aim to instigate groups instead of individuals to trigger network diffusion. Such cases abound; e.g., billboards, TV commercials, and newspaper ads are utilized extensively to boost popularity and raise awareness. In this paper, we generalize the "influential nodes" problem; namely, we are interested in locating the most "influential groups" in a network. As the first paper to address this problem, we (1) propose a fine-grained model of information diffusion for the group-based problem, (2) show that the process is submodular and present an algorithm to determine the influential groups under this model (with a precise approximation bound), (3) propose a coarse-grained model that inspects the network at the group level (not individuals), significantly speeding up calculations for large networks, (4) show that the diffusion function we design here is submodular in the general case, and propose an approximation algorithm for this coarse-grained model, and finally, by conducting experiments on real datasets, (5) demonstrate that seeding members of selected groups to be the first adopters can broaden diffusion (when compared to the influential-individuals case). Moreover, we can identify these influential groups much faster (up to 12 million times speedup), delivering a practical solution to this problem.
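The submodularity noted in point (2) is what gives a simple greedy selection its classic (1 - 1/e) approximation guarantee. A minimal sketch of that greedy step, using set coverage as a stand-in for the paper's diffusion model (the group names and reach sets are made up for illustration):

```python
def greedy_group_selection(groups, k):
    """Greedily pick k groups by marginal gain in newly influenced users.

    groups: dict mapping group name -> set of users that group can influence.
    Returns (selected group names, union of influenced users).
    """
    selected, covered = [], set()
    remaining = dict(groups)
    for _ in range(k):
        if not remaining:
            break
        # marginal gain = users this group adds beyond those already covered
        best = max(remaining, key=lambda g: len(remaining[g] - covered))
        if not remaining[best] - covered:
            break  # no group adds anything new
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

channels = {"billboard": {1, 2, 3}, "tv": {3, 4}, "paper": {4, 5, 6, 7}}
picked, reach = greedy_group_selection(channels, k=2)
```

With these toy sets the greedy order is "paper" (4 new users) then "billboard" (3 new), covering all 7 users, illustrating how overlap between groups is what the marginal-gain computation accounts for.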
Citations: 38
LCARS: a location-content-aware recommender system
Hongzhi Yin, Yizhou Sun, B. Cui, Zhiting Hu, Ling Chen
Newly emerging location-based and event-based social network services provide us with a new platform for understanding users' preferences based on their activity history. A user can only visit a limited number of venues/events, and most of them are within a limited distance range, so the user-item matrix is very sparse, which creates a big challenge for traditional collaborative filtering-based recommender systems. The problem becomes more challenging when people travel to a new city where they have no activity history. In this paper, we propose LCARS, a location-content-aware recommender system that offers a particular user a set of venues (e.g., restaurants) or events (e.g., concerts and exhibitions) by giving consideration to both personal interest and local preference. This recommender system can facilitate people's travel not only near the area in which they live, but also in a city that is new to them. Specifically, LCARS consists of two components: offline modeling and online recommendation. The offline modeling part, called LCA-LDA, is designed to learn the interest of each individual user and the local preference of each individual city by capturing item co-occurrence patterns and exploiting item contents. The online recommendation part automatically combines the learnt interest of the querying user and the local preference of the querying city to produce the top-k recommendations. To speed up this online process, a scalable query processing technique is developed by extending the classic Threshold Algorithm (TA). We evaluate the performance of our recommender system on two large-scale real data sets, DoubanEvent and Foursquare. The results show the superiority of LCARS, in terms of both effectiveness and efficiency, in recommending spatial items for users, especially when they travel to new cities.
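The classic Threshold Algorithm that LCARS extends works by scanning each ranked list in parallel, resolving each newly seen item's full score by random access, and stopping once k items score at least the threshold formed by the last scores seen under sorted access. A minimal sketch; the two toy score dicts stand in for a user's learnt interest and a city's local preference, and the sum aggregation is an assumption:

```python
def threshold_topk(lists, k):
    """Fagin's Threshold Algorithm over per-criterion score dicts.

    lists: list of dicts mapping item -> score, one per ranking criterion.
    Aggregate score is the sum across criteria; returns the top-k items.
    """
    sorted_lists = [sorted(d.items(), key=lambda kv: kv[1], reverse=True)
                    for d in lists]
    seen = {}
    top = []
    for depth in range(max(len(sl) for sl in sorted_lists)):
        threshold = 0.0
        for sl in sorted_lists:
            if depth < len(sl):
                item, score = sl[depth]
                threshold += score  # best possible score of any unseen item
                if item not in seen:
                    # random access: look the item up in every list
                    seen[item] = sum(d.get(item, 0.0) for d in lists)
        top = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break  # no unseen item can beat the current k-th score
    return [item for item, _ in top]

interest = {"sushi": 0.9, "cafe": 0.5, "museum": 0.4}
local = {"cafe": 0.6, "museum": 0.5, "sushi": 0.1}
result = threshold_topk([interest, local], k=2)
```

The early-stopping condition is why TA can return exact top-k answers without scoring every candidate, which is the property a scalable online recommender needs.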
Citations: 354
Algorithmic techniques for modeling and mining large graphs (AMAzING)
A. Frieze, A. Gionis, Charalampos E. Tsourakakis
Network science has emerged over recent years as an interdisciplinary area spanning traditional domains including mathematics, computer science, sociology, biology, and economics. Since complexity in social, biological, and economic systems, and more generally in complex systems, arises through pairwise interactions, there is a surging interest in understanding networks. In this tutorial, we will provide an in-depth presentation of the most popular random-graph models used for modeling real-world networks. We will then discuss efficient algorithmic techniques for mining large graphs, with emphasis on the problems of extracting graph sparsifiers, partitioning graphs into densely connected components, and finding dense subgraphs. We will motivate the problems we discuss, and the algorithms we present, with real-world applications. Our aim is to survey important results in the areas of modeling and mining large graphs, to uncover the intuition behind the key ideas, and to present future research directions.
Citations: 4
The bang for the buck: fair competitive viral marketing from the host perspective
Wei Lu, F. Bonchi, Amit Goyal, L. Lakshmanan
The key algorithmic problem in viral marketing is to identify a set of influential users (called seeds) in a social network who, when convinced to adopt a product, will influence other users in the network, leading to a large number of adoptions. When two or more players compete with similar products on the same network, we talk about competitive viral marketing, which so far has been studied exclusively from the perspective of one of the competing players. In this paper we propose and study the novel problem of competitive viral marketing from the perspective of the host, i.e., the owner of the social network platform. The host sells viral marketing campaigns as a service to its customers, keeping control of the selection of seeds. Each company specifies its budget and the host allocates the seeds accordingly. From the host's perspective, it is important not only to choose the seeds to maximize the collective expected spread, but also to assign seeds to companies so that the "bang for the buck" is nearly identical for all companies, which we formalize as the fair seed allocation problem. We propose a new propagation model capturing the competitive nature of viral marketing. Our model is intuitive and retains the desired properties of monotonicity and submodularity. We show that the fair seed allocation problem is NP-hard, and develop an efficient algorithm called Needy Greedy. We run experiments on three real-world social networks, showing that our algorithm is effective and scalable.
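To illustrate the fairness objective (this is a toy stand-in, not the paper's actual Needy Greedy algorithm), the sketch below hands the next-best seed to the company that has so far gained the least spread per unit of budget. It assumes seed spreads are precomputed and additive, which ignores the competitive diffusion the paper models; all seed and company names are hypothetical.

```python
def allocate_seeds(seed_spreads, budgets):
    """Allocate seeds so each company's spread-per-budget stays balanced.

    seed_spreads: dict seed -> estimated marginal spread (assumed given).
    budgets: dict company -> number of seeds that company has purchased.
    Returns (allocation dict, total spread gained per company).
    """
    alloc = {c: [] for c in budgets}
    gained = {c: 0.0 for c in budgets}
    # hand out seeds from best to worst
    for seed in sorted(seed_spreads, key=seed_spreads.get, reverse=True):
        open_companies = [c for c in budgets if len(alloc[c]) < budgets[c]]
        if not open_companies:
            break
        # "neediest" = lowest bang for the buck so far
        neediest = min(open_companies, key=lambda c: gained[c] / budgets[c])
        alloc[neediest].append(seed)
        gained[neediest] += seed_spreads[seed]
    return alloc, gained

spreads = {"s1": 10.0, "s2": 8.0, "s3": 6.0, "s4": 4.0}
alloc, gained = allocate_seeds(spreads, {"A": 2, "B": 2})
```

On this toy instance the allocation ends perfectly balanced (both companies gain 14.0), showing the effect the fairness criterion aims for.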
Citations: 80
Understanding Twitter data with TweetXplorer
Fred Morstatter, Shamanth Kumar, Huan Liu, Ross Maciejewski
In the era of big data, it is increasingly difficult for an analyst to extract meaningful knowledge from a sea of information. We present TweetXplorer, a system that enables analysts who have little information about an event to gain knowledge through the use of effective visualization techniques. Using tweets collected during Hurricane Sandy as an example, we lead the reader through a workflow that exhibits the functionality of the system.
Citations: 76
Multi-source learning with block-wise missing data for Alzheimer's disease prediction
Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye
With the advances and increasing sophistication of data collection techniques, many applications face large amounts of data collected from multiple heterogeneous sources. For example, in the study of Alzheimer's Disease (AD), different types of measurements such as neuroimages, gene/protein expression data, and genetic data are often collected and analyzed together for improved predictive power. Joint learning from multiple data sources is believed to be beneficial, as different sources may contain complementary information, and feature pruning and data-source selection are critical for learning interpretable models from high-dimensional data. Very often the collected data comes with block-wise missing entries; for example, a patient without an MRI scan has no information in the MRI data block, making his/her overall record incomplete. There has been growing interest in the data mining community in extending traditional techniques for single-source complete data analysis to the study of multi-source incomplete data. The key challenge is how to effectively integrate information from multiple heterogeneous sources in the presence of block-wise missing data. In this paper we first investigate the situation of complete data and present a unified "bi-level" learning model for multi-source data. We then give a natural extension of this model to the more challenging case of incomplete data. Our major contributions are threefold: (1) the proposed models handle both feature-level and source-level analysis in a unified formulation and include several existing feature learning approaches as special cases; (2) the model for incomplete data avoids direct imputation of the missing elements and thus provides superior performance; moreover, it can be easily generalized to other applications with block-wise missing data sources; (3) efficient optimization algorithms are presented for both the complete and incomplete models. We have performed comprehensive evaluations of the proposed models on the application of AD diagnosis. Our proposed models compare favorably against existing approaches.
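The "no direct imputation" idea in contribution (2) can be illustrated with a toy sketch: instead of filling in a patient's missing MRI block, samples are grouped by their pattern of available sources, and a simple linear classifier is fit per pattern on exactly the features that pattern observes. This is a hedged illustration of the general strategy only, not the authors' bi-level model; the hinge-loss SGD trainer and the source names (`'MRI'`, `'gene'`) are stand-ins.

```python
import random

def train_pattern_models(samples, epochs=200, lr=0.1):
    """samples: list of (features_by_source: dict, label) pairs, label in {+1, -1}.
    A missing source block is simply absent from the dict -- no imputation."""
    groups = {}
    for feats, y in samples:
        pattern = tuple(sorted(feats))            # e.g. ('MRI', 'gene')
        groups.setdefault(pattern, []).append((feats, y))
    models = {}
    for pattern, rows in groups.items():
        dim = sum(len(rows[0][0][s]) for s in pattern)
        w = [0.0] * (dim + 1)                     # last weight is the bias
        rng = random.Random(0)
        for _ in range(epochs):
            feats, y = rng.choice(rows)
            x = [v for s in pattern for v in feats[s]] + [1.0]
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0:
                # hinge-loss SGD step on the features this pattern observes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        models[pattern] = w
    return models

def predict(models, feats):
    """Route a sample to the model trained on its own availability pattern."""
    pattern = tuple(sorted(feats))
    w = models[pattern]
    x = [v for s in pattern for v in feats[s]] + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

At prediction time a record with only genetic data is scored by the genetic-only model, so no block is ever invented for it.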
Shuo Xiang, Lei Yuan, Wei Fan, Yalin Wang, P. Thompson, Jieping Ye. "Multi-source learning with block-wise missing data for Alzheimer's disease prediction." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013-08-11. doi:10.1145/2487575.2487594
Citations: 75
Indexed block coordinate descent for large-scale linear classification with limited memory
I. E. Yen, Chun-Fu Chang, Ting-Wei Lin, Shan-Wei Lin, Shou-De Lin
Linear classification has achieved complexity linear in the data size. In many applications, however, the data contain a large number of samples that do not help improve the quality of the model yet still cost substantial I/O and memory to process. In this paper, we show how a Block Coordinate Descent method based on a Nearest-Neighbor Index can significantly reduce such cost when learning a dual-sparse model. In particular, we employ a truncated loss function to induce a series of convex programs with superior dual sparsity, and solve each dual using Indexed Block Coordinate Descent, which uses Approximate Nearest Neighbor (ANN) search to select active dual variables without incurring I/O cost on irrelevant samples. We prove that, despite the bias and weak guarantees of ANN queries, the proposed algorithm converges globally to the solution defined on the entire dataset, with sublinear complexity per iteration. Experiments in both sufficient- and limited-memory conditions show that the proposed approach learns many times faster than other state-of-the-art solvers without sacrificing accuracy.
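The selection mechanism can be sketched in miniature: a dual coordinate descent solver for a linear SVM in which, at each step, the next active dual variable is the sample currently violating its margin the most. This is a hedged sketch, not the authors' implementation; the brute-force scan below stands in for the ANN index the paper uses to make this search cheap and approximate, and the update rule is textbook dual coordinate descent.

```python
def dcd_svm(X, y, C=1.0, iters=200):
    """Dual coordinate descent for a hinge-loss linear SVM.
    X: list of feature lists, y: labels in {+1, -1}."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    qii = [sum(v * v for v in x) for x in X]      # diagonal of the Gram matrix
    for _ in range(iters):
        # "Index" step: pick the dual variable whose sample currently
        # violates the margin the most. An ANN index over the samples would
        # approximate this search without scanning every sample.
        scores = [y[i] * sum(wj * xj for wj, xj in zip(w, X[i])) for i in range(n)]
        i = min(range(n), key=lambda k: scores[k] if alpha[k] < C else 1.0)
        g = scores[i] - 1.0                       # gradient of the dual objective
        if g >= 0 and alpha[i] == 0.0:
            break                                 # no margin-violating sample left
        new_a = min(max(alpha[i] - g / qii[i], 0.0), C)
        delta = new_a - alpha[i]
        alpha[i] = new_a
        # Maintain w = sum_i alpha_i * y_i * x_i incrementally.
        w = [wj + delta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w
```

Because only the selected sample's features are touched per update, samples the index never returns contribute no per-iteration cost, which is the point of indexing the dual variables.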
I. E. Yen, Chun-Fu Chang, Ting-Wei Lin, Shan-Wei Lin, Shou-De Lin. "Indexed block coordinate descent for large-scale linear classification with limited memory." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013-08-11. doi:10.1145/2487575.2487626
Citations: 5