
Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest Publications

Hadoop: a view from the trenches
M. Bhandarkar
From its beginnings as a framework for building web crawlers for small-scale search engines to being one of the most promising technologies for building datacenter-scale distributed computing and storage platforms, Apache Hadoop has come far in the last seven years. In this talk I will reminisce about the early days of Hadoop, give an overview of the current state of the Hadoop ecosystem, and present some real-world use cases of this open source platform. I will conclude with some crystal gazing into the future of Hadoop and associated technologies.
DOI: 10.1145/2487575.2491128 (published 2013-08-11)
Citations: 4
Inferring social roles and statuses in social networks
Yuchen Zhao, Guan Wang, Philip S. Yu, Shaobo Liu, Simon Zhang
Users in online social networks play a variety of social roles and statuses. For example, users on Twitter can act as advertisers, content contributors, information receivers, etc.; users on LinkedIn can hold different professional roles, such as engineer, salesperson, and recruiter. Previous research mainly focuses on using categorical and textual information to predict the attributes of users. However, this cannot be applied to a large number of users in real social networks, since much of that information is missing, outdated, or non-standard. In this paper, we investigate the social roles and statuses that people assume in online social networks from the perspective of network structure, since the uniqueness of social networks lies in connecting people. We quantitatively analyze a number of key social principles and theories that correlate with social roles and statuses. We systematically study how network characteristics reflect the social situations of users in an online society. We discover patterns of homophily, the tendency of users to connect with users of similar social roles and statuses. In addition, we observe that different factors in social theories influence the social role/status of an individual user to varying extents, since these social principles represent different aspects of the network. We then introduce an optimization framework based on Factor Conditioning Symmetry, and we propose a probabilistic model that integrates this optimization framework over local structural information as well as network influence to infer the unknown social roles and statuses of online users. We present experimental results to show the effectiveness of the inference.
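The homophily pattern described above can be quantified with a simple edge-level statistic. The sketch below is only an illustration (not the paper's probabilistic model): it compares the observed fraction of same-role edges against the chance baseline implied by the role frequencies.

```python
from collections import Counter

def role_homophily(edges, role):
    """Observed fraction of edges whose two endpoints share a social role."""
    same = sum(1 for u, v in edges if role[u] == role[v])
    return same / len(edges)

def chance_baseline(role):
    """Expected same-role fraction if endpoints were paired independently
    of structure: the sum of squared role frequencies."""
    counts = Counter(role.values())
    n = len(role)
    return sum((c / n) ** 2 for c in counts.values())
```

An observed ratio clearly above the baseline indicates homophily with respect to that role attribute.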
DOI: 10.1145/2487575.2487597 (published 2013-08-11)
Citations: 73
Heat pump detection from coarse grained smart meter data with positive and unlabeled learning
Hongliang Fei, Younghun Kim, S. Sahu, M. Naphade, Sanjay K. Mamidipalli, John Hutchinson
Recent advances in smart metering technology enable utility companies to access a tremendous amount of smart meter data, from which they are eager to gain more insight about their customers. In this paper, we aim to detect electric heat pumps from coarse grained smart meter data for a heat pump marketing campaign. However, appliance detection is a challenging task, especially given very low-granularity and partially labeled, or even unlabeled, data. Traditional methods install either a high-granularity smart meter or a sensor at every appliance, which is either too expensive or requires technical expertise. We propose a novel approach to detecting heat pumps that utilizes low-granularity smart meter data, prior sales data, and weather data. In particular, motivated by the characteristics of heat pump consumption patterns, we extract novel features that are highly relevant to heat pump usage from smart meter data and weather data. Under the constraint that only a subset of heat pump users is known, we formalize the problem as positive and unlabeled (PU) classification and apply a biased Support Vector Machine (BSVM) to the extracted features. Our empirical study on a real-world data set demonstrates the effectiveness of our method. Furthermore, our method has been deployed in a real-life setting where the partner electric company runs a targeted campaign for 292,496 customers. Based on the initial feedback, our detection algorithm can successfully detect a substantial number of non-heat-pump users who had been identified as heat pump users by the prior algorithm the company used.
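The biased SVM step can be illustrated with a small self-contained sketch: a linear classifier trained by subgradient descent on an asymmetric hinge loss, where labeled heat-pump positives carry a much larger misclassification cost than unlabeled meters. The feature values, cost ratio, and training schedule below are illustrative assumptions, not the paper's configuration.

```python
def train_biased_svm(X, y, c_pos=10.0, c_unl=1.0, lr=0.01, epochs=500, lam=0.01):
    """Linear SVM trained by subgradient descent on an asymmetric hinge loss:
    labeled positives (y=+1) cost c_pos per margin violation, while unlabeled
    examples (treated as negatives, y=-1) cost only c_unl."""
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            cost = c_pos if yi == 1 else c_unl
            if yi * score < 1:  # hinge active: step on loss plus regularizer
                w = [wj - lr * (lam * wj - cost * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * cost * yi
            else:               # hinge inactive: only L2 shrinkage
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score > 0 else -1
```

On toy two-feature data (say, normalized winter consumption and temperature sensitivity), the high positive cost keeps the labeled heat-pump examples on the positive side even though the unlabeled pool may hide true positives.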
DOI: 10.1145/2487575.2488203 (published 2013-08-11)
Citations: 34
Diversity maximization under matroid constraints
Z. Abbassi, V. Mirrokni, Mayur Thakur
Aggregator websites typically present documents in the form of representative clusters. In order for users to get a broader perspective, it is important to deliver a diversified set of representative documents in those clusters. One approach to diversification is to maximize the average dissimilarity among documents. Another way to capture diversity is to avoid showing several documents from the same category (e.g. from the same news channel). We combine the above two diversification concepts by modeling the latter approach as a (partition) matroid constraint, and study diversity maximization problems under matroid constraints. We present the first constant-factor approximation algorithm for this problem, using a new technique. Our local search 0.5-approximation algorithm is also the first constant-factor approximation for the max-dispersion problem under matroid constraints. Our combinatorial proof technique for maximizing diversity under matroid constraints uses the existence of a family of Latin squares which may also be of independent interest. In order to apply these diversity maximization algorithms in the context of aggregator websites and as a preprocessing step for our diversity maximization tool, we develop greedy clustering algorithms that maximize weighted coverage of a predefined set of topics. Our algorithms are based on computing a set of cluster centers, where clusters are formed around them. We show the better performance of our algorithms for diversity and coverage maximization by running experiments on real (Twitter) and synthetic data in the context of real-time search over micro-posts. Finally we perform a user study validating our algorithms and diversity metrics.
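The local search idea can be sketched for the simplest partition-matroid case, where each category may contribute one document and the objective is the total pairwise distance of the selection. The swap rule below is only a minimal illustration; the paper's 0.5-approximation analysis covers general matroids.

```python
from itertools import combinations

def dispersion(selection, dist):
    """Sum of pairwise distances within the selected set."""
    return sum(dist(a, b) for a, b in combinations(selection, 2))

def local_search_diverse(items_by_cat, dist):
    """Pick one item per category (a base of the partition matroid) and
    improve total pairwise distance by single-swap local search."""
    S = {cat: items[0] for cat, items in items_by_cat.items()}
    improved = True
    while improved:
        improved = False
        for cat, items in items_by_cat.items():
            best_val = dispersion(list(S.values()), dist)
            for cand in items:
                trial = dict(S)
                trial[cat] = cand  # swap within the same category keeps feasibility
                val = dispersion(list(trial.values()), dist)
                if val > best_val + 1e-12:
                    S, best_val, improved = trial, val, True
    return set(S.values())
```

Because every swap replaces an item with another from the same category, every intermediate solution remains a feasible matroid base.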
DOI: 10.1145/2487575.2487636 (published 2013-08-11)
Citations: 97
Cross-task crowdsourcing
Kaixiang Mo, Erheng Zhong, Qiang Yang
Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data, because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both user ability and question difficulty based on the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimations. However, in practice, the budget provided by customers for a given target task may be limited, and hence each question can be presented to only a few users, each of whom can answer only a few questions. This data sparsity problem can cause previous approaches to perform poorly, due to overfitting on rare data, and eventually damage data veracity. Fortunately, in real-world applications, users answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks, and thus data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by treating the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves accuracy over several state-of-the-art non-transfer-learning approaches under a very limited budget in various labeling tasks.
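A drastically simplified version of the cross-task idea (not the TLC hierarchical Bayesian model) is to estimate each worker's ability from auxiliary tasks that have known answers, then reuse those estimates as vote weights in the target task. The Laplace smoothing and the 0.5 fallback for workers without history are illustrative choices.

```python
from collections import defaultdict

def estimate_abilities(history):
    """history: {worker: [(answer, truth), ...]} from auxiliary tasks.
    Ability = Laplace-smoothed accuracy (+1 correct / +2 total)."""
    ability = {}
    for w, pairs in history.items():
        correct = sum(1 for a, t in pairs if a == t)
        ability[w] = (correct + 1) / (len(pairs) + 2)
    return ability

def weighted_vote(answers, ability):
    """answers: {worker: label} for one target question. Workers unseen in
    the history fall back to a neutral 0.5 ability."""
    score = defaultdict(float)
    for w, label in answers.items():
        score[label] += ability.get(w, 0.5)
    return max(score, key=score.get)
```

A reliable worker's vote can then outweigh two unreliable or unknown workers, which is the effect the transfer of ability estimates is meant to achieve.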
DOI: 10.1145/2487575.2487593 (published 2013-08-11)
Citations: 50
A unified search federation system based on online user feedback
Luo Jie, Sudarshan Lamkhede, Rochit Sapra, Evans Hsu, Helen Song, Yi Chang
Today's popular web search engines expand the search process beyond crawled web pages to specialized corpora ("verticals") such as images, videos, news, local, sports, finance, and shopping, each with its own specialized search engine. Search federation deals with the problems of selecting which search engines to query and merging their results into a single result set. Despite a few recent advances, the problem is still very challenging. First, due to the heterogeneous nature of different verticals, how the system merges vertical results with web documents to serve the user's information need is still an open problem. Moreover, the scale of the search engine and the increasing number of vertical properties require a solution that is efficient and scalable. In this paper, we propose a unified framework for the search federation problem. We model search federation as a contextual bandit problem. The system uses reward as a proxy for user satisfaction. Given a query, our system predicts the expected reward for each vertical, then organizes the search result page (SERP) in a way that maximizes the total reward. Instead of relying on human judges, our system leverages implicit user feedback to learn the model. The method is efficient to implement and can be applied to verticals of different natures. We have successfully deployed the system to three different markets, and it handles multiple verticals in each market. The system is now serving hundreds of millions of queries live each day, and has improved user metrics considerably.
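The contextual-bandit loop can be sketched with an epsilon-greedy policy over per-(context, vertical) reward averages. The production system presumably uses far richer context features and reward models, so treat the class below as a minimal illustration; the class name, context keys, and 0.5 prior are assumptions.

```python
import random

class EpsilonGreedyFederator:
    """Keeps a running mean of observed reward (e.g. clicks) per
    (context, vertical) pair; explores a random vertical with probability eps."""

    def __init__(self, verticals, eps=0.1, seed=0):
        self.verticals = verticals
        self.eps = eps
        self.rng = random.Random(seed)
        self.total = {}  # (context, vertical) -> summed reward
        self.count = {}  # (context, vertical) -> number of observations

    def estimate(self, context, v):
        c = self.count.get((context, v), 0)
        # Unseen arms get a neutral 0.5 prior so they are still tried.
        return self.total.get((context, v), 0.0) / c if c else 0.5

    def choose(self, context):
        if self.rng.random() < self.eps:
            return self.rng.choice(self.verticals)  # explore
        return max(self.verticals, key=lambda v: self.estimate(context, v))

    def update(self, context, v, reward):
        self.total[(context, v)] = self.total.get((context, v), 0.0) + reward
        self.count[(context, v)] = self.count.get((context, v), 0) + 1
```

After a few rounds of implicit feedback, the policy converges on the vertical whose results users actually engage with for that query context.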
DOI: 10.1145/2487575.2488198 (published 2013-08-11)
Citations: 20
Multi-space probabilistic sequence modeling
Shuo Chen, Jiexun Xu, T. Joachims
Learning algorithms that embed objects into Euclidean space have become the methods of choice for a wide range of problems, ranging from recommendation and image search to playlist prediction and language modeling. Probabilistic embedding methods provide elegant approaches to these problems, but can be expensive to train and store as a large monolithic model. In this paper, we propose a method that trains not one monolithic model, but multiple local embeddings for a class of pairwise conditional models especially suited for sequence and co-occurrence modeling. We show that computation and memory for training these multi-space models can be efficiently parallelized over many nodes of a cluster. Focusing on sequence modeling for music playlists, we show that the method substantially speeds up training while maintaining high model quality.
DOI: 10.1145/2487575.2487632 (published 2013-08-11)
Citations: 29
FISM: factored item similarity models for top-N recommender systems
Santosh Kabbur, Xia Ning, G. Karypis
The effectiveness of existing top-N recommendation methods decreases as the sparsity of the datasets increases. To alleviate this problem, we present an item-based method for generating top-N recommendations that learns the item-item similarity matrix as the product of two low dimensional latent factor matrices. These matrices are learned using a structural equation modeling approach, wherein the value being estimated is not used for its own estimation. A comprehensive set of experiments on multiple datasets at three different sparsity levels indicates that the proposed methods can handle sparse datasets effectively and outperform other state-of-the-art top-N recommendation methods. The experimental results also show that the relative performance gains compared to competing methods increase as the data gets sparser.
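The scoring rule at the heart of FISM can be sketched directly from the abstract: the item-item similarity sim(j, i) is factored as a dot product of two low-dimensional embeddings P[j] and Q[i], and a user's score for an item aggregates similarity to the items the user has rated. The bias term and the normalization exponent alpha follow the common FISM formulation, but the details below are a sketch, not the paper's exact model.

```python
def fism_score(user_items, item, P, Q, bias, alpha=0.5):
    """FISM-style score for `item`: bias plus the normalized sum of factored
    similarities P[j] . Q[item] over the user's rated items j (excluding
    `item` itself, so the value is not used for its own estimation)."""
    rated = [j for j in user_items if j != item]
    if not rated:
        return bias.get(item, 0.0)
    agg = sum(
        sum(pj * qi for pj, qi in zip(P[j], Q[item]))
        for j in rated
    )
    return bias.get(item, 0.0) + agg / (len(rated) ** alpha)
```

Because similarities are read off the embeddings rather than a full item-item matrix, items that never co-occur in the sparse training data can still receive nonzero similarity.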
DOI: 10.1145/2487575.2487589 (published 2013-08-11)
Citations: 622
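The factored item-item similarity described in the abstract can be sketched in a few lines. This is a minimal illustration of a FISM-style scoring rule, not the authors' implementation: the item bias term, the agreement exponent `alpha`, and the random toy factors are assumptions for demonstration. The key point from the abstract is visible in the code: the similarity sim(j, i) is the product `P[j] @ Q[i]` of two low-rank factors, and item i is excluded from its own estimate (the structural equation modeling constraint).

```python
import numpy as np

def fism_score(user_items, i, P, Q, b_item, alpha=0.5):
    """Score item i for a user who has consumed `user_items`.

    sim(j, i) is factored as P[j] @ Q[i]; item i itself is excluded
    from the aggregation so the estimated value is never used for
    its own estimation.
    """
    rated = [j for j in user_items if j != i]  # exclude i itself
    if not rated:
        return b_item[i]
    agg = sum(P[j] @ Q[i] for j in rated)
    return b_item[i] + len(rated) ** (-alpha) * agg

# toy example: 4 items, rank-2 latent factors
rng = np.random.default_rng(0)
P = rng.normal(size=(4, 2))
Q = rng.normal(size=(4, 2))
b = np.zeros(4)
s = fism_score({0, 2}, 1, P, Q, b)
```

Because the similarity matrix is never materialized as a dense n-by-n table, the model stays compact even when the item catalog is large, which is what lets it cope with the sparse regimes the abstract emphasizes.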
Synthetic review spamming and defense
Huan Sun, Alex Morales, Xifeng Yan
Online reviews have been widely adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews has become a profitable business and a big threat. In this paper, we introduce a very simple but powerful review spamming technique that can easily defeat existing feature-based detection algorithms. It uses one truthful review as a template, and replaces its sentences with those from other reviews in a repository. Fake reviews generated by this mechanism are extremely hard to detect: both state-of-the-art computational approaches and human readers incur an error rate of 35%-48%, only slightly better than a random guess. While detecting such fake reviews is challenging, we have made solid progress in suppressing them. We develop a novel defense method that leverages the difference in semantic flow between synthetic and truthful reviews, which reduces the detection error rate to approximately 22%, a significant improvement over existing approaches. Nevertheless, further decreasing the error rate remains a challenging research task. Synthetic Review Spamming Demo: www.cs.ucsb.edu/~alex_morales/reviewspam/
"Synthetic review spamming and defense" — Huan Sun, Alex Morales, Xifeng Yan. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013-08-11. DOI: 10.1145/2487575.2487688
Citations: 19
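The template-replacement mechanism from the abstract can be sketched roughly as below. The abstract only says that each sentence of a truthful template is replaced with a sentence drawn from other reviews; the selection criterion used here (nearest match by `difflib.SequenceMatcher` ratio) is an assumption, and `spam_review`, `template`, and `repo` are hypothetical names, not the authors' code.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    """Crude sentence splitter on terminal punctuation."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

def spam_review(template, repository):
    """Replace every sentence of a truthful template review with the
    most similar sentence drawn from other reviews in a repository
    (assumed nearest-match selection)."""
    pool = [s for rev in repository for s in split_sentences(rev)]
    out = []
    for sent in split_sentences(template):
        best = max(pool, key=lambda c: SequenceMatcher(
            None, sent.lower(), c.lower()).ratio())
        out.append(best)
    return ' '.join(out)

template = "The battery lasts all day. The screen is bright."
repo = ["Battery life lasts me all day long. Shipping was fast.",
        "The display is bright and sharp. I love the camera."]
fake = spam_review(template, repo)
```

Because every sentence in the output was genuinely written by a human, sentence-level stylistic features look authentic; only the flow between sentences is synthetic, which is exactly the signal the paper's defense method targets.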
An integrated framework for suicide risk prediction
T. Tran, Dinh Q. Phung, Wei Luo, R. Harvey, M. Berk, S. Venkatesh
Suicide is a major concern in society. Despite great attention from the community, and substantive medico-legal implications, there has been no satisfactory method that can reliably predict future attempted or completed suicide. We present an integrated machine learning framework to tackle this challenge. Our proposed framework consists of a novel feature extraction scheme, an embedded feature selection process, a set of risk classifiers and, finally, a risk calibration procedure. For temporal feature extraction, we cast the patient's clinical history into a temporal image to which a bank of one-side filters is applied. The responses are then partly transformed into mid-level features and selected in an l1-norm framework under extreme value theory. A set of probabilistic ordinal risk classifiers is then applied to compute the risk probabilities and further re-rank the features. Finally, the predicted risks are calibrated. Together with our Australian partner, we performed a comprehensive study on data collected for a mental health cohort, and the experiments validate that our proposed framework outperforms risk assessment instruments used by medical practitioners.
"An integrated framework for suicide risk prediction" — T. Tran, Dinh Q. Phung, Wei Luo, R. Harvey, M. Berk, S. Venkatesh. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013-08-11. DOI: 10.1145/2487575.2488196
Citations: 44
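The "temporal image plus one-side filters" idea from the abstract can be illustrated with a minimal sketch. Rectangular backward-looking windows are an assumed stand-in for the paper's actual filter bank, and all names, event types, and window widths below are hypothetical. The essential property shown is one-sidedness: every filter looks only at the past relative to the assessment day, never at the future.

```python
import numpy as np

def one_side_filter_responses(history, widths=(7, 30, 90)):
    """history: (n_event_types, n_days) binary matrix -- the 'temporal
    image' of a patient's clinical record, one row per event type.

    For each event type, count occurrences inside a backward-looking
    window of each width ending at the assessment day (last column),
    i.e. a bank of one-sided rectangular filters.
    """
    n_types, n_days = history.shape
    feats = []
    for w in widths:
        w = min(w, n_days)
        feats.append(history[:, n_days - w:].sum(axis=1))
    return np.concatenate(feats)  # length n_types * len(widths)

# toy: 2 event types observed over 100 days
hist = np.zeros((2, 100))
hist[0, 95] = 1   # recent event of type 0 (caught by every window)
hist[1, 10] = 1   # old event of type 1 (caught only by the 90-day window)
f = one_side_filter_responses(hist)
```

Multiple window widths let the downstream classifiers weigh recent events against long-term history, which is one plausible reading of why a bank of filters, rather than a single window, is applied.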