From its beginnings as a framework for building web crawlers for small-scale search engines to being one of the most promising technologies for building datacenter-scale distributed computing and storage platforms, Apache Hadoop has come far in the last seven years. In this talk I will reminisce about the early days of Hadoop, give an overview of the current state of the Hadoop ecosystem, and describe some real-world use cases of this open source platform. I will conclude with some crystal gazing into the future of Hadoop and associated technologies.
{"title":"Hadoop: a view from the trenches","authors":"M. Bhandarkar","doi":"10.1145/2487575.2491128","DOIUrl":"https://doi.org/10.1145/2487575.2491128","url":null,"abstract":"From it's beginnings as a framework for building web crawlers for small-scale search engines to being one of the most promising technologies for building datacenter-scale distributed computing and storage platforms, Apache Hadoop has come far in the last seven years. In this talk I will reminisce about the early days of Hadoop, and will give an overview of the current state of the Hadoop ecosystem, and some real-world use cases of this open source platform. I will conclude with some crystal gazing in the future of Hadoop and associated technologies.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76085604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuchen Zhao, Guan Wang, Philip S. Yu, Shaobo Liu, Simon Zhang
Users in online social networks take on a variety of social roles and statuses. For example, users on Twitter can act as advertisers, content contributors, information receivers, etc.; users on LinkedIn can hold different professional roles, such as engineer, salesperson, and recruiter. Previous research has mainly focused on using categorical and textual information to predict user attributes. However, such approaches cannot be applied to a large number of users in real social networks, since much of this information is missing, outdated, or non-standard. In this paper, we investigate the social roles and statuses that people assume in online social networks from the perspective of network structure, since the defining property of social networks is that they connect people. We quantitatively analyze a number of key social principles and theories that correlate with social roles and statuses, and systematically study how network characteristics reflect the social situations of users in an online society. We discover patterns of homophily, the tendency of users to connect with others who have similar social roles and statuses. In addition, we observe that different factors in social theories influence the social role/status of an individual user to varying extents, since these social principles represent different aspects of the network. We then introduce an optimization framework based on Factor Conditioning Symmetry, and propose a probabilistic model that integrates this optimization framework over local structural information with network influence to infer the unknown social roles and statuses of online users. We present experimental results showing the effectiveness of the inference.
{"title":"Inferring social roles and statuses in social networks","authors":"Yuchen Zhao, Guan Wang, Philip S. Yu, Shaobo Liu, Simon Zhang","doi":"10.1145/2487575.2487597","DOIUrl":"https://doi.org/10.1145/2487575.2487597","url":null,"abstract":"Users in online social networks play a variety of social roles and statuses. For example, users in Twitter can be represented as advertiser, content contributor, information receiver, etc; users in Linkedin can be in different professional roles, such as engineer, salesperson and recruiter. Previous research work mainly focuses on using categorical and textual information to predict the attributes of users. However, it cannot be applied to a large number of users in real social networks, since much of such information is missing, outdated and non-standard. In this paper, we investigate the social roles and statuses that people act in online social networks in the perspective of network structures, since the uniqueness of social networks is connecting people. We quantitatively analyze a number of key social principles and theories that correlate with social roles and statuses. We systematically study how the network characteristics reflect the social situations of users in an online society. We discover patterns of homophily, the tendency of users to connect with users with similar social roles and statuses. In addition, we observe that different factors in social theories influence the social role/status of an individual user to various extent, since these social principles represent different aspects of the network. We then introduce an optimization framework based on Factor Conditioning Symmetry, and we propose a probabilistic model to integrate the optimization framework on local structural information as well as network influence to infer the unknown social roles and statuses of online users. We will present experiment results to show the effectiveness of the inference.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74211678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongliang Fei, Younghun Kim, S. Sahu, M. Naphade, Sanjay K. Mamidipalli, John Hutchinson
Recent advances in smart metering technology give utility companies access to a tremendous amount of smart meter data, from which they are eager to gain more insight about their customers. In this paper, we aim to detect electric heat pumps from coarse-grained smart meter data for a heat pump marketing campaign. Appliance detection is a challenging task, however, especially given very low-granularity data that is only partially labeled or even unlabeled. Traditional methods install either a high-granularity smart meter or sensors at every appliance, which is either too expensive or requires technical expertise. We propose a novel approach to detect heat pumps that utilizes low-granularity smart meter data, prior sales data, and weather data. In particular, motivated by the characteristics of heat pump consumption patterns, we extract novel features that are highly relevant to heat pump usage from smart meter data and weather data. Under the constraint that only a subset of heat pump users is known, we formulate the problem as positive and unlabeled classification and apply a biased Support Vector Machine (BSVM) to the extracted features. Our empirical study on a real-world data set demonstrates the effectiveness of our method. Furthermore, our method has been deployed in a real-life setting where the partner electric company runs a targeted campaign for 292,496 customers. Based on the initial feedback, our detection algorithm can successfully identify a substantial number of non-heat-pump users who had been classified as heat pump users by the algorithm the company used previously.
{"title":"Heat pump detection from coarse grained smart meter data with positive and unlabeled learning","authors":"Hongliang Fei, Younghun Kim, S. Sahu, M. Naphade, Sanjay K. Mamidipalli, John Hutchinson","doi":"10.1145/2487575.2488203","DOIUrl":"https://doi.org/10.1145/2487575.2488203","url":null,"abstract":"Recent advances in smart metering technology enable utility companies to have access to tremendous amount of smart meter data, from which the utility companies are eager to gain more insight about their customers. In this paper, we aim to detect electric heat pumps from coarse grained smart meter data for a heat pump marketing campaign. However, appliance detection is a challenging task, especially given a very low granularity and partial labeled even unlabeled data. Traditional methods install either a high granularity smart meter or sensors at every appliance, which is either too expensive or requires technical expertise. We propose a novel approach to detect heat pumps that utilizes low granularity smart meter data, prior sales data and weather data. In particular, motivated by the characteristics of heat pump consumption pattern, we extract novel features that are highly relevant to heat pump usage from smart meter data and weather data. Under the constraint that only a subset of heat pump users are available, we formalize the problem into a positive and unlabeled data classification and apply biased Support Vector Machine (BSVM) to our extracted features. Our empirical study on a real-world data set demonstrates the effectiveness of our method. Furthermore, our method has been deployed in a real-life setting where the partner electric company runs a targeted campaign for 292,496 customers. Based on the initial feedback, our detection algorithm can successfully detect substantial number of non-heat pump users who were identified heat pump users with the prior algorithm the company had used.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81426326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aggregator websites typically present documents in the form of representative clusters. In order for users to get a broader perspective, it is important to deliver a diversified set of representative documents in those clusters. One approach to diversification is to maximize the average dissimilarity among documents. Another way to capture diversity is to avoid showing several documents from the same category (e.g., from the same news channel). We combine these two diversification concepts by modeling the latter approach as a (partition) matroid constraint, and study diversity maximization problems under matroid constraints. We present the first constant-factor approximation algorithm for this problem, using a new technique. Our local search 0.5-approximation algorithm is also the first constant-factor approximation for the max-dispersion problem under matroid constraints. Our combinatorial proof technique for maximizing diversity under matroid constraints uses the existence of a family of Latin squares, which may be of independent interest. To apply these diversity maximization algorithms in the context of aggregator websites, and as a preprocessing step for our diversity maximization tool, we develop greedy clustering algorithms that maximize weighted coverage of a predefined set of topics. Our algorithms are based on computing a set of cluster centers around which clusters are formed. We demonstrate the superior performance of our algorithms for diversity and coverage maximization through experiments on real (Twitter) and synthetic data in the context of real-time search over micro-posts. Finally, we perform a user study validating our algorithms and diversity metrics.
{"title":"Diversity maximization under matroid constraints","authors":"Z. Abbassi, V. Mirrokni, Mayur Thakur","doi":"10.1145/2487575.2487636","DOIUrl":"https://doi.org/10.1145/2487575.2487636","url":null,"abstract":"Aggregator websites typically present documents in the form of representative clusters. In order for users to get a broader perspective, it is important to deliver a diversified set of representative documents in those clusters. One approach to diversification is to maximize the average dissimilarity among documents. Another way to capture diversity is to avoid showing several documents from the same category (e.g. from the same news channel). We combine the above two diversification concepts by modeling the latter approach as a (partition) matroid constraint, and study diversity maximization problems under matroid constraints. We present the first constant-factor approximation algorithm for this problem, using a new technique. Our local search 0.5-approximation algorithm is also the first constant-factor approximation for the max-dispersion problem under matroid constraints. Our combinatorial proof technique for maximizing diversity under matroid constraints uses the existence of a family of Latin squares which may also be of independent interest. In order to apply these diversity maximization algorithms in the context of aggregator websites and as a preprocessing step for our diversity maximization tool, we develop greedy clustering algorithms that maximize weighted coverage of a predefined set of topics. Our algorithms are based on computing a set of cluster centers, where clusters are formed around them. We show the better performance of our algorithms for diversity and coverage maximization by running experiments on real (Twitter) and synthetic data in the context of real-time search over micro-posts. Finally we perform a user study validating our algorithms and diversity metrics.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81608109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data, because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both user ability and question difficulty from the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimates. However, in practice, budgets provided by customers for a given target task may be limited, and hence each question can be presented to only a few users and each user can answer only a few questions. This data sparsity can cause previous approaches to perform poorly due to overfitting on the scarce data, eventually damaging the data veracity. Fortunately, in real-world applications, users answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve the data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks, so data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by treating the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves accuracy over several state-of-the-art non-transfer-learning approaches under a very limited budget in various labeling tasks.
{"title":"Cross-task crowdsourcing","authors":"Kaixiang Mo, Erheng Zhong, Qiang Yang","doi":"10.1145/2487575.2487593","DOIUrl":"https://doi.org/10.1145/2487575.2487593","url":null,"abstract":"Crowdsourcing is an effective method for collecting labeled data for various data mining tasks. It is critical to ensure the veracity of the produced data because responses collected from different users may be noisy and unreliable. Previous works solve this veracity problem by estimating both the user ability and question difficulty based on the knowledge in each task individually. In this case, each single task needs large amounts of data to provide accurate estimations. However, in practice, budgets provided by customers for a given target task may be limited, and hence each question can be presented to only a few users where each user can answer only a few questions. This data sparsity problem can cause previous approaches to perform poorly due to the overfitting problem on rare data and eventually damage the data veracity. Fortunately, in real-world applications, users can answer questions from multiple historical tasks. For example, one can annotate images as well as label the sentiment of a given title. In this paper, we employ transfer learning, which borrows knowledge from auxiliary historical tasks to improve the data veracity in a given target task. The motivation is that users have stable characteristics across different crowdsourcing tasks and thus data from different tasks can be exploited collectively to estimate users' abilities in the target task. We propose a hierarchical Bayesian model, TLC (Transfer Learning for Crowdsourcing), to implement this idea by considering the overlapping users as a bridge. In addition, to avoid possible negative impact, TLC introduces task-specific factors to model task differences. The experimental results show that TLC significantly improves the accuracy over several state-of-the-art non-transfer-learning approaches under very limited budget in various labeling tasks.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85270911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luo Jie, Sudarshan Lamkhede, Rochit Sapra, Evans Hsu, Helen Song, Yi Chang
Today's popular web search engines expand the search process beyond crawled web pages to specialized corpora ("verticals") such as images, videos, news, local, sports, finance, and shopping, each with its own specialized search engine. Search federation deals with selecting which search engines to query and merging their results into a single result set. Despite a few recent advances, the problem is still very challenging. First, due to the heterogeneous nature of different verticals, how to merge vertical results with web documents to serve the user's information need remains an open problem. Moreover, the scale of the search engine and the increasing number of vertical properties require a solution that is efficient and scalable. In this paper, we propose a unified framework for the search federation problem. We model search federation as a contextual bandit problem, using reward as a proxy for user satisfaction. Given a query, our system predicts the expected reward for each vertical and then organizes the search result page (SERP) in a way that maximizes the total reward. Instead of relying on human judges, our system leverages implicit user feedback to learn the model. The method is efficient to implement and can be applied to verticals of very different natures. We have successfully deployed the system to three different markets, and it handles multiple verticals in each market. The system is now serving hundreds of millions of queries live each day and has improved user metrics considerably.
{"title":"A unified search federation system based on online user feedback","authors":"Luo Jie, Sudarshan Lamkhede, Rochit Sapra, Evans Hsu, Helen Song, Yi Chang","doi":"10.1145/2487575.2488198","DOIUrl":"https://doi.org/10.1145/2487575.2488198","url":null,"abstract":"Today's popular web search engines expand the search process beyond crawled web pages to specialized corpora (\"verticals\") like images, videos, news, local, sports, finance, shopping etc., each with its own specialized search engine. Search federation deals with problems of the selection of search engines to query and merging of their results into a single result set. Despite a few recent advances, the problem is still very challenging. First, due to the heterogeneous nature of different verticals, how the system merges the vertical results with the web documents to serve the user's information need is still an open problem. Moreover, the scale of the search engine and the increasing number of vertical properties requires a solution which is efficient and scaleable. In this paper, we propose a unified framework for the search federation problem. We model the search federation as a contextual bandit problem. The system uses reward as a proxy for user satisfaction. Given a query, our system predicts the expected reward for each vertical, then organizes the search result page (SERP) in a way which maximizes the total reward. Instead of relying on human judges, our system leverages implicit user feedback to learn the model. The method is efficient to implement and can be applied to verticals of different nature. We have successfully deployed the system to three different markets, and it handles multiple verticals in each market. The system is now serving hundreds of millions of queries live each day, and has improved user metrics considerably.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81129879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning algorithms that embed objects into Euclidean space have become the methods of choice for a wide range of problems, ranging from recommendation and image search to playlist prediction and language modeling. Probabilistic embedding methods provide elegant approaches to these problems, but can be expensive to train and store as a large monolithic model. In this paper, we propose a method that trains not one monolithic model, but multiple local embeddings for a class of pairwise conditional models especially suited for sequence and co-occurrence modeling. We show that computation and memory for training these multi-space models can be efficiently parallelized over many nodes of a cluster. Focusing on sequence modeling for music playlists, we show that the method substantially speeds up training while maintaining high model quality.
{"title":"Multi-space probabilistic sequence modeling","authors":"Shuo Chen, Jiexun Xu, T. Joachims","doi":"10.1145/2487575.2487632","DOIUrl":"https://doi.org/10.1145/2487575.2487632","url":null,"abstract":"Learning algorithms that embed objects into Euclidean space have become the methods of choice for a wide range of problems, ranging from recommendation and image search to playlist prediction and language modeling. Probabilistic embedding methods provide elegant approaches to these problems, but can be expensive to train and store as a large monolithic model. In this paper, we propose a method that trains not one monolithic model, but multiple local embeddings for a class of pairwise conditional models especially suited for sequence and co-occurrence modeling. We show that computation and memory for training these multi-space models can be efficiently parallelized over many nodes of a cluster. Focusing on sequence modeling for music playlists, we show that the method substantially speeds up training while maintaining high model quality.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"162 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80242922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The effectiveness of existing top-N recommendation methods decreases as the sparsity of the datasets increases. To alleviate this problem, we present an item-based method for generating top-N recommendations that learns the item-item similarity matrix as the product of two low-dimensional latent factor matrices. These matrices are learned using a structural equation modeling approach, wherein the value being estimated is not used for its own estimation. A comprehensive set of experiments on multiple datasets at three different sparsity levels indicates that the proposed method handles sparse datasets effectively and outperforms other state-of-the-art top-N recommendation methods. The experimental results also show that the relative performance gains over competing methods increase as the data gets sparser.
{"title":"FISM: factored item similarity models for top-N recommender systems","authors":"Santosh Kabbur, Xia Ning, G. Karypis","doi":"10.1145/2487575.2487589","DOIUrl":"https://doi.org/10.1145/2487575.2487589","url":null,"abstract":"The effectiveness of existing top-N recommendation methods decreases as the sparsity of the datasets increases. To alleviate this problem, we present an item-based method for generating top-N recommendations that learns the item-item similarity matrix as the product of two low dimensional latent factor matrices. These matrices are learned using a structural equation modeling approach, wherein the value being estimated is not used for its own estimation. A comprehensive set of experiments on multiple datasets at three different sparsity levels indicate that the proposed methods can handle sparse datasets effectively and outperforms other state-of-the-art top-N recommendation methods. The experimental results also show that the relative performance gains compared to competing methods increase as the data gets sparser.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78359623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online reviews have been widely adopted in many applications. Since they can either promote or harm the reputation of a product or service, buying and selling fake reviews has become a profitable business and a serious threat. In this paper, we introduce a very simple but powerful review spamming technique that can easily defeat existing feature-based detection algorithms. It uses one truthful review as a template and replaces its sentences with those from other reviews in a repository. Fake reviews generated by this mechanism are extremely hard to detect: both state-of-the-art computational approaches and human readers incur an error rate of 35%-48%, only slightly better than a random guess. While detecting such fake reviews is challenging, we have made solid progress in suppressing them. We develop a novel defense method that leverages the difference in semantic flow between synthetic and truthful reviews, which reduces the detection error rate to approximately 22%, a significant improvement over existing approaches. Nevertheless, further decreasing the error rate remains a challenging research task. Synthetic Review Spamming Demo: www.cs.ucsb.edu/~alex_morales/reviewspam/
{"title":"Synthetic review spamming and defense","authors":"Huan Sun, Alex Morales, Xifeng Yan","doi":"10.1145/2487575.2487688","DOIUrl":"https://doi.org/10.1145/2487575.2487688","url":null,"abstract":"Online reviews have been popularly adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews becomes a profitable business and a big threat. In this paper, we introduce a very simple, but powerful review spamming technique that could fail the existing feature-based detection algorithms easily. It uses one truthful review as a template, and replaces its sentences with those from other reviews in a repository. Fake reviews generated by this mechanism are extremely hard to detect: Both the state-of-the-art computational approaches and human readers acquire an error rate of 35%-48%, just slightly better than a random guess. While it is challenging to detect such fake reviews, we have made solid progress in suppressing them. A novel defense method that leverages the difference of semantic flows between synthetic and truthful reviews is developed, which is able to reduce the detection error rate to approximately 22%, a significant improvement over the performance of existing approaches. Nevertheless, it is still a challenging research task to further decrease the error rate. Synthetic Review Spamming Demo: www.cs.ucsb.edu/~alex_morales/reviewspam/","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"268 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77695058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Tran, Dinh Q. Phung, Wei Luo, R. Harvey, M. Berk, S. Venkatesh
Suicide is a major concern in society. Despite great attention from the community and very substantive medico-legal implications, there has been no satisfactory method that can reliably predict future attempted or completed suicide. We present an integrated machine learning framework to tackle this challenge. Our proposed framework consists of a novel feature extraction scheme, an embedded feature selection process, a set of risk classifiers, and finally a risk calibration procedure. For temporal feature extraction, we cast the patient's clinical history into a temporal image to which a bank of one-sided filters is applied. The responses are partly transformed into mid-level features and then selected in an l1-norm framework under extreme value theory. A set of probabilistic ordinal risk classifiers is then applied to compute the risk probabilities and further re-rank the features. Finally, the predicted risks are calibrated. Together with our Australian partner, we perform a comprehensive study on data collected for a mental health cohort, and the experiments validate that our proposed framework outperforms risk assessment instruments used by medical practitioners.
{"title":"An integrated framework for suicide risk prediction","authors":"T. Tran, Dinh Q. Phung, Wei Luo, R. Harvey, M. Berk, S. Venkatesh","doi":"10.1145/2487575.2488196","DOIUrl":"https://doi.org/10.1145/2487575.2488196","url":null,"abstract":"Suicide is a major concern in society. Despite of great attention paid by the community with very substantive medico-legal implications, there has been no satisfying method that can reliably predict the future attempted or completed suicide. We present an integrated machine learning framework to tackle this challenge. Our proposed framework consists of a novel feature extraction scheme, an embedded feature selection process, a set of risk classifiers and finally, a risk calibration procedure. For temporal feature extraction, we cast the patient's clinical history into a temporal image to which a bank of one-side filters are applied. The responses are then partly transformed into mid-level features and then selected in l1-norm framework under the extreme value theory. A set of probabilistic ordinal risk classifiers are then applied to compute the risk probabilities and further re-rank the features. Finally, the predicted risks are calibrated. Together with our Australian partner, we perform comprehensive study on data collected for the mental health cohort, and the experiments validate that our proposed framework outperforms risk assessment instruments by medical practitioners.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88956659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}