In this paper, we study 'networked bandits', a new bandit problem in which a set of interrelated arms varies over time and selecting one arm, given the contextual information, also invokes other correlated arms. This problem remains under-investigated despite its applicability to many practical settings. For instance, in social networks, selecting a user (arm) can yield payoffs from both that user and the user's relations, since content is often shared through the network. We examine whether it is possible to obtain multiple payoffs from several correlated arms based on these relationships. In particular, we formalize the networked bandit problem and propose an algorithm that considers not only the selected arm but also the relationships between arms. Our algorithm follows the 'optimism in the face of uncertainty' principle: it selects an arm based on integrated confidence sets constructed from historical data. We analyze its performance in simulation experiments and on two real-world offline datasets. The experimental results demonstrate the algorithm's effectiveness in the networked bandit setting.
{"title":"Networked bandits with disjoint linear payoffs","authors":"Meng Fang, D. Tao","doi":"10.1145/2623330.2623672","DOIUrl":"https://doi.org/10.1145/2623330.2623672","url":null,"abstract":"In this paper, we study `networked bandits', a new bandit problem where a set of interrelated arms varies over time and, given the contextual information that selects one arm, invokes other correlated arms. This problem remains under-investigated, in spite of its applicability to many practical problems. For instance, in social networks, an arm can obtain payoffs from both the selected user and its relations since they often share the content through the network. We examine whether it is possible to obtain multiple payoffs from several correlated arms based on the relationships. In particular, we formalize the networked bandit problem and propose an algorithm that considers not only the selected arm, but also the relationships between arms. Our algorithm is `optimism in face of uncertainty' style, in that it decides an arm depending on integrated confidence sets constructed from historical data. We analyze the performance in simulation experiments and on two real-world offline datasets. The experimental results demonstrate our algorithm's effectiveness in the networked bandit setting.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73863963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid development of online social networks, a growing number of people are willing to share their group activities, e.g., having dinner with colleagues or watching movies with a spouse. This motivates studies of group recommendation, which aims to recommend items to a group of users. Group recommendation is a challenging problem because different group members have different preferences, and how to trade off these preferences when making recommendations is still an open problem. In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities and make group recommendations. Intuitively, users in a group may have different degrees of influence, and those who are experts in topics relevant to the group are usually more influential. In addition, users in a group may behave differently as group members than they do as individuals. COM is designed based on these intuitions, and it is able to incorporate both users' selection histories and their personal consideration of content factors. When making recommendations, COM estimates the preference of a group for an item by aggregating the preferences of the group members with different weights. We conduct extensive experiments on four datasets, and the results show that the proposed model is effective in making group recommendations and significantly outperforms baseline methods.
{"title":"COM: a generative model for group recommendation","authors":"Quan Yuan, G. Cong, Chin-Yew Lin","doi":"10.1145/2623330.2623616","DOIUrl":"https://doi.org/10.1145/2623330.2623616","url":null,"abstract":"With the rapid development of online social networks, a growing number of people are willing to share their group activities, e.g. having dinners with colleagues, and watching movies with spouses. This motivates the studies on group recommendation, which aims to recommend items for a group of users. Group recommendation is a challenging problem because different group members have different preferences, and how to make a trade-off among their preferences for recommendation is still an open problem. In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities, and make group recommendations. Intuitively, users in a group may have different influences, and those who are expert in topics relevant to the group are usually more influential. In addition, users in a group may behave differently as group members from as individuals. COM is designed based on these intuitions, and is able to incorporate both users' selection history and personal considerations of content factors. When making recommendations, COM estimates the preference of a group to an item by aggregating the preferences of the group members with different weights. We conduct extensive experiments on four datasets, and the results show that the proposed model is effective in making group recommendations, and outperforms baseline methods significantly.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85244002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics for generating coherent topics. In practice, though, many document collections contain far fewer documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recent years, knowledge-based topic models have been proposed, which ask human users to provide prior domain knowledge to guide the model toward better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-links (meaning that two words should be in the same topic) and cannot-links (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge: wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.
{"title":"Mining topics in documents: standing on the shoulders of big data","authors":"Zhiyuan Chen, B. Liu","doi":"10.1145/2623330.2623622","DOIUrl":"https://doi.org/10.1145/2623330.2623622","url":null,"abstract":"Topic modeling has been widely used to mine topics from documents. However, a key weakness of topic modeling is that it needs a large amount of data (e.g., thousands of documents) to provide reliable statistics to generate coherent topics. However, in practice, many document collections do not have so many documents. Given a small number of documents, the classic topic model LDA generates very poor topics. Even with a large volume of data, unsupervised learning of topic models can still produce unsatisfactory results. In recently years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. Our research takes a radically different approach. We propose to learn as humans do, i.e., retaining the results learned in the past and using them to help future learning. When faced with a new task, we first mine some reliable (prior) knowledge from the past learning/modeling results and then use it to guide the model inference to generate more coherent topics. This approach is possible because of the big data readily available on the Web. The proposed algorithm mines two forms of knowledge: must-link (meaning that two words should be in the same topic) and cannot-link (meaning that two words should not be in the same topic). It also deals with two problems of the automatically mined knowledge, i.e., wrong knowledge and knowledge transitivity. Experimental results using review documents from 100 product domains show that the proposed approach makes dramatic improvements over state-of-the-art baselines.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85403961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It has traditionally been a challenge for home buyers to understand, compare, and contrast the investment values of real estate properties. While a number of estate appraisal methods have been developed to value real property, the performance of these methods has been limited by the traditional data sources for estate appraisal. However, with the development of new ways of collecting estate-related mobile data, there is a potential to leverage the geographic dependencies of estates to enhance estate appraisal. Indeed, the geographic dependencies of an estate's value can come from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper we propose a geographic method, named ClusRanking, for estate appraisal that leverages the mutual reinforcement of ranking and clustering. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas via ClusRanking. We then use a linear model to fuse these three influential factors and predict estate investment values. Moreover, we simultaneously consider individual, peer, and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Finally, we conduct a comprehensive evaluation with real-world estate-related data, and the experimental results demonstrate the effectiveness of our method.
{"title":"Exploiting geographic dependencies for real estate appraisal: a mutual perspective of ranking and clustering","authors":"Yanjie Fu, Hui Xiong, Yong Ge, Zijun Yao, Yu Zheng, Zhi-Hua Zhou","doi":"10.1145/2623330.2623675","DOIUrl":"https://doi.org/10.1145/2623330.2623675","url":null,"abstract":"It is traditionally a challenge for home buyers to understand, compare and contrast the investment values of real estates. While a number of estate appraisal methods have been developed to value real property, the performances of these methods have been limited by the traditional data sources for estate appraisal. However, with the development of new ways of collecting estate-related mobile data, there is a potential to leverage geographic dependencies of estates for enhancing estate appraisal. Indeed, the geographic dependencies of the value of an estate can be from the characteristics of its own neighborhood (individual), the values of its nearby estates (peer), and the prosperity of the affiliated latent business area (zone). To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power. ClusRanking is able to exploit geographic individual, peer, and zone dependencies in a probabilistic ranking model. Specifically, we first extract the geographic utility of estates from geography data, estimate the neighborhood popularity of estates by mining taxicab trajectory data, and model the influence of latent business areas via ClusRanking. Also, we use a linear model to fuse these three influential factors and predict estate investment values. Moreover, we simultaneously consider individual, peer and zone dependencies, and derive an estate-specific ranking likelihood as the objective function. Finally, we conduct a comprehensive evaluation with real-world estate related data, and the experimental results demonstrate the effectiveness of our method.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82278591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. RTB allows advertisers to bid on a display ad impression in real time, as it is being generated. It goes beyond contextual advertising by basing bidding on user data, and it differs from the sponsored search auction, where the bid price is associated with keywords. For the demand side, a fundamental technical challenge is to automate the bidding process based on the budget, the campaign objective, and various information gathered at runtime and from historical data. In this paper, programmatic bidding is cast as a functional optimisation problem. Under certain dependency assumptions, we derive simple bidding functions that can be calculated in real time; our findings show that the optimal bid has a non-linear relationship with impression-level evaluations, such as the click-through rate and the conversion rate, which are estimated in real time from impression-level features. This differs from previous work, which has mainly focused on linear bidding functions. Our mathematical derivation suggests that optimal bidding strategies should try to bid on more impressions rather than focus on a small set of highly valued impressions, because, according to current RTB market data, lower-valued impressions are more cost-effective than higher-valued ones and the chances of winning them are relatively higher. Aside from the theoretical insights, offline experiments on a real dataset and online experiments on a production RTB system verify the effectiveness of our proposed optimal bidding strategies and the functional optimisation framework.
{"title":"Optimal real-time bidding for display advertising","authors":"Weinan Zhang, Shuai Yuan, Jun Wang","doi":"10.1145/2623330.2623633","DOIUrl":"https://doi.org/10.1145/2623330.2623633","url":null,"abstract":"In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. RTB allows advertisers to bid on a display ad impression in real time when it is being generated. It goes beyond contextual advertising by motivating the bidding focused on user data and it is different from the sponsored search auction where the bid price is associated with keywords. For the demand side, a fundamental technical challenge is to automate the bidding process based on the budget, the campaign objective and various information gathered in runtime and in history. In this paper, the programmatic bidding is cast as a functional optimisation problem. Under certain dependency assumptions, we derive simple bidding functions that can be calculated in real time; our finding shows that the optimal bid has a non-linear relationship with the impression level evaluation such as the click-through rate and the conversion rate, which are estimated in real time from the impression level features. This is different from previous work that is mainly focused on a linear bidding function. Our mathematical derivation suggests that optimal bidding strategies should try to bid more impressions rather than focus on a small set of high valued impressions because according to the current RTB market data, compared to the higher evaluated impressions, the lower evaluated ones are more cost effective and the chances of winning them are relatively higher. Aside from the theoretical insights, offline experiments on a real dataset and online experiments on a production RTB system verify the effectiveness of our proposed optimal bidding strategies and the functional optimisation framework.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80543715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cardiac disease is the leading cause of death around the world, with ischemic heart disease alone claiming 7 million lives in 2011. This burden can be attributed, in part, to the absence of biomarkers that can reliably identify high-risk patients and match them to treatments that are appropriate for them. In recent clinical studies, we have demonstrated the ability of computation to extract information with substantial prognostic utility that is typically disregarded in time-series data collected from cardiac patients. Of particular interest are subtle variations in long-term electrocardiographic (ECG) data that are usually overlooked as noise but provide a useful assessment of myocardial instability. In multiple clinical cohorts, we have developed the pathophysiological basis for studying probabilistic variations in long-term ECG and demonstrated the ability of this information to effectively risk-stratify patients at risk of dying following heart attacks. In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use on large datasets or energy-constrained embedded devices. Our basic approach to uncovering pathological structure within the ECG focuses on characterizing beat-to-beat time-warped shape deformations of the ECG using a modified dynamic time-warping (DTW) and Lomb-Scargle periodogram-based algorithm. As part of our efforts to scale this work up, we explore a novel approach to address the quadratic runtime of DTW. We achieve this by developing the idea of adaptive downsampling to reduce the size of the inputs presented to DTW, and describe changes to the dynamic programming problem underlying DTW to exploit adaptively downsampled ECG signals. When evaluated on data from 765 patients in the DISPERSE2-TIMI33 trial, our results show that high morphologic variability is associated with an 8- to 9-fold increased risk of death within 90 days of a heart attack. Moreover, the use of adaptive downsampling with a modified DTW formulation achieves a 7- to almost 20-fold reduction in runtime relative to DTW, without a significant change in biomarker discrimination.
{"title":"Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks","authors":"Chih-Chun Chia, Z. Syed","doi":"10.1145/2623330.2623702","DOIUrl":"https://doi.org/10.1145/2623330.2623702","url":null,"abstract":"Cardiac disease is the leading cause of death around the world; with ischemic heart disease alone claiming 7 million lives in 2011. This burden can be attributed, in part, to the absence of biomarkers that can reliably identify high risk patients and match them to treatments that are appropriate for them. In recent clinical studies, we have demonstrated the ability of computation to extract information with substantial prognostic utility that is typically disregarded in time-series data collected from cardiac patients. Of particular interest are subtle variations in long-term electrocardiographic (ECG) data that are usually overlooked as noise but provide a useful assessment of myocardial instability. In multiple clinical cohorts, we have developed the pathophysiological basis for studying probabilistic variations in long-term ECG and demonstrated the ability of this information to effectively risk stratify patients at risk of dying following heart attacks. In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use in large datasets or energy constrained embedded devices. Our basic approach to uncovering pathological structure within the ECG focuses on characterizing beat-to-beat time-warped shape deformations of the ECG using a modified dynamic time-warping (DTW) and Lomb-Scargle periodogram-based algorithm. As part of our efforts to scale this work up, we explore a novel approach to address the quadratic runtime of DTW. We achieve this by developing the idea of adaptive downsampling to reduce the size of the inputs presented to DTW, and describe changes to the dynamic programming problem underlying DTW to exploit adaptively downsampled ECG signals. When evaluated on data from 765 patients in the DISPERSE2-TIMI33 trial, our results show that high morphologic variability is associated with an 8- to 9-fold increased risk of death within 90 days of a heart attack. Moreover, the use of adaptive downsampling with a modified DTW formulation achieves a 7- to almost 20-fold reduction in runtime relative to DTW, without a significant change in biomarker discrimination.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80559556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Crime reduction and prevention strategies are essential to increase public safety and reduce the cost of crime to society. Law enforcement agencies have long realized the importance of analyzing co-offending networks---networks of offenders who have committed crimes together---for this purpose. Although network structure can contribute significantly to co-offence prediction, research in this area is very limited. Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets, which are used to classify potential negative and positive pairs of offenders. Like other social networks, co-offending networks suffer from a highly skewed distribution of positive and negative pairs. To address the class imbalance problem, we identify three types of criminal cooperation opportunities that help to reduce the class imbalance ratio significantly while retaining half of the co-offences. The proposed framework is evaluated on a large crime dataset for the Province of British Columbia, Canada. Our experimental evaluation of four different feature sets shows that the novel geo-social features are the best predictors. Overall, we experimentally demonstrate the high effectiveness of the proposed co-offence prediction framework. We believe that our framework will not only allow law enforcement agencies to improve their crime reduction and prevention strategies, but will also offer new criminological insights into criminal link formation between offenders.
{"title":"Spatially embedded co-offence prediction using supervised learning","authors":"M. A. Tayebi, M. Ester, U. Glässer, P. Brantingham","doi":"10.1145/2623330.2623353","DOIUrl":"https://doi.org/10.1145/2623330.2623353","url":null,"abstract":"Crime reduction and prevention strategies are essential to increase public safety and reduce the crime costs to society. Law enforcement agencies have long realized the importance of analyzing co-offending networks---networks of offenders who have committed crimes together---for this purpose. Although network structure can contribute significantly to co-offence prediction, research in this area is very limited. Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets which are used for classifying potential negative and positive pairs of offenders. Similar to other social networks, co-offending networks also suffer from a highly skewed distribution of positive and negative pairs. To address the class imbalance problem, we identify three types of criminal cooperation opportunities which help to reduce the class imbalance ratio significantly, while keeping half of the co-offences. The proposed framework is evaluated on a large crime dataset for the Province of British Columbia, Canada. Our experimental evaluation of four different feature sets show that the novel geo-social features are the best predictors. Overall, we experimentally show the high effectiveness of the proposed co-offence prediction framework. We believe that our framework will not only allow law enforcement agencies to improve their crime reduction and prevention strategies, but also offers new criminological insights into criminal link formation between offenders.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"146 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80567840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given a social network, can we quickly 'zoom out' of the graph? Is there a smaller, equivalent representation of the graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? These are important problems with applications in influence analysis, epidemiology, and viral marketing. In this paper, we first formulate a novel Graph Coarsening Problem: find a succinct representation of any graph while preserving key characteristics of diffusion processes on that graph. We then provide COARSENET, a fast and effective near-linear-time (in the number of nodes and edges) algorithm for this problem. Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, which enables us to reduce the graph by 90% in some cases without much loss of information. Finally, we show how our method can help in diverse applications such as influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data.
{"title":"Fast influence-based coarsening for large networks","authors":"Manish Purohit, B. Prakash, Chanhyun Kang, Yao Zhang, V. S. Subrahmanian","doi":"10.1145/2623330.2623701","DOIUrl":"https://doi.org/10.1145/2623330.2623701","url":null,"abstract":"Given a social network, can we quickly 'zoom-out' of the graph? Is there a smaller equivalent representation of the graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? These are important problems with applications to influence analysis, epidemiology and viral marketing applications. In this paper, we first formulate a novel Graph Coarsening Problem to find a succinct representation of any graph while preserving key characteristics for diffusion processes on that graph. We then provide a fast and effective near-linear-time (in nodes and edges) algorithm COARSENET for the same. Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, enabling us to reduce the graph by 90% in some cases without much loss of information. Finally we also show how our method can help in diverse applications like influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83073710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Purchasing decisions in many product categories are heavily influenced by the shopper's aesthetic preferences. It is insufficient to simply match a shopper with popular items from the category in question; a successful shopping experience also identifies products that match those aesthetics. The challenge of capturing shoppers' styles becomes more difficult as the size and diversity of the marketplace increase. At Etsy, an online marketplace for handmade and vintage goods with over 30 million diverse listings, the problem of capturing taste is particularly important: users come to the site specifically to find items that match their eclectic styles. In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. We use Latent Dirichlet Allocation (LDA) to discover trending categories and styles on Etsy, which are then used to describe a user's "interest" profile. We also explore hashing methods to perform fast nearest-neighbor search on a map-reduce framework, in order to obtain recommendations efficiently. These techniques have been implemented successfully at very large scale, substantially improving many key business metrics.
{"title":"Style in the long tail: discovering unique interests with latent variable models in large scale social E-commerce","authors":"D. Hu, Robert J. Hall, Josh Attenberg","doi":"10.1145/2623330.2623338","DOIUrl":"https://doi.org/10.1145/2623330.2623338","url":null,"abstract":"Purchasing decisions in many product categories are heavily influenced by the shopper's aesthetic preferences. It's insufficient to simply match a shopper with popular items from the category in question; a successful shopping experience also identifies products that match those aesthetics. The challenge of capturing shoppers' styles becomes more difficult as the size and diversity of the marketplace increases. At Etsy, an online marketplace for handmade and vintage goods with over 30 million diverse listings, the problem of capturing taste is particularly important -- users come to the site specifically to find items that match their eclectic styles. In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. We use Latent Dirichlet Allocation (LDA) to discover trending categories and styles on Etsy, which are then used to describe a user's \"interest\" profile. We also explore hashing methods to perform fast nearest neighbor search on a map-reduce framework, in order to efficiently obtain recommendations. These techniques have been implemented successfully at very large scale, substantially improving many key business metrics.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83077151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correlation clustering is a basic primitive in the data miner's toolkit, with applications ranging from entity matching to social network analysis. The goal in correlation clustering is, given a graph with signed edges, to partition the nodes into clusters so as to minimize the number of disagreements. In this paper we present a new algorithm for correlation clustering. Our algorithm is easily implementable in computational models such as MapReduce and streaming, and runs in a small number of rounds. In addition, we show that it obtains an almost 3-approximation to the optimal correlation clustering. Experiments on huge graphs demonstrate the scalability of our algorithm and its applicability to data mining problems.
{"title":"Correlation clustering in MapReduce","authors":"Flavio Chierichetti, Nilesh N. Dalvi, Ravi Kumar","doi":"10.1145/2623330.2623743","DOIUrl":"https://doi.org/10.1145/2623330.2623743","url":null,"abstract":"Correlation clustering is a basic primitive in data miner's toolkit with applications ranging from entity matching to social network analysis. The goal in correlation clustering is, given a graph with signed edges, partition the nodes into clusters to minimize the number of disagreements. In this paper we obtain a new algorithm for correlation clustering. Our algorithm is easily implementable in computational models such as MapReduce and streaming, and runs in a small number of rounds. In addition, we show that our algorithm obtains an almost 3-approximation to the optimal correlation clustering. Experiments on huge graphs demonstrate the scalability of our algorithm and its applicability to data mining problems.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83123825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}