Gregory Ference, Wang-Chien Lee, Hui-Ju Hung, De-Nian Yang
For many location-based service applications that prefer diverse results, finding locations that are spatially diverse yet close to a query point (e.g., the current location of a user) can be more useful than finding the k nearest neighbors/locations. In this paper, we investigate the problem of searching for the k Diverse-Near Neighbors (kDNNs) in a spatial space, based upon both the spatial diversity and the proximity of candidate locations to the query point. While employing a conventional distance measure for proximity, we develop a new and intuitive diversity metric based upon the variance of the angles among the candidate locations with respect to the query point. Accordingly, we create a dynamic programming algorithm that finds the optimal kDNNs. Unfortunately, the dynamic programming algorithm, with a time complexity of O(kn^3), incurs excessive computational cost. Therefore, we further propose two heuristic algorithms, namely Distance-based Browsing (DistBrow) and Diversity-based Browsing (DivBrow), which are highly effective yet efficient, exploring the search space prioritized by proximity to the query point and by spatial diversity, respectively. Using real and synthetic datasets, we conduct a comprehensive performance evaluation. The results show that DistBrow and DivBrow achieve superior effectiveness compared to state-of-the-art algorithms while maintaining high efficiency.
{"title":"Spatial search for K diverse-near neighbors","authors":"Gregory Ference, Wang-Chien Lee, Hui-Ju Hung, De-Nian Yang","doi":"10.1145/2505515.2505747","DOIUrl":"https://doi.org/10.1145/2505515.2505747","url":null,"abstract":"To many location-based service applications that prefer diverse results, finding locations that are spatially diverse and close in proximity to a query point (e.g., the current location of a user) can be more useful than finding the k nearest neighbors/locations. In this paper, we investigate the problem of searching for the k Diverse-Near Neighbors (kDNNs)} in spatial space that is based upon the spatial diversity and proximity of candidate locations to the query point. While employing a conventional distance measure for proximity, we develop a new and intuitive diversity metric based upon the variance of the angles among the candidate locations with respect to the query point. Accordingly, we create a dynamic programming algorithm that finds the optimal kDNNs. Unfortunately, the dynamic programming algorithm, with a time complexity of O(kn3), incurs excessive computational cost. Therefore, we further propose two heuristic algorithms, namely, Distance-based Browsing (DistBrow) and Diversity-based Browsing (DivBrow) that provide high effectiveness while being efficient by exploring the search space prioritized upon the proximity to the query point and spatial diversity, respectively. Using real and synthetic datasets, we conduct a comprehensive performance evaluation. The results show that DistBrow and DivBrow have superior effectiveness compared to state-of-the-art algorithms while maintaining high efficiency.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73891715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Histograms have been extensively used for selectivity estimation by academics and have been successfully adopted by the database industry. However, the estimation error is usually large for skewed distributions and biased attributes, which are typical in real-world data. Therefore, we propose effective models to quantitatively measure bias and selectivity based on information entropy. These models, together with the principle of maximum entropy, are then used to develop a class of entropy-based histograms. Moreover, since entropy can be computed incrementally, we present incremental variations of our algorithms that reduce the complexity of histogram construction from quadratic to linear. We conducted an extensive set of experiments with both synthetic and real-world datasets to compare the accuracy and efficiency of our proposed techniques with many other histogram-based techniques, showing the superiority of the entropy-based approaches for both equality and range queries.
{"title":"Entropy-based histograms for selectivity estimation","authors":"Hien To, Kuorong Chiang, C. Shahabi","doi":"10.1145/2505515.2505756","DOIUrl":"https://doi.org/10.1145/2505515.2505756","url":null,"abstract":"Histograms have been extensively used for selectivity estimation by academics and have successfully been adopted by database industry. However, the estimation error is usually large for skewed distributions and biased attributes, which are typical in real-world data. Therefore, we propose effective models to quantitatively measure bias and selectivity based on information entropy. These models together with the principles of maximum entropy are then used to develop a class of entropy-based histograms. Moreover, since entropy can be computed incrementally, we present the incremental variations of our algorithms that reduce the complexities of the histogram construction from quadratic to linear. We conducted an extensive set of experiments with both synthetic and real-world datasets to compare the accuracy and efficiency of our proposed techniques with many other histogram-based techniques, showing the superiority of the entropy-based approaches for both equality and range queries.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"584 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75238299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For keyword search on XML data, a list of query results in the form of subtrees is traditionally returned to users. However, we find that this is still not sufficient to meet users' information needs because: (1) the search intention of a given keyword query varies from person to person; and (2) the query results may have sibling or containment relationships among them (in the context of the whole XML database), which could be important for users to digest the results and should be shown to them. Therefore, we equip the traditional XML keyword search engine with our new exploration model, XMAP, providing users an interactive and novel way to explore the results with a better user experience.
{"title":"Exploring XML data is as easy as using maps","authors":"Yong Zeng, Z. Bao, Guoliang Li, T. Ling","doi":"10.1145/2505515.2508201","DOIUrl":"https://doi.org/10.1145/2505515.2508201","url":null,"abstract":"For keyword search on XML data, traditionally, a list of query results in the form of subtrees will be returned to users. However, we find that it is still not sufficient to meet users' information needs because: (1) the search intention of a certain keyword query varies from person to person; (2) amongst the query results, they may have sibling or containment relationships (in the context of whole XML database), which could be important for users to digest the query results and should be shown to users. Therefore, we try to equip the traditional XML keyword search engine with our new exploration model XMAP, providing user an interactive yet novel way to explore the results with better user experience.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"17 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72563739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Gamon, T. Yano, Xinying Song, Johnson Apacible, Patrick Pantel
We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of the entities is central to the document, which can lead to degraded relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To cost-effectively train the models, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.
{"title":"Identifying salient entities in web pages","authors":"Michael Gamon, T. Yano, Xinying Song, Johnson Apacible, Patrick Pantel","doi":"10.1145/2505515.2505602","DOIUrl":"https://doi.org/10.1145/2505515.2505602","url":null,"abstract":"We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of entities are central to the document, which can lead to degraded relevance for entity triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To cost-effectively train the models, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78225908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structured knowledge bases are an increasingly important way of storing and retrieving information. Within such knowledge bases, an important search task is finding similar entities based on one or more example entities. We present QBEES, a novel framework for defining entity similarity based only on structural features of the entities, so-called aspects; the framework includes both query-dependent and query-independent entity ranking components. We present evaluation results on a number of existing entity list completion benchmarks, comparing against several state-of-the-art baselines.
{"title":"QBEES: query by entity examples","authors":"S. Metzger, Ralf Schenkel, M. Sydow","doi":"10.1145/2505515.2507873","DOIUrl":"https://doi.org/10.1145/2505515.2507873","url":null,"abstract":"Structured knowledge bases are an increasingly important way for storing and retrieving information. Within such knowledge bases, an important search task is finding similar entities based on one or more example entities. We present QBEES, a novel framework for defining entity similarity based only on structural features, so-called aspects, of the entities, that includes query-dependent and query-independent entity ranking components. We present evaluation results with a number of existing entity list completion benchmarks, comparing to several state-of-the-art baselines.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75892252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The average of the customer ratings on a product, which we call its reputation, is one of the key factors in the decision to purchase the product online. There is, however, no guarantee of the trustworthiness of the reputation, since it can be manipulated rather easily. In this paper, we define false reputation as the problem of a reputation being manipulated by unfair ratings, and we design a general framework that provides a trustable reputation. For this purpose, we propose TRUEREPUTATION, an algorithm that iteratively adjusts the reputation based on the confidence of customer ratings.
{"title":"Trustable aggregation of online ratings","authors":"Hyun-Kyo Oh, Sang-Wook Kim, Sunju Park, M. Zhou","doi":"10.1145/2505515.2507863","DOIUrl":"https://doi.org/10.1145/2505515.2507863","url":null,"abstract":"The average of the customer ratings on the product, which we call reputation, is one of the key factors in online purchasing decision of a product. There is, however, no guarantee in the trustworthiness of the reputation since it can be manipulated rather easily. In this paper, we define false reputation as the problem of the reputation to be manipulated by unfair ratings, and design a general framework that provides trustable reputation. For this purpose, we propose TRUEREPUTATION, an algorithm that iteratively adjusts the reputation based on the confidence of customer ratings.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74962508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yusheng Xie, Zhengzhang Chen, Ankit Agrawal, A. Choudhary, Lu Liu
We investigate sampling techniques for unbalanced heterogeneous bipartite graphs (UHBGs), which have wide applications in real-world, web-scale social networks. We propose random walk-based link sampling and stratified sampling for UHBGs and show that they have advantages over generic random walk samplers. In addition, we analytically derive each sampler's node-degree-distribution parameter estimator statistic, to be used as a quality indicator. In the experiments, we apply the two sampling techniques, along with a baseline node sampling method, to both synthetic and real Facebook data. The experimental results show that the random walk-based stratified sampler has a significant advantage over the node sampler and the link sampler on UHBGs.
{"title":"Random walk-based graphical sampling in unbalanced heterogeneous bipartite social graphs","authors":"Yusheng Xie, Zhengzhang Chen, Ankit Agrawal, A. Choudhary, Lu Liu","doi":"10.1145/2505515.2507822","DOIUrl":"https://doi.org/10.1145/2505515.2507822","url":null,"abstract":"We investigate sampling techniques in unbalanced heterogeneous bipartite graphs (UHBGs), which have wide applications in real world web-scale social networks. We propose random walked-based link sampling and stratified sampling for UHBGs and show that they have advantages over generic random walk samplers. In addition, each sampler's node degree distribution parameter estimator statistic is analytically derived to be used as a quality indicator. In the experiments, we apply the two sampling techniques, with a baseline node sampling method, to both synthetic and real Facebook data. The experimental results show that random walk-based stratified sampler has significant advantage over node sampler and link sampler on UHBGs.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80145395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The dramatic rate at which new digital content becomes available has brought collaborative filtering systems to the epicenter of computer science research over the last decade. One of the greatest challenges collaborative filtering systems are confronted with is the data sparsity problem: users typically rate only very few items, so the available historical data is not adequate for effective prediction. To alleviate these issues, in this paper we propose a novel multitask collaborative filtering approach. Our approach is based on a coupled latent factor model of the users' rating functions, which allows us to devise an agile information-sharing mechanism that extracts much richer task-correlation information than existing approaches. The formulation of our method is based on concepts from the field of Bayesian nonparametrics, specifically Indian Buffet Process priors, which allow for data-driven determination of the optimal number of underlying latent features (item characteristics and user traits) assumed in the context of the model. We experiment on several real-world datasets, demonstrating both the efficacy of our method and its superiority over existing approaches.
{"title":"Nonparametric bayesian multitask collaborative filtering","authors":"S. Chatzis","doi":"10.1145/2505515.2505517","DOIUrl":"https://doi.org/10.1145/2505515.2505517","url":null,"abstract":"The dramatic rates new digital content becomes available has brought collaborative filtering systems to the epicenter of computer science research in the last decade. One of the greatest challenges collaborative filtering systems are confronted with is the data sparsity problem: users typically rate only very few items; thus, availability of historical data is not adequate to effectively perform prediction. To alleviate these issues, in this paper we propose a novel multitask collaborative filtering approach. Our approach is based on a coupled latent factor model of the users rating functions, which allows for coming up with an agile information sharing mechanism that extracts much richer task-correlation information compared to existing approaches. Formulation of our method is based on concepts from the field of Bayesian nonparametrics, specifically Indian Buffet Process priors, which allow for data-driven determination of the optimal number of underlying latent features (item characteristics and user traits) assumed in the context of the model. We experiment on several real-world datasets, demonstrating both the efficacy of our method, and its superiority over existing approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80197138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a domain-dependent/independent topic switching model based on Bayesian probabilistic modeling for online product reviews that are accompanied by numerical ratings provided by users. In this model, each word is allocated to either a domain-dependent topic or a domain-independent topic, and the distribution of topics in a review is connected to its observed numerical rating via a linear regression model. Domain-dependent topics utilize the domain information observed with a corpus, while domain-independent topics utilize the framework of Bayesian nonparametrics, which can estimate the number of topics in the posterior distribution. The posterior distribution is estimated via collapsed Gibbs sampling. On real data, our proposed model achieved smaller mean square error and smaller average mean error with a small model size, and it converged in fewer iterations on a regression task involving online review ratings, outperforming a baseline model that does not consider domains. Moreover, the proposed model can tell us whether words are positive or negative in the form of continuous values, which allows us to extract domain-dependent and domain-independent sentiment words.
{"title":"Domain-dependent/independent topic switching model for online reviews with numerical ratings","authors":"Yasutoshi Ida, Takuma Nakamura, Takashi Matsumoto","doi":"10.1145/2505515.2505540","DOIUrl":"https://doi.org/10.1145/2505515.2505540","url":null,"abstract":"We propose a domain-dependent/independent topic switching model based on Bayesian probabilistic modeling for modeling online product reviews that are accompanied with numerical ratings provided by users. In this model, each word is allocated to a domain-dependent topic or a domain-independent topic, and the distribution of topics in an online review is connected to an observed numerical rating via a linear regression model. Domain-dependent topics utilize domain information observed with a corpus, and domain-independent topics utilize the framework of Bayesian Nonparametrics, which can estimate the number of topics in posterior distributions. The posterior distribution is estimated via collapsed Gibbs sampling. Using real data, our proposed model had smaller mean square error and smaller average mean error with a small model size and achieved convergence in fewer iterations for a regression task involving online review ratings, outperforming a baseline model that did not consider domains. Moreover, the proposed model can also tell us whether the words are positive or negative in the form of continuous values. This feature allows us to extract domain-dependent and -independent sentiment words.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80391731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stamatina Thomaidou, Ismini Lourentzou, Panagiotis Katsivelis-Perakis, M. Vazirgiannis
Products, services, or brands can be advertised alongside the search results in major search engines, while the smaller displays of devices such as tablets and smartphones have recently imposed the need for shorter ad texts. In this paper, we propose a method that automatically produces compact text ads (promotional text snippets), given a product description webpage (landing page) as input. The challenge is to produce a small yet comprehensive ad while maintaining relevance, clarity, and attractiveness. Our method includes the following phases. Initially, it extracts relevant and important n-grams (keywords) from the landing page. The retained keywords must carry a positive meaning in order to support a call-to-action style, so we apply sentiment analysis to them. Next, we build an Advertising Language Model to evaluate phrases in terms of their marketing appeal. We experiment with two variations of our method and show that they outperform all the baseline approaches.
{"title":"Automated snippet generation for online advertising","authors":"Stamatina Thomaidou, Ismini Lourentzou, Panagiotis Katsivelis-Perakis, M. Vazirgiannis","doi":"10.1145/2505515.2507876","DOIUrl":"https://doi.org/10.1145/2505515.2507876","url":null,"abstract":"Products, services or brands can be advertised alongside the search results in major search engines, while recently smaller displays on devices like tablets and smartphones have imposed the need for smaller ad texts. In this paper, we propose a method that produces in an automated manner compact text ads (promotional text snippets), given as input a product description webpage (landing page). The challenge is to produce a small comprehensive ad while maintaining at the same time relevance, clarity, and attractiveness. Our method includes the following phases. Initially, it extracts relevant and important n-grams (keywords) given the landing page. The keywords reserved must have a positive meaning in order to have a call-to-action style, thus we attempt sentiment analysis on them. Next, we build an Advertising Language Model to evaluate phrases in terms of their marketing appeal. We experiment with two variations of our method and we show that they outperform all the baseline approaches.","PeriodicalId":20528,"journal":{"name":"Proceedings of the 22nd ACM international conference on Information & Knowledge Management","volume":"151 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80518137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}