Prior research in resource selection for federated search mainly focused on selecting a small number of information sources that are most relevant to a user query. However, result novelty and diversification are largely unexplored, which does not reflect the various kinds of information needs of users in real world applications. This paper proposes two general approaches to model both result relevance and diversification in selecting sources, in order to provide more comprehensive coverage of multiple aspects of a user query. The first approach focuses on diversifying the document ranking on a centralized sample database before selecting information sources under the framework of Relevant Document Distribution Estimation (ReDDE). The second approach first evaluates the relevance of information sources with respect to each aspect of the query, and then ranks the sources based on the novelty and relevance that they offer. Both approaches can be applied with a wide range of existing resource selection algorithms such as ReDDE, CRCS, CORI and Big Document. Moreover, this paper proposes a learning based approach to combine multiple resource selection algorithms for result diversification, which can further improve the performance. We propose a set of new metrics for resource selection in federated search to evaluate the diversification performance of different approaches. To our best knowledge, this is the first piece of work that addresses the problem of search result diversification in federated search. The effectiveness of the proposed approaches has been demonstrated by an extensive set of experiments on the federated search testbed of the Clueweb dataset.
{"title":"Search result diversification in resource selection for federated search","authors":"Dzung Hong, Luo Si","doi":"10.1145/2484028.2484091","DOIUrl":"https://doi.org/10.1145/2484028.2484091","url":null,"abstract":"Prior research in resource selection for federated search mainly focused on selecting a small number of information sources that are most relevant to a user query. However, result novelty and diversification are largely unexplored, which does not reflect the various kinds of information needs of users in real world applications. This paper proposes two general approaches to model both result relevance and diversification in selecting sources, in order to provide more comprehensive coverage of multiple aspects of a user query. The first approach focuses on diversifying the document ranking on a centralized sample database before selecting information sources under the framework of Relevant Document Distribution Estimation (ReDDE). The second approach first evaluates the relevance of information sources with respect to each aspect of the query, and then ranks the sources based on the novelty and relevance that they offer. Both approaches can be applied with a wide range of existing resource selection algorithms such as ReDDE, CRCS, CORI and Big Document. Moreover, this paper proposes a learning based approach to combine multiple resource selection algorithms for result diversification, which can further improve the performance. We propose a set of new metrics for resource selection in federated search to evaluate the diversification performance of different approaches. To our best knowledge, this is the first piece of work that addresses the problem of search result diversification in federated search. The effectiveness of the proposed approaches has been demonstrated by an extensive set of experiments on the federated search testbed of the Clueweb dataset.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121999153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Query-biased search result summaries, or "snippets", help users decide whether a result is relevant for their information need, and have become increasingly important for helping searchers with difficult or ambiguous search tasks. Previously published snippet generation algorithms have been primarily based on selecting document fragments most similar to the query, which does not take into account which parts of the document the searchers actually found useful. We present a new approach to improving result summaries by incorporating post-click searcher behavior data, such as mouse cursor movements and scrolling over the result documents. To achieve this aim, we develop a method for collecting behavioral data with precise association between searcher intent, document examination behavior, and the corresponding document fragments. In turn, this allows us to incorporate page examination behavior signals into a novel Behavior-Biased Snippet generation system (BeBS). By mining searcher examination data, BeBS infers document fragments of most interest to users, and combines this evidence with text-based features to select the most promising fragments for inclusion in the result summary. Our extensive experiments and analysis demonstrate that our method improves the quality of result summaries compared to existing state-of-the-art methods. We believe that this work opens a new direction for improving search result presentation, and we make available the code and the search behavior data used in this study to encourage further research in this area.
{"title":"Improving search result summaries by using searcher behavior data","authors":"Mikhail S. Ageev, Dmitry Lagun, Eugene Agichtein","doi":"10.1145/2484028.2484093","DOIUrl":"https://doi.org/10.1145/2484028.2484093","url":null,"abstract":"Query-biased search result summaries, or \"snippets\", help users decide whether a result is relevant for their information need, and have become increasingly important for helping searchers with difficult or ambiguous search tasks. Previously published snippet generation algorithms have been primarily based on selecting document fragments most similar to the query, which does not take into account which parts of the document the searchers actually found useful. We present a new approach to improving result summaries by incorporating post-click searcher behavior data, such as mouse cursor movements and scrolling over the result documents. To achieve this aim, we develop a method for collecting behavioral data with precise association between searcher intent, document examination behavior, and the corresponding document fragments. In turn, this allows us to incorporate page examination behavior signals into a novel Behavior-Biased Snippet generation system (BeBS). By mining searcher examination data, BeBS infers document fragments of most interest to users, and combines this evidence with text-based features to select the most promising fragments for inclusion in the result summary. Our extensive experiments and analysis demonstrate that our method improves the quality of result summaries compared to existing state-of-the-art methods. We believe that this work opens a new direction for improving search result presentation, and we make available the code and the search behavior data used in this study to encourage further research in this area.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120945552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Wang, Yiqun Liu, Min Zhang, Shaoping Ma, Meihong Zheng, Jing Qian, Kuo Zhang
In modern search engines, an increasing number of search result pages (SERPs) are federated from multiple specialized search engines (called verticals, such as Image or Video). As an effective approach to interpret users' click-through behavior as feedback information, most click models were designed to reduce the position bias and improve ranking performance of ordinary search results, which have homogeneous appearances. However, when vertical results are combined with ordinary ones, significant differences in presentation may lead to user behavior biases and thus failure of state-of-the-art click models. With the help of a popular commercial search engine in China, we collected a large scale log data set which contains behavior information on both vertical and ordinary results. We also performed eye-tracking analysis to study user's real-world examining behavior. According these analysis, we found that different result appearances may cause different behavior biases both for vertical results (local effect) and for the whole result lists (global effect). These biases include: examine bias for vertical results (especially those with multimedia components), trust bias for result lists with vertical results, and a higher probability of result revisitation for vertical results. Based on these findings, a novel click model considering these biases besides position bias was constructed to describe interaction with SERPs containing verticals. Experimental results show that the new Vertical-aware Click Model (VCM) is better at interpreting user click behavior on federated searches in terms of both log-likelihood and perplexity than existing models.
{"title":"Incorporating vertical results into search click models","authors":"Chao Wang, Yiqun Liu, Min Zhang, Shaoping Ma, Meihong Zheng, Jing Qian, Kuo Zhang","doi":"10.1145/2484028.2484036","DOIUrl":"https://doi.org/10.1145/2484028.2484036","url":null,"abstract":"In modern search engines, an increasing number of search result pages (SERPs) are federated from multiple specialized search engines (called verticals, such as Image or Video). As an effective approach to interpret users' click-through behavior as feedback information, most click models were designed to reduce the position bias and improve ranking performance of ordinary search results, which have homogeneous appearances. However, when vertical results are combined with ordinary ones, significant differences in presentation may lead to user behavior biases and thus failure of state-of-the-art click models. With the help of a popular commercial search engine in China, we collected a large scale log data set which contains behavior information on both vertical and ordinary results. We also performed eye-tracking analysis to study user's real-world examining behavior. According these analysis, we found that different result appearances may cause different behavior biases both for vertical results (local effect) and for the whole result lists (global effect). These biases include: examine bias for vertical results (especially those with multimedia components), trust bias for result lists with vertical results, and a higher probability of result revisitation for vertical results. Based on these findings, a novel click model considering these biases besides position bias was constructed to describe interaction with SERPs containing verticals. Experimental results show that the new Vertical-aware Click Model (VCM) is better at interpreting user click behavior on federated searches in terms of both log-likelihood and perplexity than existing models.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114794224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ryen W. White, E. Yom-Tov, E. Horvitz, Eugene Agichtein, W. Hersh
This workshop brings together researchers and practitioners from industry and academia to discuss search and discovery in the medi-cal domain. The event focuses on ways to make medical and health information more accessible to laypeople (including enhancements to ranking algorithms and search interfaces), and how we can dis-cover new medical facts and phenomena from information sought online, as evidenced in query streams and other sources such as social media. This domain also offers many opportunities for appli-cations that monitor and improve quality of life of those affected by medical conditions, by providing tools to support their health-related information behavior.
{"title":"Workshop on health search and discovery: helping users and advancing medicine","authors":"Ryen W. White, E. Yom-Tov, E. Horvitz, Eugene Agichtein, W. Hersh","doi":"10.1145/2484028.2484220","DOIUrl":"https://doi.org/10.1145/2484028.2484220","url":null,"abstract":"This workshop brings together researchers and practitioners from industry and academia to discuss search and discovery in the medi-cal domain. The event focuses on ways to make medical and health information more accessible to laypeople (including enhancements to ranking algorithms and search interfaces), and how we can dis-cover new medical facts and phenomena from information sought online, as evidenced in query streams and other sources such as social media. This domain also offers many opportunities for appli-cations that monitor and improve quality of life of those affected by medical conditions, by providing tools to support their health-related information behavior.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115149662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roberto Konow, G. Navarro, C. Clarke, A. López-Ortiz
We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that our index uses about 20% less space, and performs queries up to three times faster, than state-of-the-art compact representations.
{"title":"Faster and smaller inverted indices with treaps","authors":"Roberto Konow, G. Navarro, C. Clarke, A. López-Ortiz","doi":"10.1145/2484028.2484088","DOIUrl":"https://doi.org/10.1145/2484028.2484088","url":null,"abstract":"We introduce a new representation of the inverted index that performs faster ranked unions and intersections while using less space. Our index is based on the treap data structure, which allows us to intersect/merge the document identifiers while simultaneously thresholding by frequency, instead of the costlier two-step classical processing methods. To achieve compression we represent the treap topology using compact data structures. Further, the treap invariants allow us to elegantly encode differentially both document identifiers and frequencies. Results show that our index uses about 20% less space, and performs queries up to three times faster, than state-of-the-art compact representations.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115504624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The purpose of the Contextual Suggestion track, an evaluation task at the TREC 2012 conference, is to suggest personalized tourist activities to an individual, given a certain location and time. In our content-based approach, we collected initial recommendations using the location context as search query in Google Places. We first ranked the recommendations based on their textual similarity to the user profiles. In order to improve the ranking of popular sights, we combined the initial ranking with rankings based on Google Search, popularity and categories. Finally, we performed filtering based on the temporal context. Overall, our system performed well above average and median, and outperformed the baseline - Google Places only -- run.
{"title":"Recommending personalized touristic sights using google places","authors":"Maya Sappelli, S. Verberne, Wessel Kraaij","doi":"10.1145/2484028.2484155","DOIUrl":"https://doi.org/10.1145/2484028.2484155","url":null,"abstract":"The purpose of the Contextual Suggestion track, an evaluation task at the TREC 2012 conference, is to suggest personalized tourist activities to an individual, given a certain location and time. In our content-based approach, we collected initial recommendations using the location context as search query in Google Places. We first ranked the recommendations based on their textual similarity to the user profiles. In order to improve the ranking of popular sights, we combined the initial ranking with rankings based on Google Search, popularity and categories. Finally, we performed filtering based on the temporal context. Overall, our system performed well above average and median, and outperformed the baseline - Google Places only -- run.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116182226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chaosheng Fan, Yanyan Lan, J. Guo, Zuoquan Lin, Xueqi Cheng
Recommender system has become an effective tool for information filtering, which usually provides the most useful items to users by a top-k ranking list. Traditional recommendation techniques such as Nearest Neighbors (NN) and Matrix Factorization (MF) have been widely used in real recommender systems. However, neither approaches can well accomplish recommendation task since that: (1) most NN methods leverage the neighbor's behaviors for prediction, which may suffer the severe data sparsity problem; (2) MF methods are less sensitive to sparsity, but neighbors' influences on latent factors are not fully explored, since the latent factors are often used independently. To overcome the above problems, we propose a new framework for recommender systems, called collaborative factorization. It expresses the user as the combination of his own factors and those of the neighbors', called collaborative latent factors, and a ranking loss is then utilized for optimization. The advantage of our approach is that it can both enjoy the merits of NN and MF methods. In this paper, we take the logistic loss in RankNet and the likelihood loss in ListMLE as examples, and the corresponding collaborative factorization methods are called CoF-Net and CoF-MLE. Our experimental results on three benchmark datasets show that they are more effective than several state-of-the-art recommendation methods.
{"title":"Collaborative factorization for recommender systems","authors":"Chaosheng Fan, Yanyan Lan, J. Guo, Zuoquan Lin, Xueqi Cheng","doi":"10.1145/2484028.2484176","DOIUrl":"https://doi.org/10.1145/2484028.2484176","url":null,"abstract":"Recommender system has become an effective tool for information filtering, which usually provides the most useful items to users by a top-k ranking list. Traditional recommendation techniques such as Nearest Neighbors (NN) and Matrix Factorization (MF) have been widely used in real recommender systems. However, neither approaches can well accomplish recommendation task since that: (1) most NN methods leverage the neighbor's behaviors for prediction, which may suffer the severe data sparsity problem; (2) MF methods are less sensitive to sparsity, but neighbors' influences on latent factors are not fully explored, since the latent factors are often used independently. To overcome the above problems, we propose a new framework for recommender systems, called collaborative factorization. It expresses the user as the combination of his own factors and those of the neighbors', called collaborative latent factors, and a ranking loss is then utilized for optimization. The advantage of our approach is that it can both enjoy the merits of NN and MF methods. In this paper, we take the logistic loss in RankNet and the likelihood loss in ListMLE as examples, and the corresponding collaborative factorization methods are called CoF-Net and CoF-MLE. Our experimental results on three benchmark datasets show that they are more effective than several state-of-the-art recommendation methods.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121558533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matthew Ekstrand-Abueg, Virgil Pavlu, Makoto P. Kato, T. Sakai, Takehiro Yamamoto, Mayu Iwata
Building test collections based on nuggets is useful evaluating systems that return documents, answers, or summaries. However, nugget construction requires a lot of manual work and is not feasible for large query sets. Towards an efficient and scalable nugget-based evaluation, we study the applicability of semi-automatic nugget extraction in the context of the ongoing NTCIR One Click Access (1CLICK) task. We compare manually-extracted and semi-automatically-extracted Japanese nuggets to demonstrate the coverage and efficiency of the semi-automatic nugget extraction. Our findings suggest that the manual nugget extraction can be replaced with a direct adaptation of the English semi-automatic nugget extraction system, especially for queries for which the user desires broad answers from free-form text.
{"title":"Exploring semi-automatic nugget extraction for Japanese one click access evaluation","authors":"Matthew Ekstrand-Abueg, Virgil Pavlu, Makoto P. Kato, T. Sakai, Takehiro Yamamoto, Mayu Iwata","doi":"10.1145/2484028.2484153","DOIUrl":"https://doi.org/10.1145/2484028.2484153","url":null,"abstract":"Building test collections based on nuggets is useful evaluating systems that return documents, answers, or summaries. However, nugget construction requires a lot of manual work and is not feasible for large query sets. Towards an efficient and scalable nugget-based evaluation, we study the applicability of semi-automatic nugget extraction in the context of the ongoing NTCIR One Click Access (1CLICK) task. We compare manually-extracted and semi-automatically-extracted Japanese nuggets to demonstrate the coverage and efficiency of the semi-automatic nugget extraction. Our findings suggest that the manual nugget extraction can be replaced with a direct adaptation of the English semi-automatic nugget extraction system, especially for queries for which the user desires broad answers from free-form text.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129463492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shengxian Wan, Yanyan Lan, J. Guo, Chaosheng Fan, Xueqi Cheng
It is well recognized that users rely on social media (e.g. Twitter or Digg) to fulfill two common needs (i.e. social need and informational need) that is to keep in touch with their friends in the real world and to have access to information they are interested in. Traditional friend recommendation methods in social media mainly focus on a user's social need, but seldom address their informational need (i.e. suggesting friends that can provide information one may be interested in but have not been able to obtain so far). In this paper, we propose to recommend friends according to the informational utility, which stands for the degree to which a friend satisfies the target user's unfulfilled informational need, called informational friend recommendation. In order to capture users' informational need, we view a post in social media as an item and utilize collaborative filtering techniques to predict the rating for each post. The candidate friends are then ranked according to their informational utility for recommendation. In addition, we also show how to further consider diversity in such recommendations. Experiments on benchmark datasets demonstrate that our approach can significantly outperform the traditional friend recommendation methods under informational evaluation measures.
{"title":"Informational friend recommendation in social media","authors":"Shengxian Wan, Yanyan Lan, J. Guo, Chaosheng Fan, Xueqi Cheng","doi":"10.1145/2484028.2484179","DOIUrl":"https://doi.org/10.1145/2484028.2484179","url":null,"abstract":"It is well recognized that users rely on social media (e.g. Twitter or Digg) to fulfill two common needs (i.e. social need and informational need) that is to keep in touch with their friends in the real world and to have access to information they are interested in. Traditional friend recommendation methods in social media mainly focus on a user's social need, but seldom address their informational need (i.e. suggesting friends that can provide information one may be interested in but have not been able to obtain so far). In this paper, we propose to recommend friends according to the informational utility, which stands for the degree to which a friend satisfies the target user's unfulfilled informational need, called informational friend recommendation. In order to capture users' informational need, we view a post in social media as an item and utilize collaborative filtering techniques to predict the rating for each post. The candidate friends are then ranked according to their informational utility for recommendation. In addition, we also show how to further consider diversity in such recommendations. Experiments on benchmark datasets demonstrate that our approach can significantly outperform the traditional friend recommendation methods under informational evaluation measures.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131202271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rishabh Mehrotra, S. Sanner, Wray L. Buntine, Lexing Xie
Twitter, or the world of 140 characters poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.
{"title":"Improving LDA topic models for microblogs via tweet pooling and automatic labeling","authors":"Rishabh Mehrotra, S. Sanner, Wray L. Buntine, Lexing Xie","doi":"10.1145/2484028.2484166","DOIUrl":"https://doi.org/10.1145/2484028.2484166","url":null,"abstract":"Twitter, or the world of 140 characters poses serious challenges to the efficacy of topic models on short, messy text. While topic models such as Latent Dirichlet Allocation (LDA) have a long history of successful application to news articles and academic abstracts, they are often less coherent when applied to microblog content like Twitter. In this paper, we investigate methods to improve topics learned from Twitter content without modifying the basic machinery of LDA; we achieve this through various pooling schemes that aggregate tweets in a data preprocessing step for LDA. We empirically establish that a novel method of tweet pooling by hashtags leads to a vast improvement in a variety of measures for topic coherence across three diverse Twitter datasets in comparison to an unmodified LDA baseline and a variety of pooling schemes. An additional contribution of automatic hashtag labeling further improves on the hashtag pooling results for a subset of metrics. Overall, these two novel schemes lead to significantly improved LDA topic models on Twitter content.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124888841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}