Emails are examples of structured documents with various fields. These fields can be exploited to enhance the retrieval effectiveness of an Information Retrieval (IR) system that mailing list archives. In recent experiments of the TREC2005 Enterprise track, various fields were applied to varying degrees of success by the participants. In his work, using a field-based weighting model, we investigate the retrieval performance attainable by each field, and examine when fields evidence should be combined or not.
{"title":"Combining fields in known-item email search","authors":"C. Macdonald, I. Ounis","doi":"10.1145/1148170.1148312","DOIUrl":"https://doi.org/10.1145/1148170.1148312","url":null,"abstract":"Emails are examples of structured documents with various fields. These fields can be exploited to enhance the retrieval effectiveness of an Information Retrieval (IR) system that mailing list archives. In recent experiments of the TREC2005 Enterprise track, various fields were applied to varying degrees of success by the participants. In his work, using a field-based weighting model, we investigate the retrieval performance attainable by each field, and examine when fields evidence should be combined or not.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128811426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pseudo-relevance feedback has proven to be an effective strategy for improving retrieval accuracy in all retrieval models. However the performance of existing pseudo feedback methods is often affected significantly by some parameters, such as the number of feedback documents to use and the relative weight of original query terms; these parameters generally have to be set by trial-and-error without any guidance. In this paper, we present a more robust method for pseudo feedback based on statistical language models. Our main idea is to integrate the original query with feedback documents in a single probabilistic mixture model and regularize the estimation of the language model parameters in the model so that the information in the feedback documents can be gradually added to the original query. Unlike most existing feedback methods, our new method has no parameter to tune. Experiment results on two representative data sets show that the new method is significantly more robust than a state-of-the-art baseline language modeling approach for feedback with comparable or better retrieval accuracy.
{"title":"Regularized estimation of mixture models for robust pseudo-relevance feedback","authors":"Tao Tao, ChengXiang Zhai","doi":"10.1145/1148170.1148201","DOIUrl":"https://doi.org/10.1145/1148170.1148201","url":null,"abstract":"Pseudo-relevance feedback has proven to be an effective strategy for improving retrieval accuracy in all retrieval models. However the performance of existing pseudo feedback methods is often affected significantly by some parameters, such as the number of feedback documents to use and the relative weight of original query terms; these parameters generally have to be set by trial-and-error without any guidance. In this paper, we present a more robust method for pseudo feedback based on statistical language models. Our main idea is to integrate the original query with feedback documents in a single probabilistic mixture model and regularize the estimation of the language model parameters in the model so that the information in the feedback documents can be gradually added to the original query. Unlike most existing feedback methods, our new method has no parameter to tune. Experiment results on two representative data sets show that the new method is significantly more robust than a state-of-the-art baseline language modeling approach for feedback with comparable or better retrieval accuracy.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117232533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Web search multitasking study based on automatic task session detection procedure is described. The results of the study: 1) multitasking is very rare, 2) it usually covers only 2 task sessions, 3) it is frequently formed into a temporal inclusion of an interrupting task session into the interrupted session, 4) the quantitative characteristics of multitasking greatly differ from the characteristics of sequential execution of one and several tasks. A searcher minimizes task switching costs: he avoids multitasking and while multitasking he uses cheapest manner of task switching.
{"title":"An exploratory web log study of multitasking","authors":"Nikolai Buzikashvili","doi":"10.1145/1148170.1148286","DOIUrl":"https://doi.org/10.1145/1148170.1148286","url":null,"abstract":"The Web search multitasking study based on automatic task session detection procedure is described. The results of the study: 1) multitasking is very rare, 2) it usually covers only 2 task sessions, 3) it is frequently formed into a temporal inclusion of an interrupting task session into the interrupted session, 4) the quantitative characteristics of multitasking greatly differ from the characteristics of sequential execution of one and several tasks. A searcher minimizes task switching costs: he avoids multitasking and while multitasking he uses cheapest manner of task switching.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122634679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-based methods for collaborative filtering predict new ratings by averaging (weighted) ratings between, respectively, pairs of similar users or items. In practice, a large number of ratings from similar users or similar items are not available, due to the sparsity inherent to rating data. Consequently, prediction quality can be poor. This paper re-formulates the memory-based collaborative filtering problem in a generative probabilistic framework, treating individual user-item ratings as predictors of missing ratings. The final rating is estimated by fusing predictions from three sources: predictions based on ratings of the same item by other users, predictions based on different item ratings made by the same user, and, third, ratings predicted based on data from other but similar users rating other but similar items. Existing user-based and item-based approaches correspond to the two simple cases of our framework. The complete model is however more robust to data sparsity, because the different types of ratings are used in concert, while additional ratings from similar users towards similar items are employed as a background model to smooth the predictions. Experiments demonstrate that the proposed methods are indeed more robust against data sparsity and give better recommendations.
{"title":"Unifying user-based and item-based collaborative filtering approaches by similarity fusion","authors":"Jun Wang, A. D. Vries, M. Reinders","doi":"10.1145/1148170.1148257","DOIUrl":"https://doi.org/10.1145/1148170.1148257","url":null,"abstract":"Memory-based methods for collaborative filtering predict new ratings by averaging (weighted) ratings between, respectively, pairs of similar users or items. In practice, a large number of ratings from similar users or similar items are not available, due to the sparsity inherent to rating data. Consequently, prediction quality can be poor. This paper re-formulates the memory-based collaborative filtering problem in a generative probabilistic framework, treating individual user-item ratings as predictors of missing ratings. The final rating is estimated by fusing predictions from three sources: predictions based on ratings of the same item by other users, predictions based on different item ratings made by the same user, and, third, ratings predicted based on data from other but similar users rating other but similar items. Existing user-based and item-based approaches correspond to the two simple cases of our framework. The complete model is however more robust to data sparsity, because the different types of ratings are used in concert, while additional ratings from similar users towards similar items are employed as a background model to smooth the predictions. Experiments demonstrate that the proposed methods are indeed more robust against data sparsity and give better recommendations.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117081401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated documents are different from that of both "almost-identical" documents in the data cleaning task and "relevant" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Gathered from several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and as about as effective as human assessors.
{"title":"Near-duplicate detection by instance-level constrained clustering","authors":"G. Yang, Jamie Callan","doi":"10.1145/1148170.1148243","DOIUrl":"https://doi.org/10.1145/1148170.1148243","url":null,"abstract":"For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information retrieval community are not sufficiently accurate. This is due to the fact that the characteristics of near-duplicated documents are different from that of both \"almost-identical\" documents in the data cleaning task and \"relevant\" documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Gathered from several collections of public comments sent to U.S. government agencies on proposed new regulations, the experimental results demonstrate that our approach outperforms other near-duplicate detection algorithms and as about as effective as human assessors.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115290384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on a method for the stylistic segmentation of text documents. Our technique involves mapping the change in a feature throughout a text. We use the linguistic features of conjunction and modality, through taxonomies from Systemic Functional Linguistics. This segmentation has applications in automated summarization, particularly of large documents.
{"title":"Stylistic text segmentation","authors":"Paul J. Chase, S. Argamon","doi":"10.1145/1148170.1148291","DOIUrl":"https://doi.org/10.1145/1148170.1148291","url":null,"abstract":"This paper focuses on a method for the stylistic segmentation of text documents. Our technique involves mapping the change in a feature throughout a text. We use the linguistic features of conjunction and modality, through taxonomies from Systemic Functional Linguistics. This segmentation has applications in automated summarization, particularly of large documents.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"38 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130758783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the effect of user supplied relevance feedback in improving web search results. Rather than using query refinement or document similarity measures to rerank results, we show that the web-graph distance between two documents is a robust measure of their relative relevancy. We demonstrate how the use of this metric can improve the rankings of result URLs, even when the user only rates one document in the dataset. Our research suggests that such interactive systems can significantly improve search results.
{"title":"Using web-graph distance for relevance feedback in web search","authors":"Sergei Vassilvitskii, Eric Brill","doi":"10.1145/1148170.1148199","DOIUrl":"https://doi.org/10.1145/1148170.1148199","url":null,"abstract":"We study the effect of user supplied relevance feedback in improving web search results. Rather than using query refinement or document similarity measures to rerank results, we show that the web-graph distance between two documents is a robust measure of their relative relevancy. We demonstrate how the use of this metric can improve the rankings of result URLs, even when the user only rates one document in the dataset. Our research suggests that such interactive systems can significantly improve search results.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128344571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated singer identification is important in organising, browsing and retrieving data in large music databases. In this paper, we propose a novel scheme, called Hybrid Singer Identifier (HSI), for automated singer recognition. HSI can effectively use multiple low-level features extracted from both vocal and non-vocal music segments to enhance the identification process with a hybrid architecture and build profiles of individual singer characteristics based on statistical mixture models. Extensive experimental results conducted on a large music database demonstrate the superiority of our method over state-of-the-art approaches.
{"title":"Towards efficient automated singer identification in large music databases","authors":"Jialie Shen, B. Cui, J. Shepherd, K. Tan","doi":"10.1145/1148170.1148184","DOIUrl":"https://doi.org/10.1145/1148170.1148184","url":null,"abstract":"Automated singer identification is important in organising, browsing and retrieving data in large music databases. In this paper, we propose a novel scheme, called Hybrid Singer Identifier (HSI), for automated singer recognition. HSI can effectively use multiple low-level features extracted from both vocal and non-vocal music segments to enhance the identification process with a hybrid architecture and build profiles of individual singer characteristics based on statistical mixture models. Extensive experimental results conducted on a large music database demonstrate the superiority of our method over state-of-the-art approaches.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131571238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Question classification has become a crucial step in modern question answering systems. Previous work has demonstrated the effectiveness of statistical machine learning approaches to this problem. This paper presents a new approach to building a question classifier using log-linear models. Evidence from a rich and diverse set of syntactic and semantic features is evaluated, as well as approaches which exploit the hierarchical structure of the question classes.
{"title":"Question classification with log-linear models","authors":"Phil Blunsom, K. Kocik, J. Curran","doi":"10.1145/1148170.1148282","DOIUrl":"https://doi.org/10.1145/1148170.1148282","url":null,"abstract":"Question classification has become a crucial step in modern question answering systems. Previous work has demonstrated the effectiveness of statistical machine learning approaches to this problem. This paper presents a new approach to building a question classifier using log-linear models. Evidence from a rich and diverse set of syntactic and semantic features is evaluated, as well as approaches which exploit the hierarchical structure of the question classes.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126767279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Milad Shokouhi, J. Zobel, Falk Scholer, S. Tahaghoghi
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate that previous methods, and are more efficient in use of resources required to perform the estimation.
{"title":"Capturing collection size for distributed non-cooperative retrieval","authors":"Milad Shokouhi, J. Zobel, Falk Scholer, S. Tahaghoghi","doi":"10.1145/1148170.1148227","DOIUrl":"https://doi.org/10.1145/1148170.1148227","url":null,"abstract":"Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate that previous methods, and are more efficient in use of resources required to perform the estimation.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124630477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}