Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment, or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but with broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph during the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest, and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three real-world datasets.
{"title":"Graph-based text classification: learn from your neighbors","authors":"Ralitsa Angelova, G. Weikum","doi":"10.1145/1148170.1148254","DOIUrl":"https://doi.org/10.1145/1148170.1148254","url":null,"abstract":"Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph in the run of the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131883780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe yield a 30% improvement in query throughput when tested on an eight-node parallel computer system.
{"title":"Load balancing for term-distributed parallel retrieval","authors":"Alistair Moffat, William Webber, J. Zobel","doi":"10.1145/1148170.1148232","DOIUrl":"https://doi.org/10.1145/1148170.1148232","url":null,"abstract":"Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query throughput rates, parallel systems are used, in which the document and index data are split across tightly-clustered distributed computing systems. The index data can be distributed either by document or by term. In this paper we examine methods for load balancing in term-distributed parallel architectures, and propose a suite of techniques for reducing net querying costs. In combination, the techniques we describe allow a 30% improvement in query throughput when tested on an eight-node parallel computer system.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132239434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Content-targeted advertising, the task of automatically associating advertisements (ads) with a Web page, is a key Web monetization strategy today. It also introduces challenging technical problems and raises interesting questions. For instance, how should one design ranking functions able to satisfy conflicting goals, such as selecting ads that are relevant to users while remaining suitable and profitable for publishers and advertisers? In this paper we propose a new framework for associating ads with web pages based on Genetic Programming (GP). Our GP method aims to learn functions that select the most appropriate ads given the contents of a Web page. These ranking functions are designed to optimize overall precision and minimize the number of misplacements. Using a real ad collection and web pages from a newspaper, we obtained a gain of 61.7% in average precision over a state-of-the-art baseline method. Further, by evolving individuals that provide good ranking estimations, GP was able to discover ranking functions that are very effective at placing relevant ads on web pages while avoiding irrelevant ones.
{"title":"Learning to advertise","authors":"A. Lacerda, Marco Cristo, Marcos André Gonçalves, Weiguo Fan, N. Ziviani, B. Ribeiro-Neto","doi":"10.1145/1148170.1148265","DOIUrl":"https://doi.org/10.1145/1148170.1148265","url":null,"abstract":"Content-targeted advertising, the task of automatically associating ads to a Web page, constitutes a key Web monetization strategy nowadays. Further, it introduces new challenging technical problems and raises interesting questions. For instance, how to design ranking functions able to satisfy conflicting goals such as selecting advertisements (ads) that are relevant to the users and suitable and profitable to the publishers and advertisers? In this paper we propose a new framework for associating ads with web pages based on Genetic Programming (GP). Our GP method aims at learning functions that select the most appropriate ads, given the contents of a Web page. These ranking functions are designed to optimize overall precision and minimize the number of misplacements. By using a real ad collection and web pages from a newspaper, we obtained a gain over a state-of-the-art baseline method of 61.7% in average precision. Further, by evolving individuals to provide good ranking estimations, GP was able to discover ranking functions that are very effective in placing ads in web pages while avoiding irrelevant ones.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"223 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134455952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the relationship between training set size and the parameter k for the k-Nearest Neighbors (kNN) classifier. When few examples are available, we observe that accuracy is sensitive to k and that the best k tends to increase with training size. We explore the subsequent risk that a k tuned on partitions of the data will be suboptimal after the partitions are aggregated and the classifier is re-trained. This risk is most severe when little data is available. For larger training sizes, accuracy becomes increasingly stable with respect to k, and the risk decreases.
{"title":"An analysis of the coupling between training set and neighborhood sizes for the kNN classifier","authors":"J. S. Olsson","doi":"10.1145/1148170.1148317","DOIUrl":"https://doi.org/10.1145/1148170.1148317","url":null,"abstract":"We consider the relationship between training set size and the parameter k for the k-Nearest Neighbors (kNN) classifier. When few examples are available, we observe that accuracy is sensitive to k and that best k tends to increase with training size. We explore the subsequent risk that k tuned on partitions will be suboptimal after aggregation and re-training. This risk is found to be most severe when little data is available. For larger training sizes, accuracy becomes increasingly stable with respect to k and the risk decreases.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125112575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data fusion is the combination of the results of independent searches on a document collection into a single output result set. It has been shown in the past that this can greatly improve retrieval effectiveness over that of the individual results. This paper presents probFuse, a probabilistic approach to data fusion. ProbFuse assumes that the performance of the individual input systems on a number of training queries is indicative of their future performance. The fused result set is based on probabilities of relevance calculated during this training process. Retrieval experiments using data from the TREC ad hoc collection demonstrate that probFuse achieves results superior to those of the popular CombMNZ fusion algorithm.
{"title":"ProbFuse: a probabilistic approach to data fusion","authors":"David Lillis, F. Toolan, Rem W. Collier, J. Dunnion","doi":"10.1145/1148170.1148197","DOIUrl":"https://doi.org/10.1145/1148170.1148197","url":null,"abstract":"Data fusion is the combination of the results of independent searches on a document collection into one single output result set. It has been shown in the past that this can greatly improve retrieval effectiveness over that of the individual results.This paper presents probFuse, a probabilistic approach to data fusion. ProbFuse assumes that the performance of the individual input systems on a number of training queries is indicative of their future performance. The fused result set is based on probabilities of relevance calculated during this training process. Retrieval experiments using data from the TREC ad hoc collection demonstrate that probFuse achieves results superior to that of the popular CombMNZ fusion algorithm.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116474563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a method to rate the credibility of news articles using three clues: (1) the commonality of article content among different news publishers; (2) agreement or contradiction among the numerical values reported in the articles; and (3) objectivity, based on subjective or speculative phrases and the news sources cited. We tested this method on news stories taken from seven different news sites on the Web. The average agreement between the system-produced "credibility" and the manual judgments of three human assessors on the 52 sample articles was 69.1%. The limitations of the current approach and future directions are discussed.
{"title":"A method of rating the credibility of news documents on the web","authors":"Ryosuke Nagura, Yohei Seki, N. Kando, Masaki Aono","doi":"10.1145/1148170.1148316","DOIUrl":"https://doi.org/10.1145/1148170.1148316","url":null,"abstract":"We propose a method to rate the credibility of news articles using three clues: (1) commonality of the contents of articles among different news publishers; (2) numerical agreement versus contradiction of numerical values reported in the articles; and (3) objectivity based on subjective speculative phrases and news sources. We tested this method on news stories taken from seven different news sites on the Web. The average agreement between the system-produced \"credibility\" and the manual judgments of three human assessors on the 52 sample articles was 69.1%. The limitations of the current approach and future directions are discussed.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115031089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Early speech retrieval experiments focused on news broadcasts, for which adequate Automatic Speech Recognition (ASR) accuracy could be obtained. Like newspapers, news broadcasts are a manually selected and arranged set of stories. Evaluation designs reflected this, using known story boundaries as the basis for evaluation. Substantial advances in ASR accuracy now make it possible to build search systems for some types of spontaneous conversational speech, but present evaluation designs continue to rely on known topic boundaries that are no longer well matched to the nature of the materials. We propose a new class of measures for speech retrieval based on manual annotation of the points at which a user with specific topical interests would wish replay to begin.
{"title":"One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech","authors":"Baolong Liu, Douglas W. Oard","doi":"10.1145/1148170.1148311","DOIUrl":"https://doi.org/10.1145/1148170.1148311","url":null,"abstract":"Early speech retrieval experiments focused on news broadcasts, for which adequate Automatic Speech Recognition (ASR) accuracy could be obtained. Like newspapers, news broadcasts are a manually selected and arranged set of stories. Evaluation designs reflected that, using known story boundaries as a basis for evaluation. Substantial advances in ASR accuracy now make it possible to build search systems for some types of spontaneous conversational speech, but present evaluation designs continue to rely on known topic boundaries that are no longer well matched to the nature of the materials. We propose a new class of measures for speech retrieval based on manual annotation of points at which a user with specific topical interests would wish replay to begin.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115513761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present an extensible Java-based platform for contextual retrieval based on the probabilistic information retrieval model. Modules for dual indexes, relevance feedback using blind or machine-learning approaches, and query expansion with context are integrated into the Okapi system to handle contextual information. The platform allows easy extension to include other types of contextual information.
{"title":"A platform for Okapi-based contextual information retrieval","authors":"Xiangji Huang, M. Wen, Aijun An, Y. Huang","doi":"10.1145/1148170.1148341","DOIUrl":"https://doi.org/10.1145/1148170.1148341","url":null,"abstract":"We present an extensible java-based platform for contextual retrieval based on the probabilistic information retrieval model. Modules for dual indexes, relevance feedback with blind or machine learning approaches and query expansion with context are integrated into the Okapi system to deal with the contextual information. This platform allows easy extension to include other types of contextual information.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114918358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One challenging problem in biomedical text retrieval is finding accurate synonyms or name variants for biomedical entities. In this paper, we propose a new concept-based approach to tackle this problem. In this approach, a set of concepts, rather than keywords, is first extracted from a query; these concepts are then used for retrieval. The experimental results show that the proposed approach boosts retrieval performance and produces very good results on the 2005 TREC Genomics data sets.
{"title":"Concept-based biomedical text retrieval","authors":"Ming Zhong, Xiangji Huang","doi":"10.1145/1148170.1148336","DOIUrl":"https://doi.org/10.1145/1148170.1148336","url":null,"abstract":"One challenging problem for biomedical text retrieval is to find accurate synonyms or name variants for biomedical entities. In this paper, we propose a new concept-based approach to tackle this problem. In this approach, a set of concepts instead of keywords will be extracted from a query first. Then these concepts will be used for retrieval purpose. The experiment results show that the proposed approach can boost the retrieval performance and it generates very good results on 2005 TREC Genomics data sets.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123930759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Technical professionals spend approximately 25% of their time at work searching for information, and have specialized information needs that are not well served by generic enterprise search tools. In this study, we investigated how a group of software engineers use a workplace search system. We identify patterns of search behaviour specific to this group and distinct from general web and intranet search patterns, and make design recommendations for search systems that will better serve the needs of this group.
{"title":"Enterprise search behaviour of software engineers","authors":"Luanne Freund, Elaine Toms","doi":"10.1145/1148170.1148297","DOIUrl":"https://doi.org/10.1145/1148170.1148297","url":null,"abstract":"Technical professionals spend ~25% of their time at work searching for information, and have specialized information needs that are not well-served by generic enterprise search tools. In this study, we investigated how a group of software engineers use a workplace search system. We identify patterns of search behaviour specific to this group and distinct from general web and intranet search patterns, and make design recommendations for search systems that will better serve the needs of this group.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128896805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}