
Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval: Latest Publications

On-line spam filter fusion
T. Lynam, G. Cormack, D. Cheriton
We show that a set of independently developed spam filters may be combined in simple ways to provide substantially better filtering than any of the individual filters. The results of fifty-three spam filters evaluated at the TREC 2005 Spam Track were combined post-hoc so as to simulate the parallel on-line operation of the filters. The combined results were evaluated using the TREC methodology, yielding more than a factor of two improvement over the best filter. The simplest method -- averaging the binary classifications returned by the individual filters -- yields a remarkably good result. A new method -- averaging log-odds estimates based on the scores returned by the individual filters -- yields a somewhat better result, and provides input to SVM- and logistic-regression-based stacking methods. The stacking methods appear to provide further improvement, but only for very large corpora. Of the stacking methods, logistic regression yields the better result. Finally, we show that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
DOI: 10.1145/1148170.1148195 | Published: 2006-08-06
Citations: 63
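The two simplest fusion rules described in the abstract — averaging the binary classifications and averaging per-filter log-odds — can be sketched in a few lines. The probability values and the clamping epsilon below are illustrative assumptions, not values from the paper:

```python
import math

def log_odds(p, eps=1e-6):
    """Convert a spam probability into log-odds, clamped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def fuse_scores(filter_probs):
    """Average the per-filter log-odds; a positive result means spam."""
    return sum(log_odds(p) for p in filter_probs) / len(filter_probs)

def fuse_votes(filter_labels):
    """Simplest method from the abstract: average the binary votes."""
    return sum(filter_labels) / len(filter_labels) >= 0.5

# Three hypothetical filters score one message:
print(fuse_scores([0.9, 0.8, 0.4]) > 0)   # True -> classified as spam
print(fuse_votes([1, 1, 0]))              # True -> majority says spam
```

Log-odds averaging lets a confident filter outweigh two lukewarm ones, which voting cannot do; the abstract reports it performing somewhat better than vote averaging.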
Feature diversity in cluster ensembles for robust document clustering
Xavier Sevillano, Germán Cobo, Francesc Alías, J. Socoró
The performance of document clustering systems depends on employing optimal text representations, which are not only difficult to determine beforehand, but also may vary from one clustering problem to another. As a first step towards building robust document clusterers, a strategy based on feature diversity and cluster ensembles is presented in this work. Experiments conducted on a binary clustering problem show that our method is robust to near-optimal model order selection and able to detect constructive interactions between different document representations in the test bed.
DOI: 10.1145/1148170.1148323 | Published: 2006-08-06
Citations: 18
Type less, find more: fast autocompletion search with a succinct index
H. Bast, Ingmar Weber
We consider the following full-text search autocompletion feature. Imagine a user of a search engine typing a query. Then with every letter being typed, we would like an instant display of completions of the last query word which would lead to good hits. At the same time, the best hits for any of these completions should be displayed. Known indexing data structures that apply to this problem either incur large processing times for a substantial class of queries, or they use a lot of space. We present a new indexing data structure that uses no more space than a state-of-the-art compressed inverted index, but with 10 times faster query processing times. Even on the large TREC Terabyte collection, which comprises over 25 million documents, we achieve, on a single machine and with the index on disk, average response times of one tenth of a second. We have built a full-fledged, interactive search engine that realizes the proposed autocompletion feature combined with support for proximity search, semi-structured (XML) text, subword and phrase completion, and semantic tags.
DOI: 10.1145/1148170.1148234 | Published: 2006-08-06
Citations: 220
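The core retrieval primitive — finding every completion of the word prefix typed so far — can be illustrated with a plain sorted vocabulary and binary search. This is a toy stand-in for the paper's succinct index, not its actual data structure; the class and word list are hypothetical:

```python
from bisect import bisect_left, bisect_right

class PrefixIndex:
    """Toy autocompletion over a sorted vocabulary: all completions of a
    prefix occupy a contiguous range of the sorted word list, which two
    binary searches locate in O(log n)."""

    def __init__(self, words):
        self.words = sorted(words)

    def completions(self, prefix):
        lo = bisect_left(self.words, prefix)
        # Any word starting with `prefix` sorts below prefix + U+FFFF.
        hi = bisect_right(self.words, prefix + "\uffff")
        return self.words[lo:hi]

idx = PrefixIndex(["search", "seaside", "semantic", "index", "indexing"])
print(idx.completions("se"))     # ['search', 'seaside', 'semantic']
print(idx.completions("index"))  # ['index', 'indexing']
```

The hard part the paper addresses — returning the best *hits* for each completion within tight space bounds — is exactly what this toy skips.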
Automated performance assessment in interactive QA
J. Chai, Tyler Baldwin, Chen Zhang
In interactive question answering (QA), users and systems take turns to ask questions and provide answers. In such an interactive setting, user questions largely depend on the answers provided by the system. One question is whether user follow-up questions can provide feedback for the system to automatically assess its performance (e.g., assess whether a correct answer is delivered). This self-awareness can make QA systems more intelligent for information seeking, for example, by adapting better strategies to cope with problematic situations. Therefore, this paper describes our initial investigation in addressing this problem. Our results indicate that interaction context can provide useful cues for automated performance assessment in interactive QA.
DOI: 10.1145/1148170.1148290 | Published: 2006-08-06
Citations: 2
Term proximity scoring for ad-hoc retrieval on very large text collections
Stefan Büttcher, C. Clarke, Brad Lushman
We propose an integration of term proximity scoring into Okapi BM25. The relative retrieval effectiveness of our retrieval method, compared to pure BM25, varies from collection to collection. We present an experimental evaluation of our method and show that the gains it achieves over BM25 grow as the size of the underlying text collection increases. We also show that the impact of term proximity scoring is larger for stemmed queries than for unstemmed queries.
DOI: 10.1145/1148170.1148285 | Published: 2006-08-06
Citations: 183
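A minimal sketch of the idea: standard BM25 plus a bonus that decays quadratically with the distance between occurrences of distinct query terms. The bonus shape and the `k_prox` weight are illustrative assumptions, not the paper's exact integration:

```python
import math
from collections import Counter

def bm25_with_proximity(query, doc, df, n_docs, avgdl,
                        k1=1.2, b=0.75, k_prox=1.0):
    """BM25 over `doc` (a token list) plus a term-proximity bonus."""
    tf = Counter(doc)
    dl = len(doc)
    score = 0.0
    for t in set(query):
        if tf[t] == 0:
            continue
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
    # Proximity bonus: each pair of occurrences of *different* query
    # terms contributes 1 / distance^2, so adjacent terms dominate.
    pos = {t: [i for i, w in enumerate(doc) if w == t] for t in set(query)}
    prox = 0.0
    for t1 in pos:
        for t2 in pos:
            if t1 < t2:  # count each unordered term pair once
                prox += sum(1.0 / (i - j) ** 2
                            for i in pos[t1] for j in pos[t2])
    return score + k_prox * prox

doc = "the spam filter blocks spam today".split()
df = {"spam": 10, "filter": 5}           # hypothetical document frequencies
s = bm25_with_proximity(["spam", "filter"], doc, df, n_docs=100, avgdl=6)
```

Setting `k_prox=0` reduces the function to plain BM25, so the proximity component can be ablated directly.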
A statistical method for system evaluation using incomplete judgments
J. Aslam, Virgil Pavlu, Emine Yilmaz
We consider the problem of large-scale retrieval evaluation, and we propose a statistical method for evaluating retrieval systems using incomplete judgments. Unlike existing techniques that (1) rely on effectively complete, and thus prohibitively expensive, relevance judgment sets, (2) produce biased estimates of standard performance measures, or (3) produce estimates of non-standard measures thought to be correlated with these standard measures, our proposed statistical technique produces unbiased estimates of the standard measures themselves. Our proposed technique is based on random sampling. While our estimates are unbiased by statistical design, their variance depends on the sampling distribution employed; as such, we derive a sampling distribution likely to yield low-variance estimates. We test our proposed technique using benchmark TREC data, demonstrating that a sampling pool derived from a set of runs can be used to efficiently and effectively evaluate those runs. We further show that these sampling pools generalize well to unseen runs. Our experiments indicate that highly accurate estimates of standard performance measures can be obtained using a number of relevance judgments as small as 4% of the typical TREC-style judgment pool.
DOI: 10.1145/1148170.1148263 | Published: 2006-08-06
Citations: 185
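The sampling idea can be illustrated with the simplest case: estimating precision at k from a uniform random sample of judged documents. This sketch uses uniform sampling only — the paper derives a non-uniform distribution to cut variance — and the function and data names are hypothetical:

```python
import random

def sampled_precision_at_k(ranking, qrels, k, sample_size, seed=0):
    """Estimate precision@k by judging only a random sample of the top-k
    documents. `qrels` maps doc id -> 1 (relevant) or 0 (non-relevant);
    unjudged documents are treated as non-relevant."""
    pool = ranking[:k]
    sample = random.Random(seed).sample(pool, min(sample_size, len(pool)))
    return sum(qrels.get(d, 0) for d in sample) / len(sample)

ranking = [f"d{i}" for i in range(1, 11)]
qrels = {"d1": 1, "d3": 1, "d5": 1}   # 3 of the top 10 are relevant
est = sampled_precision_at_k(ranking, qrels, k=10, sample_size=5)
```

Because each top-k document is equally likely to be drawn, the sample mean is an unbiased estimator of precision@10; judging the whole pool recovers the exact value.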
Context-sensitive semantic smoothing for the language modeling approach to genomic IR
Xiaohua Zhou, Xiaohua Hu, Xiaodan Zhang, Xia Lin, I. Song
Semantic smoothing, which incorporates synonym and sense information into the language models, is effective and potentially significant for improving retrieval performance. Implemented semantic smoothing models, such as the translation model, which statistically maps document terms to query terms, and a number of works that have followed have shown good experimental results. However, these models are unable to incorporate contextual information, so the resulting translation may be mixed and fairly general. To overcome this limitation, we propose a novel context-sensitive semantic smoothing method that decomposes a document or a query into a set of weighted context-sensitive topic signatures and then translates those topic signatures into query terms. In detail, we address this problem by (1) choosing concept pairs as topic signatures and adopting an ontology-based approach to extract concept pairs; (2) estimating the translation model for each topic signature using the EM algorithm; and (3) expanding document and query models based on topic signature translations. The new smoothing method is evaluated on the TREC 2004/05 Genomics Track collections and significant improvements are obtained. The MAP (mean average precision) achieves a 33.6% maximal gain over the simple language model, as well as a 7.8% gain over the language model with context-insensitive semantic smoothing.
DOI: 10.1145/1148170.1148203 | Published: 2006-08-06
Citations: 49
LDA-based document models for ad-hoc retrieval
Xing Wei, W. Bruce Croft
Search algorithms incorporating some form of topic model have a long history in information retrieval. For example, cluster-based retrieval has been studied since the 1960s and has recently produced good results in the language model framework. An approach to building topic models based on a formal generative model of documents, Latent Dirichlet Allocation (LDA), is heavily cited in the machine learning literature, but its feasibility and effectiveness in information retrieval are mostly unknown. In this paper, we study how to efficiently use LDA to improve ad-hoc retrieval. We propose an LDA-based document model within the language modeling framework, and evaluate it on several TREC collections. Gibbs sampling is employed to conduct approximate inference in LDA and the computational complexity is analyzed. We show that improvements over retrieval using cluster-based models can be obtained with reasonable efficiency.
DOI: 10.1145/1148170.1148204 | Published: 2006-08-06
Citations: 1226
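Within the language-modeling framework, an LDA-based document model of the kind the abstract describes can be sketched as a linear interpolation between the maximum-likelihood document model and an LDA mixture over topics. The interpolation weight and the toy topic distributions below are illustrative assumptions:

```python
def lda_document_prob(w, doc_tf, p_w_given_z, p_z_given_d, lam=0.7):
    """P(w|d) = lam * P_ml(w|d) + (1 - lam) * sum_z P(w|z) * P(z|d)."""
    doc_len = sum(doc_tf.values())
    p_ml = doc_tf.get(w, 0) / doc_len
    p_lda = sum(p_w_given_z[z].get(w, 0.0) * pz
                for z, pz in p_z_given_d.items())
    return lam * p_ml + (1 - lam) * p_lda

# Toy two-topic model; in the real system these distributions come from
# approximate inference (Gibbs sampling) over the collection.
p_w_given_z = {0: {"retrieval": 0.4, "index": 0.3, "query": 0.3},
               1: {"gene": 0.5, "protein": 0.5}}
p_z_given_d = {0: 0.9, 1: 0.1}
doc_tf = {"retrieval": 2, "index": 1}

# "query" never occurs in the document, yet its topic assigns it mass:
print(lda_document_prob("query", doc_tf, p_w_given_z, p_z_given_d))  # 0.081
```

This is the payoff of topic smoothing for ad-hoc retrieval: a document can match query terms it never contains, as long as its dominant topics generate them.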
Getting work done on the web: supporting transactional queries
Yunyao Li, R. Krishnamurthy, Shivakumar Vaithyanathan, H. Jagadish
Many searches on the web have a transactional intent. We argue that pages satisfying transactional needs can be distinguished from the more common pages that have some information and links, but cannot be used to execute a transaction. Based on this hypothesis, we provide a recipe for constructing a transaction annotator. By constructing an annotator with one corpus and then demonstrating its classification performance on another, we establish its robustness. Finally, we show experimentally that a search procedure that exploits such pre-annotation greatly outperforms traditional search for retrieving transactional pages.
DOI: 10.1145/1148170.1148266 | Published: 2006-08-06
Citations: 33
Structure-driven crawler generation by example
Márcio L. A. Vidal, A. D. Silva, E. Moura, J. Cavalcanti
Many Web IR and Digital Library applications require a crawling process to collect pages, with the ultimate goal of taking advantage of useful information available on Web sites. For some of these applications, the criteria that determine whether a page belongs in a collection are related to the page content. However, there are situations in which the inner structure of the pages provides a better criterion for guiding the crawling process than their content. In this paper, we present a structure-driven approach for generating Web crawlers that requires minimal effort from users. The idea is to take as input a sample page and an entry point to a Web site and generate a structure-driven crawler based on navigation patterns: sequences of patterns for the links a crawler has to follow to reach pages structurally similar to the sample page. In the experiments we carried out, structure-driven crawlers generated by our new approach were able to collect all pages that match the given samples, including pages added after the crawlers were generated.
DOI: 10.1145/1148170.1148223 | Published: 2006-08-06
Citations: 61