Moshe Koppel, Jonathan Schler, S. Argamon, Eran Messeri
In this paper, we use a blog corpus to demonstrate that we can often identify the author of an anonymous text even where there are many thousands of candidate authors. Our approach combines standard information retrieval methods with a text categorization meta-learning scheme that determines when to even venture a guess.
{"title":"Authorship attribution with thousands of candidate authors","authors":"Moshe Koppel, Jonathan Schler, S. Argamon, Eran Messeri","doi":"10.1145/1148170.1148304","DOIUrl":"https://doi.org/10.1145/1148170.1148304","url":null,"abstract":"In this paper, we use a blog corpus to demonstrate that we can often identify the author of an anonymous text even where there are many thousands of candidate authors. Our approach combines standard information retrieval methods with a text categorization meta-learning scheme that determines when to even venture a guess.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128066971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exhaustive evaluation of ranked queries can be expensive, particularly when only a small subset of the overall ranking is required, or when queries contain common terms. This concern gives rise to techniques for dynamic query pruning, that is, methods for eliminating redundant parts of the usual exhaustive evaluation, yet still generating a demonstrably "good enough" set of answers to the query. In this work we propose new pruning methods that make use of impact-sorted indexes. Compared to exhaustive evaluation, the new methods reduce the amount of computation performed, reduce the amount of memory required for accumulators, reduce the amount of data transferred from disk, and at the same time allow performance guarantees in terms of precision and mean average precision. These strong claims are backed by experiments using the TREC Terabyte collection and queries.
{"title":"Pruned query evaluation using pre-computed impacts","authors":"V. Anh, Alistair Moffat","doi":"10.1145/1148170.1148235","DOIUrl":"https://doi.org/10.1145/1148170.1148235","url":null,"abstract":"Exhaustive evaluation of ranked queries can be expensive, particularly when only a small subset of the overall ranking is required, or when queries contain common terms. This concern gives rise to techniques for dynamic query pruning, that is, methods for eliminating redundant parts of the usual exhaustive evaluation, yet still generating a demonstrably \"good enough\" set of answers to the query. In this work we propose new pruning methods that make use of impact-sorted indexes. Compared to exhaustive evaluation, the new methods reduce the amount of computation performed, reduce the amount of memory required for accumulators, reduce the amount of data transferred from disk, and at the same time allow performance guarantees in terms of precision and mean average precision. These strong claims are backed by experiments using the TREC Terabyte collection and queries.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"9 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130846893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present rpref, our generalization of the bpref evaluation metric for assessing the quality of search engine results given graded rather than binary user relevance judgments.
{"title":"Rpref: a generalization of Bpref towards graded relevance judgments","authors":"Jan De Beer, Marie-Francine Moens","doi":"10.1145/1148170.1148293","DOIUrl":"https://doi.org/10.1145/1148170.1148293","url":null,"abstract":"We present rpref ; our generalization of the bpref evaluation metric for assessing the quality of search engine results, given graded rather than binary user relevance judgments.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131365869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We discuss information retrieval methods that aim at serving a diverse stream of user queries. We propose methods that emphasize the importance of taking query difference into consideration when learning effective retrieval functions. We formulate the problem as a multi-task learning problem using a risk minimization framework. In particular, we show how to calibrate the empirical risk to incorporate query difference by introducing nuisance parameters into the statistical models, and we propose an alternating optimization method to simultaneously learn the retrieval function and the nuisance parameters. We illustrate the effectiveness of the proposed methods using modeling data extracted from a commercial search engine.
{"title":"Incorporating query difference for learning retrieval functions in information retrieval","authors":"H. Zha, Zhaohui Zheng, Haoying Fu, Gordon Sun","doi":"10.1145/1148170.1148335","DOIUrl":"https://doi.org/10.1145/1148170.1148335","url":null,"abstract":"We discuss information retrieval methods that aim at serving a diverse stream of user queries. We propose methods that emphasize the importance of taking into consideration of query difference in learning effective retrieval functions. We formulate the problem as a multi-task learning problem using a risk minimization framework. In particular, we show how to calibrate the empirical risk to incorporate query difference in terms of introducing nuisance parameters in the statistical models, and we also propose an alternating optimization method to simultaneously learn the retrieval function and the nuisance parameters. We illustrate the effectiveness of the proposed methods using modeling data extracted from a commercial search engine.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133841862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. As document sets grow larger, a constant-size pool represents an increasingly small percentage of the document set, and at some point the assumption of approximately complete judgments must become invalid. This paper demonstrates that the AQUAINT 2005 test collection exhibits bias caused by pools that were too shallow for the document set size, despite many diverse runs contributing to the pools. The existing judgment set favors relevant documents that contain topic title words even though relevant documents containing few topic title words are known to exist in the document set. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
{"title":"Bias and the limits of pooling","authors":"C. Buckley, D. Dimmick, I. Soboroff, E. Voorhees","doi":"10.1145/1148170.1148284","DOIUrl":"https://doi.org/10.1145/1148170.1148284","url":null,"abstract":"Modern retrieval test collections are built through a process called pooling in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents such that when unjudged documents are assumed to be nonrelevant the resulting judgment set is sufficiently complete and unbiased. As document sets grow larger, a constant-size pool represents an increasingly small percentage of the document set, and at some point the assumption of approximately complete judgments must become invalid.This paper demonstrates that the AQUAINT 2005 test collection exhibits bias caused by pools that were too shallow for the document set size despite having many diverse runs contribute to the pools. The existing judgment set favors relevant documents that contain topic title words even though relevant documents containing few topic title words are known to exist in the document set. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115313124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Give me just one highly relevant document: P-measure","authors":"T. Sakai","doi":"10.1145/1148170.1148322","DOIUrl":"https://doi.org/10.1145/1148170.1148322","url":null,"abstract":"We introduce an evaluation metric called P-measure for the task of retrieving <ione highly relevant document. It models user behaviour in practical tasks such as known-item search, and is more stable and sensitive than Reciprocal Rank which cannot handle graded relevance.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121348357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Blok, V. Mihajlović, G. Ramírez, T. Westerveld, D. Hiemstra, A. D. Vries
Not many XML information retrieval (IR) systems exist that allow easy addition of, and switching between, different IR models. A platform that provided this functionality would be ideal, especially in a scientific environment where building a system takes a lot of time and keeps researchers away from the real work, i.e., investigating which IR model is most effective. For this reason we developed such an XML IR system. It is centered around a logical algebra, named score region algebra (SRA), that enables transparent specification of IR models for XML databases (see [1] for more details). The transparency is achieved by the ability to instantiate various retrieval models, using abstract score functions within the algebra operators, while the logical query plan and operator definitions remain unchanged. Our algebra operators model three important aspects of XML IR: element relevance score computation, element score propagation, and element score combination. To implement a new IR model, one only needs to provide definitions for these abstract function classes. To illustrate the usefulness of our algebra, our demo system supports several well-known IR scoring models (e.g., Language Models, Okapi, and tf.idf), combined with different score propagation and combination functions. The user can select which model to use at run time. Following good practice in database systems design, our prototype system has a typical three-layered architecture. (1) The conceptual layer takes a NEXI [3] query expression as input, e.g.,
{"title":"The TIJAH XML information retrieval system","authors":"H. Blok, V. Mihajlović, G. Ramírez, T. Westerveld, D. Hiemstra, A. D. Vries","doi":"10.1145/1148170.1148338","DOIUrl":"https://doi.org/10.1145/1148170.1148338","url":null,"abstract":"Not many XML information retrieval (IR) systems exist that allow easy addition of and switching between different IR models. Especially for the scientific environment where building a system takes a lot of time and keeps researchers away from the real work, i.e., investigating what is the most effective IR model, a platform that would provide this functionality would be ideal. For this reason we developed such an XML IR system. It is centered around a logical algebra, named score region algebra (SRA), that enables transparent specification of IR models for XML databases (see [1] for more details). The transparency is achieved by a possibility to instantiate various retrieval models, using abstract score functions within algebra operators, while logical query plan and operator definitions remain unchanged. Our algebra operators model three important aspects of XML IR: element relevance score computation, element score propagation, and element score combination. To implement a new IR model, one only needs to provide definitions for these abstract function classes. To illustrate the usefulness of our algebra our demo system supports several, well known IR scoring models (e.g., Language Models, Okapi, and tf.idf), combined with different score propagation and combination functions. The user can select which model to use at run time. Following good practice in database systems design, our prototype system has a typical three-layered architecture. (1) The conceptual layer takes a NEXI [3] query expression as input, e.g.,","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128788096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what the author of a document meant. That perspective yields a computational formulation that provides a natural way of combining what have been known as query and document translation. Two well-recognized techniques are shown to be special cases of this model under restrictive assumptions. Cross-language search results are reported that are statistically indistinguishable from strong monolingual baselines for both French and Chinese documents.
{"title":"Combining bidirectional translation and synonymy for cross-language information retrieval","authors":"Jianqiang Wang, Douglas W. Oard","doi":"10.1145/1148170.1148208","DOIUrl":"https://doi.org/10.1145/1148170.1148208","url":null,"abstract":"This paper introduces a general framework for the use of translation probabilities in cross-language information retrieval based on the notion that information retrieval fundamentally requires matching what the searcher means with what the author of a document meant. That perspective yields a computational formulation that provides a natural way of combining what have been known as query and document translation. Two well-recognized techniques are shown to be a special case of this model under restrictive assumptions. Cross-language search results are reported that are statistically indistinguishable from strong monolingual baselines for both French and Chinese documents.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"3476 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127505976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present a novel multi-webpage summarization algorithm. It incorporates a graph-based ranking algorithm into the framework of the Maximum Marginal Relevance (MMR) method, so as to both capture the main topic of the web pages and eliminate the redundancy among the sentences of the summary. Experimental results indicate that the new approach performs better than previous methods.
{"title":"A new web page summarization method","authors":"Q. Diao, Jiulong Shan","doi":"10.1145/1148170.1148294","DOIUrl":"https://doi.org/10.1145/1148170.1148294","url":null,"abstract":"In this paper, we present a novel multi-webpage summarization algorithm. It adds the graph based ranking algorithm into the framework of Maximum Marginal Relevance (MMR) method, to not only capture the main topic of the web pages but also eliminate the redundancy existing in the sentences of the summary result. The experiment result indicates that the new approach has the better performance than the previous methods.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117063662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for detecting sentences in an input document set that are both relevant and novel with respect to an information need would be of direct benefit to many systems, such as extractive text summarizers. However, satisfactory levels of agreement between judges performing this task manually have yet to be demonstrated, leaving researchers to conclude that the task is too subjective. In previous experiments, judges were asked to first identify sentences that are relevant to a general topic, and then to eliminate from the list sentences that do not contain new information. Here, a new task is proposed in which annotators perform the same procedure, but within the context of a specific, factual information need. In the experiment, satisfactory levels of agreement between independent annotators were achieved on the first step of identifying sentences containing relevant information. However, the results indicate that judges do not agree on which sentences contain novel information.
{"title":"Fact-focused novelty detection: a feasibility study","authors":"Jahna Otterbacher, Dragomir R. Radev","doi":"10.1145/1148170.1148318","DOIUrl":"https://doi.org/10.1145/1148170.1148318","url":null,"abstract":"Methods for detecting sentences in an input document set, which are both relevant and novel with respect to an information need, would be of direct benefit to many systems, such as extractive text summarizers. However, satisfactory levels of agreement between judges performing this task manually have yet to demonstrated, leaving researchers to conclude that the task is too subjective. In previous experiments, judges were asked to first identify sentences that are relevant to a general topic, and then to eliminate sentences from the list that do not contain new information. Currently, a new task is proposed, in which annotators perform the same procedure, but within the context of a specific, factual information need. In the experiment, satisfactory levels of agreement between independent annotators were achieved on the first step of identifying sentences containing relevant information relevant. However, the results indicate that judges do not agree on which sentences contain novel information.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121568633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}