Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval: latest papers

Finding near-duplicate web pages: a large-scale evaluation of algorithms

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148222

M. Henzinger

Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
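The two families of techniques named in this abstract can be illustrated with a toy sketch. This is not the configuration evaluated in the paper: the shingle size `k`, the 64-bit fingerprint width, the use of MD5 as the token hash, and all function names are assumptions chosen for illustration only. The first pair of functions compares documents by the overlap of their token shingles; the second pair builds a bit fingerprint in which each token votes on every bit, so that similar documents tend to differ in few bits.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-token shingles (contiguous token windows) of text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def simhash(text, bits=64):
    """Bit fingerprint: every token votes +1 or -1 on each bit position."""
    counts = [0] * bits
    for token in text.lower().split():
        # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

In such a sketch, two pages would be flagged as near-duplicates when `jaccard(shingles(a), shingles(b))` is above some threshold, or when `hamming(simhash(a), simhash(b))` is below some threshold; both thresholds would have to be tuned on labeled data.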

Citations: 514

Analysis of a low-dimensional linear model under recommendation attacks

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148259

Sheng Zhang, Ouyang Yi, J. Ford, F. Makedon

Collaborative filtering techniques have become popular in the past decade as an effective way to help people deal with information overload. Recent research has identified significant vulnerabilities in collaborative filtering techniques. Shilling attacks, in which attackers introduce biased ratings to influence recommendation systems, have been shown to be effective against memory-based collaborative filtering algorithms. We examine the effectiveness of two popular shilling attacks (the random attack and the average attack) on a model-based algorithm that uses Singular Value Decomposition (SVD) to learn a low-dimensional linear model. Our results show that the SVD-based algorithm is much more resistant to shilling attacks than memory-based algorithms. Furthermore, we develop an attack detection method directly built on the SVD-based algorithm and show that this method detects random shilling attacks with high detection rates and very low false alarm rates.

Citations: 90

Automatic construction of known-item finding test beds

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148276

L. Azzopardi, M. de Rijke

This work is an initial study on the utility of automatically generated queries for evaluating known-item retrieval and how such queries compare to real queries. The main advantage of automatically generating queries is that for any given test collection numerous queries can be produced at minimal cost. For evaluation, this has huge ramifications as state-of-the-art algorithms can be tested on different types of generated queries which mimic particular querying styles that a user may adopt. Our approach draws upon previous research in IR which has probabilistically generated simulated queries for other purposes [2, 3].
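One way to picture probabilistic query generation for known-item search is the sketch below. It is an assumption-laden illustration, not the authors' generation model: the function name, the `noise` mixing parameter, and the fixed query `length` are all hypothetical. A target document is picked at random, and each query term is then drawn either from that document (in proportion to term frequency) or, with probability `noise`, from the whole collection.

```python
import random

def generate_known_item_query(docs, rng, length=3, noise=0.2):
    """Return (target_doc_id, query_terms) for a simulated known-item query.

    docs maps document ids to non-empty text; rng is a random.Random instance.
    """
    doc_id = rng.choice(sorted(docs))
    doc_terms = docs[doc_id].lower().split()
    # Pooling every token keeps repeats, so sampling follows term frequency.
    collection_terms = [t for text in docs.values() for t in text.lower().split()]
    query = []
    for _ in range(length):
        # With probability `noise`, draw a term that may not come from the target.
        source = collection_terms if rng.random() < noise else doc_terms
        query.append(rng.choice(source))
    return doc_id, query
```

Each generated pair gives a query and the single document it is meant to retrieve, so a retrieval system can be scored by the rank at which it returns that document.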

Citations: 38

Action modeling: language models that predict query behavior

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148315

G. C. Murray, Jimmy J. Lin, Abdur Chowdhury

We present a novel language modeling approach to capturing the query reformulation behavior of Web search users. Based on a framework that categorizes eight different types of "user moves" (adding/removing query terms, etc.), we treat search sessions as sequence data and build n-gram language models to capture user behavior. We evaluated our models in a prediction task. The results suggest that useful patterns of activity can be extracted from user histories. Furthermore, by examining prediction performance under different order n-gram models, we gained insight into the amount of history/context that is associated with different types of user actions. Our work serves as the basis for more refined user models.
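The idea of treating search sessions as sequence data can be sketched with a minimal bigram model over action labels. The labels used in the usage note (`add_term`, `remove_term`, `other_move`) and the function names are hypothetical; the abstract names only adding and removing query terms among its eight move types, and it does not describe the authors' estimation or smoothing details.

```python
from collections import Counter, defaultdict

def train_bigram(sessions):
    """Count how often each action follows each preceding action."""
    counts = defaultdict(Counter)
    for session in sessions:
        padded = ["<s>"] + list(session)  # "<s>" marks the start of a session
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent action observed after prev, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

def prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 when prev is unseen."""
    if prev not in counts:
        return 0.0
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total
```

For example, after training on the sessions `[["add_term", "add_term", "remove_term"], ["add_term", "other_move"], ["add_term", "add_term"]]`, the model predicts `add_term` as the most likely move following `add_term`. Higher-order models would condition on longer histories in the same way.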

Citations: 1

Measuring similarity of semi-structured documents with context weights

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148334

Christopher C. Yang, Nan Liu

In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure. Experimental results based on a benchmark showed superior performance of the proposed measure over the baseline which ignores structural knowledge of XML documents.

Citations: 5

Answering complex questions with random walk models

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148211

S. Harabagiu, V. Lacatusu, Andrew Hickl

We present a novel framework for answering complex questions that relies on question decomposition. Complex questions are decomposed by a procedure that operates on a Markov chain, by following a random walk on a bipartite graph of relations established between concepts related to the topic of a complex question and subquestions derived from topic-relevant passages that manifest these relations. Decomposed questions discovered during this random walk are then submitted to a state-of-the-art Question Answering (Q/A) system in order to retrieve a set of passages that can later be merged into a comprehensive answer by a Multi-Document Summarization (MDS) system. In our evaluations, we show that access to the decompositions generated using this method can significantly enhance the relevance and comprehensiveness of summary-length answers to complex questions.

Citations: 75

Is XML retrieval meaningful to users?: searcher preferences for full documents vs. elements

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148306

Birger Larsen, A. Tombros, Saadia Malik

The aim of this study is to investigate whether element retrieval (as opposed to full-text retrieval) is meaningful and useful for searchers when carrying out information-seeking tasks. Our results suggest that searchers find the structural breakdown of documents useful when browsing within retrieved documents, and provide support for the usefulness of element retrieval in interactive settings.

Citations: 23

The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148191

Jimmy J. Lin, Dina Demner-Fushman

Despite its intuitive appeal, the hypothesis that retrieval at the level of "concepts" should outperform purely term-based approaches remains unverified empirically. In addition, the use of "knowledge" has not consistently resulted in performance gains. After identifying possible reasons for previous negative results, we present a novel framework for "conceptual retrieval" that articulates the types of knowledge that are important for information seeking. We instantiate this general framework in the domain of clinical medicine based on the principles of evidence-based medicine (EBM). Experiments show that an EBM-based scoring algorithm dramatically outperforms a state-of-the-art baseline that employs only term statistics. Ablation studies further yield a better understanding of the performance contributions of different components. Finally, we discuss how other domains can benefit from knowledge-based approaches.

Citations: 71

Hybrid index maintenance for growing text collections

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148233

Stefan Büttcher, C. Clarke, Brad Lushman

We present a new family of hybrid index maintenance strategies to be used in on-line index construction for monotonically growing text collections. These new strategies improve upon recent results for hybrid index maintenance in dynamic text retrieval systems. Like previous techniques, our new method distinguishes between short and long posting lists: While short lists are maintained using a merge strategy, long lists are kept separate and are updated in-place. This way, costly relocations of long posting lists are avoided. We discuss the shortcomings of previous hybrid methods and give an experimental evaluation of the new technique, showing that its index maintenance performance is superior to that of the earlier methods, especially when the amount of main memory available to the indexing system is small. We also present a complexity analysis which proves that, under a Zipfian term distribution, the asymptotical number of disk accesses performed by the best hybrid maintenance strategy is linear in the size of the text collection, implying the asymptotical optimality of the proposed strategy.
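The split between merge-maintained short lists and in-place long lists can be pictured with the toy in-memory model below. It is only a sketch under stated assumptions: the class name, the `long_threshold` cutoff, the `buffer_limit` memory budget, and the rule for promoting a list from short to long are all hypothetical, and no real disk layout is modeled.

```python
from collections import defaultdict

class HybridIndex:
    """Toy model of a hybrid merge / in-place inverted index (illustrative only)."""

    def __init__(self, long_threshold=4, buffer_limit=6):
        self.long_threshold = long_threshold  # hypothetical cutoff, in postings
        self.buffer_limit = buffer_limit      # hypothetical memory budget, in postings
        self.buffer = defaultdict(list)       # postings not yet flushed
        self.buffered = 0
        self.merged = {}                      # short lists, rewritten together on each flush
        self.in_place = {}                    # long lists, kept separately and appended to

    def add(self, term, doc_id):
        """Buffer one posting and flush when the memory budget is reached."""
        self.buffer[term].append(doc_id)
        self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Merge buffered postings into short lists; append to long lists in place."""
        new_merged = dict(self.merged)        # models rewriting the merged index
        for term, postings in self.buffer.items():
            if term in self.in_place:
                self.in_place[term].extend(postings)   # append, no relocation
            else:
                combined = new_merged.pop(term, []) + postings
                if len(combined) >= self.long_threshold:
                    self.in_place[term] = combined     # promote to a separate long list
                else:
                    new_merged[term] = combined
        self.merged = new_merged
        self.buffer = defaultdict(list)
        self.buffered = 0

    def postings(self, term):
        """Return flushed postings followed by any still-buffered postings."""
        flushed = self.in_place.get(term) or self.merged.get(term, [])
        return flushed + self.buffer.get(term, [])
```

Once a term has been promoted, later flushes only append to its list, which is the behavior the abstract credits with avoiding costly relocations of long posting lists.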

Citations: 51

Searching the web using composed pages

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

Pub Date : 2006-08-06 DOI: 10.1145/1148170.1148331

R. Varadarajan, Vagelis Hristidis, Tao Li

Given a user keyword query, current Web search engines return a list of pages ranked by their “goodness” with respect to the query. However, this technique misses results whose contents are distributed across multiple physical pages and are connected via hyperlinks and frames [3]. That is, it is often the case that no single page contains all query keywords. Li et al. [3] make a first step towards this problem by returning a tree of hyperlinked pages that collectively contain all query keywords. The limitation of this approach is that it operates at the page-level granularity, which ignores the specific context where the keywords are found within the pages. More importantly, it is cumbersome for the user to locate the most desirable tree of pages due to the amount of data in each page tree and a large number of page trees.

Citations: 2
