Foundations and Trends in Information Retrieval: Latest Articles

Expertise Retrieval
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2012-08-12; DOI: 10.1561/1500000024
K. Balog, Yi Fang, M. de Rijke, P. Serdyukov, Luo Si
People have looked for experts since before the advent of computers. With advances in information retrieval technology and the large-scale availability of digital traces of knowledge-related activities, computer systems that can fully automate the process of locating expertise have become a reality. The past decade has witnessed tremendous interest, and a wealth of results, in expertise retrieval as an emerging subdiscipline in information retrieval. This survey highlights advances in models and algorithms relevant to this field. We draw connections among methods proposed in the literature and summarize them in five groups of basic approaches. These serve as the building blocks for more advanced models that arise when we consider a range of content-based factors that may impact the strength of association between a topic and a person. We also discuss practical aspects of building an expert search system and present applications of the technology in other domains, such as blog distillation and entity retrieval. The limitations of current approaches are also pointed out. We end our survey with a set of conjectures on what the future may hold for expertise retrieval research.
Pages: 127-256
Citations: 227
Information Retrieval on the Blogosphere
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2012-08-01; DOI: 10.1561/1500000026
Rodrygo L. T. Santos, C. Macdonald, R. McCreadie, I. Ounis, I. Soboroff
Blogs have recently emerged as a new open, rapidly evolving and reactive publishing medium on the Web. Rather than being managed by a central entity, the content on the blogosphere (the collection of all blogs on the Web) is produced by millions of independent bloggers, who can write about virtually anything. This open publishing paradigm has led to a growing mass of user-generated content on the Web, which can vary tremendously both in format and quality when looked at in isolation, but which can also reveal interesting patterns when observed in aggregate. One field particularly interested in studying how information is produced, consumed, and searched in the blogosphere is information retrieval. In this survey, we review the published literature on searching the blogosphere. In particular, we describe the phenomenon of blogging and the motivations for searching for information on blogs. We cover both the search tasks underlying blog searchers' information needs and the most successful approaches to these tasks. These include blog post and full blog search tasks, as well as blog-aided search tasks, such as trend and market analysis. Finally, we also describe the publicly available resources that support research on searching the blogosphere.
Pages: 1-125
Citations: 26
Spoken Content Retrieval: A Survey of Techniques and Technologies
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2012-02-23; DOI: 10.1561/1500000020
M. Larson, G. Jones
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Pages: 235-422
Citations: 81
Federated Search
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2011-03-05; DOI: 10.1561/1500000010
Milad Shokouhi, Luo Si
Federated search (federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot easily index uncrawlable hidden web collections, while federated search systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents is selected. This creates the collection selection problem. To be able to select suitable collections, federated search systems need to acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. The goal of this work is to provide a comprehensive summary of the previous research on the federated search challenges described above.
Pages: 1-102
Citations: 167
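The three federated search steps named in the abstract above (collection representation, collection selection, result merging) can be illustrated with a toy pipeline. This is only a minimal sketch under strong assumptions, not a method taken from the survey: the function names are hypothetical, each collection is a plain in-memory list of strings, the representation is a term-frequency summary built from the full collection (real systems often have to estimate it, for example by sampling), and merging simply normalizes each result list by its own top score.

```python
from collections import Counter


def represent(collection_docs):
    """Collection representation: term counts over the collection's documents."""
    counts = Counter()
    for doc in collection_docs:
        counts.update(doc.lower().split())
    return counts


def select_collections(query, representations, k=2):
    """Collection selection: keep the k collections whose summaries match the query best."""
    terms = query.lower().split()
    scores = {name: sum(rep[t] for t in terms) for name, rep in representations.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


def search(collection_docs, query):
    """Toy per-collection engine: score is the number of query-term occurrences."""
    terms = query.lower().split()
    results = []
    for doc in collection_docs:
        words = doc.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            results.append((doc, score))
    return results


def merge(result_lists):
    """Result merging: normalize each list by its top score, then sort globally."""
    merged = []
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        merged.extend((doc, score / top) for doc, score in results)
    return sorted(merged, key=lambda item: (-item[1], item[0]))


def federated_search(query, collections, k=2):
    """Run representation, selection, per-collection search, and merging in order."""
    reps = {name: represent(docs) for name, docs in collections.items()}
    selected = select_collections(query, reps, k)
    return merge([search(collections[name], query) for name in selected])


if __name__ == "__main__":
    toy = {
        "news": ["web search engines", "stock market news"],
        "recipes": ["apple pie recipe"],
        "papers": ["federated search survey", "search search evaluation"],
    }
    for doc, score in federated_search("search", toy):
        print(f"{score:.2f}  {doc}")
```

Normalizing by each list's top score is only one simple way to make scores from different engines comparable; it is used here because it needs no information beyond the returned lists.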
Adversarial Web Search
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2011-01-09; DOI: 10.1561/1500000021
C. Castillo, Brian D. Davison
Web search engines have become indispensable tools for finding content. As the popularity of the Web has increased, the efforts to exploit the Web for commercial, social, or political advantage have grown, making it harder for search engines to discriminate between truthful signals of content quality and deceptive attempts to game search engines' rankings. This problem is further complicated by the open nature of the Web, which allows anyone to write and publish anything, and by the fact that search engines must analyze ever-growing numbers of Web pages. Moreover, increasing expectations of users, who over time rely on Web search for information needs related to more aspects of their lives, further deepen the need for search engines to develop effective counter-measures against deception. In this monograph, we consider the effects of the adversarial relationship between search systems and those who wish to manipulate them, a field known as "Adversarial Information Retrieval". We show that search engine spammers create false content and misleading links to lure unsuspecting visitors to pages filled with advertisements or malware. We also examine work over the past decade or so that aims to discover such spamming activities to get spam pages removed or their effect on the quality of the results reduced. Research in Adversarial Information Retrieval has been evolving over time, and currently continues both in traditional areas (e.g., link spam) and newer areas, such as click fraud and spam in social media, demonstrating that this conflict is far from over.
Pages: 377-486
Citations: 116
Automatic Summarization
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2011-01-01; DOI: 10.1561/1500000015
A. Nenkova, S. Maskey, Yang Liu
It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain and genre specific summarization and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field. We would like to thank the anonymous reviewers, our students and Noemie Elhadad, Hongyan Jing, Julia Hirschberg, Annie Louis, Smaranda Muresan and Dragomir Radev for their helpful feedback. This paper was supported in part by the U.S. National Science Foundation (NSF) under IIS-05-34871 and CAREER 09-53445. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. Full text available at: http://dx.doi.org/10.1561/1500000015
Pages: 103-233
Citations: 427
Test Collection Based Evaluation of Information Retrieval Systems
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2010-06-03; DOI: 10.1561/1500000009
M. Sanderson
The use of test collections and evaluation measures to assess the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. Across the nearly 60 years since that work started, the use of test collections has become a de facto standard of evaluation. This monograph surveys the research conducted and explains the methods and measures devised for evaluation of retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. It also reviews more recent examinations of the validity of the test collection approach and evaluation measures, and outlines trends in current research exploiting query logs and live labs. At its core, the modern-day test collection is little different from the structures that the pioneering researchers in the 1950s and 1960s conceived of. This tutorial and review shows that despite its age, this long-standing evaluation method is still a highly valued tool for retrieval research.
Pages: 247-375
Citations: 399
Web Crawling
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2010-03-01; DOI: 10.1561/1500000017
Christopher Olston, Marc Najork
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Pages: 175-246
Citations: 2
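The abstract above notes that web crawling looks, at first glance, like an application of breadth-first search. That baseline view can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's model: `crawl_bfs` and `fetch_links` are hypothetical names, the "web" is an in-memory link graph rather than real HTTP fetches, and the sketch ignores exactly the concerns the survey raises (very large data structures, revisit scheduling, politeness).

```python
from collections import deque


def crawl_bfs(seed, fetch_links, max_pages=100):
    """Visit pages in breadth-first order starting from seed.

    fetch_links is a caller-supplied function (hypothetical here) that
    returns the outgoing links of a page.
    """
    frontier = deque([seed])  # URLs discovered but not yet visited
    seen = {seed}             # every URL ever placed on the frontier
    order = []                # visit order
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    # Toy link graph standing in for the Web.
    graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
    print(crawl_bfs("a", lambda url: graph.get(url, [])))
```

Keeping `seen` separate from the frontier prevents a page from being queued twice, which is the part of the data structure that becomes hard to manage at web scale.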
Mining Query Logs: Turning Search Usage Data into Knowledge
IF 10.4; Zone 2 (Computer Science); Q1 (Computer Science, Information Systems); Pub Date: 2010-01-01; DOI: 10.1561/1500000013
F. Silvestri
Web search engines have stored information about users in their logs since they started to operate. This information serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging open problems in this field.
{"title":"Mining Query Logs: Turning Search Usage Data into Knowledge","authors":"F. Silvestri","doi":"10.1561/1500000013","DOIUrl":"https://doi.org/10.1561/1500000013","url":null,"abstract":"Web search engines have stored in their logs information about users since they started to operate. This information often serves many purposes. The primary focus of this survey is on introducing to the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular applications of query log mining and their influence on user experience. We conclude the paper by, briefly, presenting some of the most challenging current open problems in this field.","PeriodicalId":48829,"journal":{"name":"Foundations and Trends in Information Retrieval","volume":"1 1","pages":"1-174"},"PeriodicalIF":10.4,"publicationDate":"2010-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91093428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 200
Concept-Based Video Retrieval
IF 10.4 Zone 2 Computer Science Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2009-05-26 DOI: 10.1561/1500000014
Cees G. M. Snoek, M. Worring
In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives, the majority of which are concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human–computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as those carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.
Citations: 429