
Foundations and Trends in Information Retrieval: Latest Articles

Web Crawling
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2010-03-01 | DOI: 10.1561/1500000017
Christopher Olston, Marc Najork
This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
Pages: 175-246
Citations: 2
Mining Query Logs: Turning Search Usage Data into Knowledge
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2010-01-01 | DOI: 10.1561/1500000013
F. Silvestri
Web search engines have stored information about users in their logs since they started to operate. This information often serves many purposes. The primary focus of this survey is on introducing the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by analyzing popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging current open problems in this field.
Pages: 1-174
Citations: 200
Concept-Based Video Retrieval
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2009-05-26 | DOI: 10.1561/1500000014
Cees G. M. Snoek, M. Worring
In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives, the majority of which are concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human–computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as those carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.
Pages: 215-322
Citations: 429
Methods for Evaluating Interactive Information Retrieval Systems with Users
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2009-04-28 | DOI: 10.1561/1500000012
D. Kelly
This paper provides an overview and instruction regarding the evaluation of interactive information retrieval systems with users. The primary goal of this article is to catalog and compile material related to this topic into a single source. This article (1) provides historical background on the development of user-centered approaches to the evaluation of interactive information retrieval systems; (2) describes the major components of interactive information retrieval system evaluation; (3) describes different experimental designs and sampling strategies; (4) presents core instruments and data collection techniques and measures; (5) explains basic data analysis techniques; and (6) reviews and discusses previous studies. This article also discusses validity and reliability issues with respect to both measures and methods, presents background information on research ethics and discusses some ethical issues which are specific to studies of interactive information retrieval (IIR). Finally, this article concludes with a discussion of outstanding challenges and future research directions.
Pages: 1-224
Citations: 621
The Probabilistic Relevance Framework: BM25 and Beyond
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2009-04-01 | DOI: 10.1561/1500000019
S. Robertson, H. Zaragoza
The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s and 1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.
Pages: 333-389
Citations: 2328
Opinion Mining and Sentiment Analysis
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2008-07-08 | DOI: 10.1561/1500000011
B. Pang, Lillian Lee
An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.
Pages: 1-135
Citations: 4579
Email Spam Filtering: A Systematic Review
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2008-06-23 | DOI: 10.1561/1500000006
G. Cormack
Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet, current spam filters are remarkably effective, more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media (such as instant messaging and the Web) are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures, and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations. 
In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.
Pages: 335-455
Citations: 296
Authorship Attribution
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2008-03-06 | DOI: 10.1561/1500000005
P. Juola
Authorship attribution, the science of inferring characteristics of the author from the characteristics of documents written by that author, is a problem with a long history and a wide range of application. Recent work in "non-traditional" authorship attribution demonstrates the practicality of automatically analyzing documents based on authorial style, but the state of the art is confusing. Analyses are difficult to apply, little is known about type or rate of errors, and few "best practices" are available. In part because of this confusion, the field has perhaps had less uptake and general acceptance than is its due. This review surveys the history and present state of the discipline, presenting some comparative results when available. It shows, first, that the discipline is quite successful, even in difficult cases involving small documents in unfamiliar and less studied languages; it further analyzes the types of analysis and features used and tries to determine characteristics of well-performing systems, finally formulating these in a set of recommendations for best practices.
Pages: 233-334
Citations: 962
Statistical Language Models for Information Retrieval: A Critical Review
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2008-03-01 | DOI: 10.1561/1500000008
ChengXiang Zhai
Statistical language models have recently been successfully applied to many information retrieval problems. A great deal of recent work has shown that statistical language models not only lead to superior empirical performance, but also facilitate parameter tuning and open up possibilities for modeling nontraditional retrieval problems. In general, statistical language models provide a principled way of modeling various kinds of retrieval problems. The purpose of this survey is to systematically and critically review the existing work in applying statistical language models to information retrieval, summarize their contributions, and point out outstanding challenges.
Pages: 137-213
Citations: 348
Open-Domain Question-Answering
IF 10.4 | Zone 2 (Computer Science) | Q1 Computer Science, Information Systems | Pub Date: 2007-08-01 | DOI: 10.1561/1500000001
J. Prager
Open-Domain Question Answering is an introduction to the field of Question Answering (QA). It covers the basic principles of QA along with a selection of systems that have exhibited interesting and significant techniques, so it serves more as a tutorial than as an exhaustive survey of the field. Starting with a brief history of the field, it goes on to describe the architecture of a QA system before analysing in detail some of the specific approaches that have been successfully deployed by academia and industry designing and building such systems. Open-Domain Question Answering is both a guide for beginners who are embarking on research in this area, and a useful reference for established researchers and practitioners in this field.
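The top-performing question answering (QA) systems fall into two types: consistent, solid, well-established, multi-faceted systems that perform well year after year, and systems that stand out by taking a completely innovative approach and outperform almost all others. This article examines both types in depth. We establish a "typical" QA system and cover the methods commonly used for its component modules. Understanding this will enable any skilled system developer to build their own QA system. Fortunately, many of the components are freely available to developers, which makes this a reasonable expectation for a graduate-level project. We also examine some well-performing systems that employ interesting and innovative approaches.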
Citations: 94