Foundations and Trends in Information Retrieval: Latest Articles

Adversarial Web Search

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2011-01-09 DOI: 10.1561/1500000021

C. Castillo, Brian D. Davison

Web search engines have become indispensable tools for finding content. As the popularity of the Web has increased, the efforts to exploit the Web for commercial, social, or political advantage have grown, making it harder for search engines to discriminate between truthful signals of content quality and deceptive attempts to game search engines' rankings. This problem is further complicated by the open nature of the Web, which allows anyone to write and publish anything, and by the fact that search engines must analyze ever-growing numbers of Web pages. Moreover, increasing expectations of users, who over time rely on Web search for information needs related to more aspects of their lives, further deepen the need for search engines to develop effective counter-measures against deception. In this monograph, we consider the effects of the adversarial relationship between search systems and those who wish to manipulate them, a field known as "Adversarial Information Retrieval". We show that search engine spammers create false content and misleading links to lure unsuspecting visitors to pages filled with advertisements or malware. We also examine work from the past decade or so that aims to discover such spamming activities, so that spam pages can be removed or their effect on the quality of the results reduced. Research in Adversarial Information Retrieval has been evolving over time, and currently continues both in traditional areas (e.g., link spam) and newer areas, such as click fraud and spam in social media, demonstrating that this conflict is far from over.

Citations: 116

Automatic Summarization

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2011-01-01 DOI: 10.1561/1500000015

A. Nenkova, S. Maskey, Yang Liu

It has now been 50 years since the publication of Luhn’s seminal paper on automatic summarization. During these years the practical need for automatic summarization has become increasingly urgent, and numerous papers have been published on the topic. As a result, it has become harder to find a single reference that gives an overview of past efforts or a complete view of summarization tasks and necessary system components. This article attempts to fill this void by providing a comprehensive overview of research in summarization, including the more traditional efforts in sentence extraction as well as the most novel recent approaches for determining important content, for domain- and genre-specific summarization, and for evaluation of summarization. We also discuss the challenges that remain open, in particular the need for language generation and deeper semantic understanding of language that would be necessary for future advances in the field. We would like to thank the anonymous reviewers, our students and Noemie Elhadad, Hongyan Jing, Julia Hirschberg, Annie Louis, Smaranda Muresan and Dragomir Radev for their helpful feedback. This paper was supported in part by the U.S. National Science Foundation (NSF) under IIS-05-34871 and CAREER 09-53445. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. Full text available at: http://dx.doi.org/10.1561/1500000015

Citations: 427

Test Collection Based Evaluation of Information Retrieval Systems

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2010-06-03 DOI: 10.1561/1500000009

M. Sanderson

The use of test collections and evaluation measures to assess the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. In the nearly 60 years since that work started, the use of test collections has become a de facto standard of evaluation. This monograph surveys the research conducted and explains the methods and measures devised for evaluating retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. It also reviews more recent examinations of the validity of the test collection approach and of evaluation measures, and outlines trends in current research exploiting query logs and live labs. At its core, the modern-day test collection is little different from the structures that the pioneering researchers of the 1950s and 1960s conceived. This tutorial and review shows that, despite its age, this long-standing evaluation method is still a highly valued tool for retrieval research.
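As a rough illustration of what an evaluation measure computed over a test collection looks like, the sketch below scores ranked results against relevance judgments using average precision. The abstract does not name a specific measure; average precision is chosen here only as one widely used example, and the function names, document IDs, and judgments are all hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc IDs.

    Assumes `ranking` contains no duplicate document IDs.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision; `runs` and `qrels` map topic -> list/set."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)


# Hypothetical two-topic test collection: system output and relevance judgments.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d6"}}
map_score = mean_average_precision(runs, qrels)
```

Comparing two systems on the same topics and judgments would then amount to comparing their per-topic scores, which is where the significance testing mentioned in the abstract comes in.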

Citations: 399

Web Crawling

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2010-03-01 DOI: 10.1561/1500000017

Christopher Olston, Marc Najork

This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, the truth is that there are many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.
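To make the "application of breadth-first search" view concrete, here is a minimal sketch of a breadth-first crawl order over a small in-memory link graph. It is a toy under stated assumptions: `link_graph`, the page names, and `max_pages` are hypothetical, no network access is performed, and none of the challenges the survey describes (scale, politeness, revisit scheduling) are addressed.

```python
from collections import deque


def crawl_order(link_graph, seeds, max_pages=100):
    """Return the order in which a breadth-first crawler would visit pages.

    `link_graph` maps a page to the list of pages it links to (a stand-in
    for fetching a page and extracting its links).
    """
    frontier = deque(dict.fromkeys(seeds))  # FIFO frontier, duplicate seeds dropped
    seen = set(frontier)                    # pages already queued or visited
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


# Hypothetical four-page site.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
visited = crawl_order(graph, ["a"])
```

Even in this toy, `seen` grows with every discovered page, which hints at why managing very large data structures becomes a central systems concern at Web scale.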

Citations: 2

Mining Query Logs: Turning Search Usage Data into Knowledge

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2010-01-01 DOI: 10.1561/1500000013

F. Silvestri

Web search engines have stored information about users in their logs since they started to operate. This information serves many purposes. The primary focus of this survey is to introduce the discipline of query mining by showing its foundations and by analyzing the basic algorithms and techniques that are used to extract useful knowledge from this (potentially) infinite source of information. We show how search applications may benefit from this kind of analysis by examining popular applications of query log mining and their influence on user experience. We conclude the paper by briefly presenting some of the most challenging open problems in this field.

Citations: 200

Concept-Based Video Retrieval

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2009-05-26 DOI: 10.1561/1500000014

Cees G. M. Snoek, M. Worring

In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives, most of which are concept-based. Central to our discussion, therefore, is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human–computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as those carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.

Citations: 429

Methods for Evaluating Interactive Information Retrieval Systems with Users

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2009-04-28 DOI: 10.1561/1500000012

D. Kelly

This paper provides an overview of, and instruction in, the evaluation of interactive information retrieval systems with users. The primary goal of this article is to catalog and compile material related to this topic into a single source. This article (1) provides historical background on the development of user-centered approaches to the evaluation of interactive information retrieval systems; (2) describes the major components of interactive information retrieval system evaluation; (3) describes different experimental designs and sampling strategies; (4) presents core instruments and data collection techniques and measures; (5) explains basic data analysis techniques; and (6) reviews and discusses previous studies. This article also discusses validity and reliability issues with respect to both measures and methods, presents background information on research ethics and discusses some ethical issues which are specific to studies of interactive information retrieval (IIR). Finally, this article concludes with a discussion of outstanding challenges and future research directions.

Citations: 621

The Probabilistic Relevance Framework: BM25 and Beyond

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2009-04-01 DOI: 10.1561/1500000019

S. Robertson, H. Zaragoza

The Probabilistic Relevance Framework (PRF) is a formal framework for document retrieval, grounded in work done in the 1970s and 1980s, which led to the development of one of the most successful text-retrieval algorithms, BM25. In recent years, research in the PRF has yielded new retrieval models capable of taking into account document meta-data (especially structure and link-graph information). Again, this has led to one of the most successful Web-search and corporate-search algorithms, BM25F. This work presents the PRF from a conceptual point of view, describing the probabilistic modelling assumptions behind the framework and the different ranking algorithms that result from its application: the binary independence model, relevance feedback models, BM25 and BM25F. It also discusses the relation between the PRF and other statistical models for IR, and covers some related topics, such as the use of non-textual features, and parameter optimisation for models with free parameters.
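For readers unfamiliar with BM25, the sketch below shows one common textbook form of the scoring function over pre-tokenized documents. It is an illustration under stated assumptions, not the monograph's own formulation: the `bm25_scores` name, the toy corpus, the default values `k1=1.2` and `b=0.75`, and the non-negative `log(1 + ...)` IDF variant are all choices made here. BM25F, which weights document fields, is not covered.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Assumes `docs` is a non-empty list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant; the "1 +" keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: b=0 disables it, b=1 applies it fully.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # Term-frequency saturation controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical three-document corpus.
corpus = [
    ["bm25", "ranking", "function"],
    ["web", "crawling", "survey"],
    ["bm25", "bm25", "ranking"],
]
scores = bm25_scores(["bm25"], corpus)
```

`k1` and `b` are the kind of free parameters the abstract's remark about parameter optimisation refers to; in practice they are tuned per collection rather than fixed at these illustrative defaults.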

Citations: 2328

Opinion Mining and Sentiment Analysis

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2008-07-08 DOI: 10.1561/1500000011

B. Pang, Lillian Lee

An important part of our information-gathering behavior has always been to find out what other people think. With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object. This survey covers techniques and approaches that promise to directly enable opinion-oriented information-seeking systems. Our focus is on methods that seek to address the new challenges raised by sentiment-aware applications, as compared to those that are already present in more traditional fact-based analysis. We include material on summarization of evaluative text and on broader issues regarding privacy, manipulation, and economic impact that the development of opinion-oriented information-access services gives rise to. To facilitate future work, a discussion of available resources, benchmark datasets, and evaluation campaigns is also provided.

Citations: 4579

Email Spam Filtering: A Systematic Review

IF 10.4 · Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Foundations and Trends in Information Retrieval

Pub Date : 2008-06-23 DOI: 10.1561/1500000006

G. Cormack

Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam? We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results and their limitations.
In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.

Citations: 296
