Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Dennis Fetterly, M. Manasse, Marc Najork
{"title":"Spam, damn spam, and statistics: using statistical analysis to locate spam web pages","authors":"Dennis Fetterly, M. Manasse, Marc Najork","doi":"10.1145/1017074.1017077","DOIUrl":null,"url":null,"abstract":"The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call \"web spam\", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.","PeriodicalId":93360,"journal":{"name":"Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web. International Workshop on Exploratory Search in Databases and the Web (5th : 2018 : Houston, Tex.)","volume":"198 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2004-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"350","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web. International Workshop on Exploratory Search in Databases and the Web (5th : 2018 : Houston, Tex.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1017074.1017077","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 350

Abstract

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
垃圾邮件,该死的垃圾邮件和统计:使用统计分析来定位垃圾网页
搜索引擎对商业网站的重要性日益增加,导致了一种我们称之为“垃圾网页”的现象,即存在的网页只是为了误导搜索引擎(错误地)将用户引导到某些网站。网络垃圾邮件对用户和搜索引擎来说都是一个麻烦:用户很难找到他们需要的信息,而搜索引擎不得不处理膨胀的语料库,这反过来又导致每次查询的成本增加。因此,搜索引擎有强烈的动机从他们的索引中清除垃圾网页。我们建议可以通过统计分析来识别一些垃圾网页:某些类别的垃圾网页,特别是那些由机器生成的垃圾网页,在某些属性上与一般网页的属性不同。我们研究了各种这样的属性,包括链接结构、页面内容和页面演变,并发现这些属性的统计分布中的异常值极有可能是由web垃圾邮件引起的。本文描述了我们所研究的特性,给出了我们所观察到的统计分布,并展示了哪些异常值与网络垃圾邮件高度相关。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Exploring Pros and Cons of Ranked Entities with COMPETE Strategies for Detection of Correlated Data Streams Exploring Genomic Datasets: from Batch to Interactive and Back Discovery and Creation of Rich Entities for Knowledge Bases Recommendations for Explorations based on Graphs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1