A Comprehensive Prediction Method of Visit Priority for Focused Crawler

Xueming Li, Minling Xing, Jiapei Zhang
Published in: 2011 2nd International Symposium on Intelligence Information Processing and Trusted Computing
Publication date: 2011-10-22
DOI: 10.1109/IPTC.2011.14 (https://doi.org/10.1109/IPTC.2011.14)
Citations: 2

Abstract

The purpose of a focused crawler is to crawl topical portions of the Internet precisely. Predicting the visit priorities of candidate URLs whose corresponding pages have not yet been fetched is the determining factor in a focused crawler's ability to retrieve relevant pages. This paper introduces a comprehensive prediction method to address this problem. The method adopts a page partition algorithm, which partitions a page into smaller blocks, and interclass rules, which statistically capture linkage relationships among topic classes, to help the focused crawler cross tunnels and to enlarge its coverage. The URL address, anchor text, and block content are then used to predict visit priority more precisely. Experiments on the target topic of tennis show that a crawler based on this method is more effective than a rule-based crawler in terms of harvest ratio.
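
The abstract does not give the scoring formula, so the following is only a minimal sketch of how URL address, anchor text, and block content could be combined into one visit priority and used to order a crawl frontier. The keyword set `TOPIC_TERMS`, the `WEIGHTS` values, the term-overlap score, and the `Frontier` class are all assumptions made for illustration; they are not the authors' method, and the page partition algorithm and interclass rules are not modeled here. `harvest_ratio` uses the usual definition of relevant pages fetched divided by total pages fetched.

```python
import heapq
import re

# Hypothetical topic vocabulary for the "tennis" target topic (assumption).
TOPIC_TERMS = {"tennis", "atp", "wimbledon", "racket"}

# Hypothetical weights for the three evidence sources (assumption).
WEIGHTS = {"url": 0.2, "anchor": 0.4, "block": 0.4}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(text):
    """Fraction of tokens in `text` that are topic terms (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in TOPIC_TERMS)
    return hits / len(tokens)


def visit_priority(url, anchor_text, block_text):
    """Weighted sum of topic overlap in the URL, anchor text, and block."""
    return (WEIGHTS["url"] * overlap(url)
            + WEIGHTS["anchor"] * overlap(anchor_text)
            + WEIGHTS["block"] * overlap(block_text))


class Frontier:
    """Priority queue of unvisited URLs, highest predicted priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, anchor_text, block_text):
        if url in self._seen:
            return  # each candidate URL is queued at most once
        self._seen.add(url)
        score = visit_priority(url, anchor_text, block_text)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def harvest_ratio(relevant_fetched, total_fetched):
    """Share of fetched pages that were relevant to the topic."""
    return relevant_fetched / total_fetched if total_fetched else 0.0


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("http://example.com/cooking", "Pasta recipes",
                  "Quick dinner ideas")
    frontier.push("http://example.com/tennis/wimbledon",
                  "Wimbledon tennis results",
                  "ATP tennis rankings and racket reviews")
    while frontier:
        print(frontier.pop())
```

In this sketch the block text is whatever content surrounds the link, so a link inside a tennis-related block of an otherwise off-topic page can still receive a high priority.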