Title: A Comprehensive Prediction Method of Visit Priority for Focused Crawler
Authors: Xueming Li, Minling Xing, Jiapei Zhang
Published in: 2011 2nd International Symposium on Intelligence Information Processing and Trusted Computing
Publication date: 2011-10-22
DOI: 10.1109/IPTC.2011.14 (https://doi.org/10.1109/IPTC.2011.14)
Citation count: 2
Abstract
The purpose of a focused crawler is to crawl the topic-relevant portions of the Internet precisely. The determining factor in a focused crawler's ability to retrieve more relevant pages is how well it predicts the visit priorities of candidate URLs whose corresponding pages have not yet been fetched. This paper introduces a comprehensive prediction method to address this problem. The method adopts two techniques to help the focused crawler tunnel through off-topic pages and to enlarge its coverage: a page partition algorithm that splits a page into smaller blocks, and interclass rules that statistically capture linkage relationships among the topic classes. In addition, the URL's address, anchor text, and block content are used to predict visit priority more precisely. Experiments were carried out on the target topic of tennis, and the results show that a crawler based on this method achieves a higher harvest ratio than a rule-based crawler.
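The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea: scoring an unfetched URL from its address, its anchor text, and the content of the block that contains the link, then visiting the highest-scoring URLs first. The topic term list, the term-overlap measure, the weights, and the names `visit_priority`, `Frontier`, and `harvest_ratio` are all assumptions made for illustration, not the authors' method. Harvest ratio is computed in its usual sense, as the fraction of fetched pages that are relevant.

```python
import heapq
import re

# Hypothetical topic vocabulary for the "tennis" target topic.
TOPIC_TERMS = {"tennis", "atp", "wimbledon", "racket"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(text, terms=TOPIC_TERMS):
    """Fraction of tokens in `text` that are topic terms (0.0 if empty)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)


def visit_priority(url, anchor_text, block_text, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of three evidence sources; the weights are assumed."""
    w_url, w_anchor, w_block = weights
    return (w_url * overlap(url)
            + w_anchor * overlap(anchor_text)
            + w_block * overlap(block_text))


class Frontier:
    """Priority queue of unvisited URLs, highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


def harvest_ratio(relevant_fetched, total_fetched):
    """Share of fetched pages that are relevant to the topic."""
    return relevant_fetched / total_fetched if total_fetched else 0.0
```

In this sketch, a link whose anchor text and surrounding block both mention topic terms outranks a link found in an unrelated block of the same page, which is the intuition behind partitioning a page into blocks before scoring its links.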