{"title":"Efficient focused crawling based on best first search","authors":"S. Rawat, D. R. Patil","doi":"10.1109/IADCC.2013.6514347","DOIUrl":null,"url":null,"abstract":"The World Wide Web continues to grow at an exponential rate, so fetching information about a special-topic is gaining importance which poses exceptional scaling challenges for general-purpose crawlers and search engines. This paper describes a web crawling approach based on best first search. As the goal of a focused crawler is to selectively seek out pages that are relevant to given keywords. Rather than collecting and indexing all available web documents to be able to answer all possible queries, a focused crawler analyze its crawl boundary to hit upon the links that are likely to be most relevant for the crawl, and avoids irrelevant links of the document. This leads to significant savings in hardware as well as network resources and also helps keep the crawl more up-to-date. To accomplish such goal-directed crawling, we select top most k relevant documents for a given query and then expand the most promising link chosen according to link score, to circumvent irrelevant regions of the web.","PeriodicalId":325901,"journal":{"name":"2013 3rd IEEE International Advance Computing Conference (IACC)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 3rd IEEE International Advance Computing Conference (IACC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IADCC.2013.6514347","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20
Abstract
The World Wide Web continues to grow at an exponential rate, so fetching information about a special-topic is gaining importance which poses exceptional scaling challenges for general-purpose crawlers and search engines. This paper describes a web crawling approach based on best first search. As the goal of a focused crawler is to selectively seek out pages that are relevant to given keywords. Rather than collecting and indexing all available web documents to be able to answer all possible queries, a focused crawler analyze its crawl boundary to hit upon the links that are likely to be most relevant for the crawl, and avoids irrelevant links of the document. This leads to significant savings in hardware as well as network resources and also helps keep the crawl more up-to-date. To accomplish such goal-directed crawling, we select top most k relevant documents for a given query and then expand the most promising link chosen according to link score, to circumvent irrelevant regions of the web.