Adaptive Focused Website Segment Crawler

2016 19th International Conference on Network-Based Information Systems (NBiS)
Pub Date: 2016-12-16
DOI: 10.1109/NBiS.2016.5

Tanaphol Suebchua, A. Rungsawang, H. Yamana


Citations: 5

Abstract

Focused web crawlers have become indispensable for vertical search engines, which provide search services over specialized datasets. Such engines must collect specific web pages from the web space, whereas general-purpose search engines such as Google and Bing gather web pages from all over the world. The central problem in focused crawling research is how to collect those specific pages with minimal computing resources. We previously addressed this problem by proposing a focused crawling strategy that uses an ensemble machine learning classifier to find groups of relevant web pages, referred to as relevant website segments. In this paper, we enhance that crawler in two ways: 1) We increase the accuracy of website segment prediction by preparing two predictors: one trained on features extracted from relevant source website segments and another trained on features from irrelevant ones. The idea is that these two types of source website segments may exhibit different characteristics. 2) We propose a method for eliminating noisy data when the predictor is updated incrementally during the crawling process. A preliminary experiment shows that the enhanced crawler outperforms a crawler that uses neither of these approaches by up to about 12%.
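The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the features, the classifier, or the noise criterion, so everything below is an assumption made for illustration. The toy `CentroidPredictor`, the `topic_sim` feature, the `noise_margin` threshold, and the rule "discard a sample whose observed label strongly contradicts the current predictor" are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict

Features = Dict[str, float]


@dataclass
class CentroidPredictor:
    """Toy incremental classifier (assumed, not from the paper):
    keeps a running mean feature vector for each label."""
    sums: Dict[bool, Features] = field(default_factory=lambda: {True: {}, False: {}})
    counts: Dict[bool, int] = field(default_factory=lambda: {True: 0, False: 0})

    def update(self, x: Features, label: bool) -> None:
        for key, value in x.items():
            self.sums[label][key] = self.sums[label].get(key, 0.0) + value
        self.counts[label] += 1

    def _distance(self, x: Features, label: bool) -> float:
        n = self.counts[label]
        if n == 0:
            return float("inf")
        keys = set(x) | set(self.sums[label])
        return sum((x.get(k, 0.0) - self.sums[label].get(k, 0.0) / n) ** 2 for k in keys)

    def score(self, x: Features) -> float:
        """Return a value in [0, 1]; higher means 'more likely relevant'."""
        d_rel, d_irr = self._distance(x, True), self._distance(x, False)
        if d_rel == float("inf") and d_irr == float("inf"):
            return 0.5  # untrained: no opinion
        if d_rel == float("inf"):
            return 0.0
        if d_irr == float("inf"):
            return 1.0
        total = d_rel + d_irr
        return 0.5 if total == 0 else d_irr / total


class TwoPredictorModel:
    """One predictor per type of source segment, with a hypothetical noise filter."""

    def __init__(self, noise_margin: float = 0.9) -> None:
        self.from_relevant = CentroidPredictor()
        self.from_irrelevant = CentroidPredictor()
        self.noise_margin = noise_margin  # assumed threshold, not from the paper

    def _pick(self, source_is_relevant: bool) -> CentroidPredictor:
        return self.from_relevant if source_is_relevant else self.from_irrelevant

    def predict(self, x: Features, source_is_relevant: bool) -> float:
        return self._pick(source_is_relevant).score(x)

    def observe(self, x: Features, source_is_relevant: bool, target_is_relevant: bool) -> bool:
        """Incremental update; returns False when the sample is dropped as noise."""
        predictor = self._pick(source_is_relevant)
        trained = predictor.counts[True] > 0 and predictor.counts[False] > 0
        if trained:
            s = predictor.score(x)
            contradiction = (1.0 - s) if target_is_relevant else s
            if contradiction >= self.noise_margin:
                return False  # label strongly disagrees with the model: skip it
        predictor.update(x, target_is_relevant)
        return True


model = TwoPredictorModel(noise_margin=0.9)
model.observe({"topic_sim": 0.9}, source_is_relevant=True, target_is_relevant=True)
model.observe({"topic_sim": 0.1}, source_is_relevant=True, target_is_relevant=False)
kept = model.observe({"topic_sim": 0.9}, source_is_relevant=True, target_is_relevant=False)
```

In this sketch the third observation carries features identical to a known relevant example but is labelled irrelevant, so the filter drops it and `kept` is `False`. Routing each sample by the relevance of its source segment keeps the two training sets separate, which mirrors the abstract's point that links leaving relevant and irrelevant segments may have different characteristics.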


