Adaptive Focused Website Segment Crawler

Tanaphol Suebchua, A. Rungsawang, H. Yamana
Published in: 2016 19th International Conference on Network-Based Information Systems (NBiS)
Publication date: 2016-12-16
DOI: 10.1109/NBiS.2016.5 (https://doi.org/10.1109/NBiS.2016.5)
Citations: 5

Abstract

Focused web crawlers have become indispensable for vertical search engines, which provide search services over specialized datasets. These vertical search engines have to collect specific web pages from the web space, whereas general-purpose search engines such as Google and Bing gather web pages from all over the world. The central problem in focused crawling research is how to collect specific web pages with minimal computing resources. We previously addressed this problem by proposing a focused crawling strategy that uses an ensemble machine learning classifier to find groups of relevant web pages, referred to as relevant website segments. In this paper, we enhance the proposed crawler as follows: 1) We increase the accuracy of predicting website segments by preparing two predictors: one learned from features extracted from relevant source website segments and another learned from features extracted from irrelevant ones. The idea is that these two types of source website segments may have different characteristics. 2) We also propose a noisy data elimination method for use when updating the predictor incrementally during the crawling process. A preliminary experiment shows that our enhanced crawler outperforms a crawler equipped with neither of these approaches by up to around 12%.
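
The two enhancements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `CentroidPredictor`, `TwoPredictorModel`, and `noise_margin` are hypothetical, a toy nearest-centroid classifier stands in for the ensemble classifier, and the noise rule (skip an example when the current predictor confidently disagrees with its observed label) is an assumption, since the abstract does not specify the elimination method.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class CentroidPredictor:
    """Toy incremental classifier: one running-mean centroid per label.

    Stand-in for the ensemble classifier mentioned in the abstract.
    """
    sums: dict = field(default_factory=dict)    # label -> per-feature sums
    counts: dict = field(default_factory=dict)  # label -> number of examples

    def update(self, features, label):
        total = self.sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        self.counts[label] = self.counts.get(label, 0) + 1

    def centroid(self, label):
        n = self.counts[label]
        return [s / n for s in self.sums[label]]

    def predict(self, features):
        """Return (label, margin); (None, 0.0) until both labels were seen."""
        if len(self.counts) < 2:
            return None, 0.0
        d = {lab: dist(features, self.centroid(lab)) for lab in self.counts}
        best = min(d, key=d.get)
        other = max(d, key=d.get)
        return best, d[other] - d[best]


class TwoPredictorModel:
    """One predictor per source-segment type (relevant / irrelevant)."""

    def __init__(self, noise_margin=0.5):
        # Key: was the source website segment relevant?
        self.predictors = {True: CentroidPredictor(), False: CentroidPredictor()}
        self.noise_margin = noise_margin  # assumed threshold, not from the paper

    def predict(self, source_is_relevant, features):
        label, _ = self.predictors[source_is_relevant].predict(features)
        return label

    def update(self, source_is_relevant, features, observed_label):
        """Incrementally learn from a crawled segment; return False if skipped."""
        predictor = self.predictors[source_is_relevant]
        predicted, margin = predictor.predict(features)
        # Assumed noise rule: drop examples the predictor confidently contradicts.
        if (predicted is not None and predicted != observed_label
                and margin > self.noise_margin):
            return False
        predictor.update(features, observed_label)
        return True


model = TwoPredictorModel(noise_margin=0.5)
model.update(True, [1.0, 0.0], True)    # relevant source -> relevant target
model.update(True, [0.0, 1.0], False)   # relevant source -> irrelevant target
kept = model.update(True, [0.95, 0.05], False)  # contradicts the predictor
prediction = model.predict(True, [0.9, 0.1])
```

In this sketch the third update is skipped as noise, and a link leaving an irrelevant source segment would be scored by the separate `False` predictor, which has not been trained yet and therefore returns `None`.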