Optimized Focused Crawling for Web Page Classification

Nazifa Fatima, M. Faheem, Muhammad Ziad Nayyer Dar
DOI: 10.1109/ICEPECC57281.2023.10209473
Published in: 2023 International Conference on Energy, Power, Environment, Control, and Computing (ICEPECC)
Publication date: 2023-03-08
URL: https://doi.org/10.1109/ICEPECC57281.2023.10209473
Citations: 0

Abstract

A geographical location is usually found by consulting a map, and a Web search engine plays a similar role for the World Wide Web (WWW): its main job is to retrieve relevant Web pages for its end user. These pages are collected by Web crawlers, which are automated programs. Crawling starts from a seed list of URLs managed in a queue. The page behind each seed URL is fetched in turn, its hyperlinks are extracted, and those relevant to the archiving task are added to the queue. The process continues until the required number of URLs has been fetched. A focused crawler is a type of Web crawler that retrieves only the Web pages relevant to a topic specified by the user. The topic is usually given as keywords or as a set of example documents, against which the focused crawler decides whether a Web page is relevant. Focused crawlers use automated Web page classification to separate relevant pages from irrelevant ones. In this paper we use an automated Web page classifier based on a modified Genetic Algorithm (GA). Keywords serve as the feature set for classification, and the crawled Web pages are labeled as relevant or irrelevant. The GA selects the best features using a cosine similarity function, and the classifier then uses these selected features to identify relevant Web pages. With keywords as features, the approach achieves better precision, recall, accuracy, and F1 scores.
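The crawl loop described in the abstract (seed queue, fetch, relevance check, link extraction) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names, the `threshold` value, the toy in-memory "web", and the choice to expand links only from pages judged relevant are all assumptions made for the example.

```python
import math
import re
from collections import Counter, deque


def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def focused_crawl(seeds, fetch, topic_keywords, threshold=0.2, max_pages=10):
    """Breadth-first crawl that only follows links found on relevant pages.

    fetch(url) must return (page_text, outgoing_links). It is a stand-in
    (assumption) for a real HTTP fetcher plus HTML link extractor.
    """
    topic = term_vector(" ".join(topic_keywords))
    queue = deque(seeds)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        if cosine_similarity(term_vector(text), topic) < threshold:
            continue  # irrelevant page: do not expand its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "a": ("solar energy storage and solar panels", ["b", "c"]),
    "b": ("football scores and match results", ["d"]),
    "c": ("battery storage for solar energy systems", ["d"]),
    "d": ("energy storage sizing for solar power", []),
}


def toy_fetch(url):
    """Look up a page in the toy web instead of issuing an HTTP request."""
    return TOY_WEB[url]


result = focused_crawl(["a"], toy_fetch, ["solar", "energy", "storage"])
print(result)  # ['a', 'c', 'd']
```

Page "b" shares no terms with the topic, so it is fetched but neither kept nor expanded. Page "d" is still reached through the relevant page "c".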
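The abstract states that a GA selects the best keyword features using cosine similarity, but it does not give the encoding, the fitness function, or the modification applied to the GA. The sketch below therefore shows only a generic GA for keyword selection under stated assumptions. Each individual is a binary mask over candidate keywords. The fitness (mean cosine similarity to relevant pages minus mean similarity to irrelevant ones), the tournament selection, the single-point crossover, the bit-flip mutation, and all sample data are hypothetical choices, not the paper's method.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fitness(mask, pages, labels):
    """Assumed fitness: mean similarity to relevant pages minus
    mean similarity to irrelevant pages."""
    relevant = [cosine(mask, p) for p, y in zip(pages, labels) if y == 1]
    irrelevant = [cosine(mask, p) for p, y in zip(pages, labels) if y == 0]
    return sum(relevant) / len(relevant) - sum(irrelevant) / len(irrelevant)


def tournament(population, pages, labels, rng):
    """Pick the fitter of two randomly chosen individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a, pages, labels) >= fitness(b, pages, labels) else b


def select_features(pages, labels, generations=30, pop_size=20,
                    mutation_rate=0.1, seed=0):
    """Evolve a binary keyword mask; returns the best mask found."""
    rng = random.Random(seed)
    n = len(pages[0])
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=lambda m: fitness(m, pages, labels))
    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1 = tournament(population, pages, labels, rng)
            p2 = tournament(population, pages, labels, rng)
            cut = rng.randrange(1, n)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability mutation_rate per gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=lambda m: fitness(m, pages, labels))
    return best


# Hypothetical data: keyword counts per page, 1 = relevant, 0 = irrelevant.
KEYWORDS = ["solar", "energy", "storage", "football", "match"]
PAGES = [
    [2, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 0, 2, 1],
    [0, 1, 0, 1, 2],
    [0, 0, 0, 1, 1],
]
LABELS = [1, 1, 1, 0, 0, 0]

best_mask = select_features(PAGES, LABELS)
selected = [k for k, g in zip(KEYWORDS, best_mask) if g]
print(selected)
```

Because the best mask is copied unchanged into each new generation, its fitness never decreases from one generation to the next. The fixed `seed` makes runs repeatable.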
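The four reported metrics can be computed from the relevant/irrelevant labels using their standard definitions. In the sketch below the function name and the label lists are made up for illustration and are not results from the paper.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for binary labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical ground truth and classifier output for eight crawled pages.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics equal 0.75.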