Nazifa Fatima, M. Faheem, Muhammad Ziad Nayyer Dar
2023 International Conference on Energy, Power, Environment, Control, and Computing (ICEPECC), published 2023-03-08. DOI: 10.1109/ICEPECC57281.2023.10209473 (https://doi.org/10.1109/ICEPECC57281.2023.10209473)
Optimized Focused Crawling for Web page Classification
A geographical location is usually found by following a map, and a Web search engine plays a similar role for the Web. The main job of a Web search engine is to retrieve relevant Web pages from the World Wide Web (WWW) for its end user. These Web pages are retrieved by Web crawlers, which are automated programs. The Web crawling process starts with a seed list of URLs managed in a queue. The Web page for each seed URL is fetched, one after another. The hyperlinks on each fetched page are extracted and added to the queue if they are relevant to the archiving task. This process continues until the required number of URLs has been fetched. A focused crawler is a type of Web crawler that extracts only those Web pages that are relevant to a topic specified by the user. A topic is usually specified by keywords or by some exemplary documents, on the basis of which the focused Web crawler decides whether a Web page is relevant to the topic. In focused Web crawlers, automated Web page classification is used to separate relevant Web pages from irrelevant ones. In this paper, we use a modified Genetic Algorithm (GA) based automated Web page classifier. In this method, keywords are used as the feature set for the classification process. The crawled Web pages are labeled as relevant or irrelevant. The best features are selected by the Genetic Algorithm using the cosine similarity function. These selected features are then used by the classifier to classify relevant Web pages. Using keywords as features, better precision, recall, accuracy, and F1 scores are achieved.
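The queue-driven crawling process and the keyword-based relevance decision described above can be illustrated with a minimal sketch. It is not the paper's GA-based classifier: relevance is decided here by a plain cosine similarity between a page's term-frequency vector and the topic keywords, compared against a fixed threshold. The names `focused_crawl`, `fetch`, `TOY_WEB`, and the `threshold` value are assumptions introduced for illustration, and the "Web" is a small in-memory dictionary rather than real HTTP fetching.

```python
import math
import re
from collections import Counter, deque


def term_vector(text):
    """Bag-of-words term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def focused_crawl(seeds, fetch, topic_keywords, threshold=0.2, max_pages=10):
    """Crawl from the seed URLs, expanding only pages judged relevant.

    fetch(url) is a hypothetical callback that returns
    (page_text, outgoing_links), or None if the page is unavailable.
    Returns a dict mapping each fetched URL to True (relevant) or False.
    """
    topic = term_vector(" ".join(topic_keywords))
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # avoid queueing the same URL twice
    labels = {}
    while queue and len(labels) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        score = cosine_similarity(term_vector(text), topic)
        labels[url] = score >= threshold
        if not labels[url]:
            continue  # links on irrelevant pages are not followed
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return labels


# Toy in-memory "Web" (assumed data): URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("solar power grid energy storage", ["b", "c"]),
    "b": ("football match results and league table", ["d"]),
    "c": ("wind energy and solar power forecasts", ["d"]),
    "d": ("energy power control systems", []),
}

labels = focused_crawl(["a"], TOY_WEB.get, ["energy", "power", "solar"])
for url, relevant in labels.items():
    print(url, "relevant" if relevant else "irrelevant")
```

In this toy run, page `b` shares no terms with the topic keywords, so it is labeled irrelevant and its links are not followed; page `d` is still reached through the relevant page `c`. A GA-based variant, as the abstract describes, would instead search for the subset of keywords that serves best as the feature set, with cosine similarity used in feature selection.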