Design and implementation of competent web crawler and indexer using web services

2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies Pub Date : 2014-05-08 DOI:10.1109/ICACCCT.2014.7019393

D. K. Santhosh Kumar, M. Kamath


Citations: 2

Abstract

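The internet has become a part of everyday life. Retrieving the information a user requests is the job of a search engine, which in turn relies on a web crawler. Designing and developing a competent web crawler is a challenging task. This paper proposes a web crawler and indexer. The WebCrawler system consists of crawler services and indexer services, both realized as web services that communicate using XML, SOAP, and WSDL. The crawler service fetches web pages and parses them to retrieve all hyperlinks, then repeats the same process recursively using a breadth-first strategy. The crawler's output is downloaded and passed as input to the indexer service through web service messages. The indexer service then parses the HTML pages and, as pre-processing steps, removes stop words and stems the keywords. Finally, the result is stored as an inverted index.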
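The pipeline described in the abstract (breadth-first crawling, link extraction, stop-word removal, stemming, inverted index) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper connects the crawler and indexer through SOAP/WSDL web services, whereas this sketch uses plain function calls. The in-memory PAGES dictionary, the example URLs, the STOP_WORDS set, and the naive stem function are all assumptions made so the example runs without network access; a real system would use HTTP fetches and an established stemmer such as Porter's.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web" (URL -> HTML); stands in for real HTTP fetches.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a> crawling the web',
    "http://b.example/": '<a href="http://c.example/">C</a> indexed and crawled',
    "http://c.example/": '<a href="http://a.example/">A</a> indexing of the web',
}

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}


class PageParser(HTMLParser):
    """Collects href targets of <a> tags and the text content of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Return (list of hyperlinks, concatenated text) for one HTML page."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text)


def crawl_bfs(seed, fetch, max_pages=100):
    """Breadth-first crawl: the seed first, then pages one link away, and so on."""
    queue, seen, order = deque([seed]), {seed}, []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched; skip it
            continue
        order.append(url)
        links, _ = parse(html)
        for link in links:
            if link not in seen:  # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return order


def stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(urls, fetch):
    """Map each stemmed, non-stop-word term to the set of URLs containing it."""
    index = defaultdict(set)
    for url in urls:
        _, text = parse(fetch(url))
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                index[stem(token)].add(url)
    return index


order = crawl_bfs("http://a.example/", PAGES.get)
index = build_inverted_index(order, PAGES.get)
```

Here crawl_bfs plays the role of the crawler service and build_inverted_index the role of the indexer service; passing the list of crawled URLs between them stands in for the web service message exchange described in the abstract.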

