Design and implementation of competent web crawler and indexer using web services

D. K. Santhosh Kumar, M. Kamath
2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies
Published: 2014-05-08
DOI: 10.1109/ICACCCT.2014.7019393 (https://doi.org/10.1109/ICACCCT.2014.7019393)
Citations: 2

Abstract

Today the internet has become a part of everyday human life. Retrieving the information a user requests is the job of a search engine, which in turn relies on a web crawler. Designing and developing a competent web crawler is a challenging task. This paper proposes a web crawler and indexer. The WebCrawler consists of crawler services and indexer services, both realized as web services. The crawler and indexer services communicate using XML, SOAP and WSDL. The crawler service fetches web pages and parses them to retrieve all hyperlinks, then continues the same process recursively using a breadth-first strategy. The output of the crawler service is downloaded and passed as input to the indexer service via web service messages. The indexer service then parses the HTML pages, removes stop words, and stems keywords as pre-processing steps. Finally, the result is stored as an inverted index.
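
The pipeline described in the abstract (breadth-first link traversal, then stop-word removal, stemming and inverted-index construction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the in-memory `PAGES` dictionary stands in for HTTP fetching, the `stem` function is a naive suffix stripper rather than a real stemming algorithm, the stop-word list is a tiny placeholder, and plain function calls replace the SOAP/WSDL messaging between services. All names here are hypothetical.

```python
from collections import deque, defaultdict
from html.parser import HTMLParser
import re

# Hypothetical in-memory "web" (URL -> HTML); a real crawler would fetch over HTTP.
PAGES = {
    "a": '<a href="b">crawling</a> <a href="c">the web</a>',
    "b": '<a href="a">back</a> indexing documents',
    "c": "crawlers and indexes",
}

# Placeholder stop-word list (assumption; the paper does not give one here).
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def parse(html):
    """Return (links, text) extracted from an HTML string."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.links, " ".join(parser.text)


def crawl_bfs(seed, pages):
    """Return reachable URLs in breadth-first order starting from seed."""
    visited, order, queue = {seed}, [], deque([seed])
    while queue:
        url = queue.popleft()
        if url not in pages:  # unknown page: nothing to fetch
            continue
        order.append(url)
        links, _ = parse(pages[url])
        for link in links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


def stem(word):
    """Naive suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(urls, pages):
    """Map each stemmed, non-stop-word term to the set of URLs containing it."""
    index = defaultdict(set)
    for url in urls:
        _, text = parse(pages[url])
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                index[stem(token)].add(url)
    return index


crawled = crawl_bfs("a", PAGES)
index = build_inverted_index(crawled, PAGES)
```

In this sketch the list returned by `crawl_bfs` plays the role of the crawler service's output, and passing it to `build_inverted_index` mirrors the hand-off to the indexer service that the paper performs through web service messages.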