Development of Focused Crawlers for Building Large Punjabi News Corpus

Journal of ICT Research and Applications (IF 0.6, Q4, Computer Science, Information Systems). Pub Date: 2021-12-28. DOI: 10.5614/itbj.ict.res.appl.2021.15.3.1

Gurjot Singh Mahi, A. Verma

Citations: 0

Abstract

Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. Their use is not limited to search engines: they are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The crawlers were developed to extract quality text articles and add them to a local repository for use in further research. They were implemented in the Python programming language and were used to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.
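The abstract does not describe the crawler internals, so the following is only a minimal sketch of what a focused crawler of this kind might look like, not the authors' implementation. It assumes that article text sits in `<p>` elements and that "focused" means staying on the seed site's host. The names `ArticleParser`, `crawl`, and `fetch_http` are hypothetical, and only the Python standard library is used.

```python
import hashlib
import urllib.error
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse


class ArticleParser(HTMLParser):
    """Collects paragraph text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Normalise whitespace and keep only non-empty paragraphs.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def fetch_http(url, timeout=10):
    """Download a page as text; return None on failure (assumes UTF-8 pages)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError):
        return None


def crawl(seed, fetch, out_dir, max_pages=100):
    """Breadth-first crawl limited to the seed's host; saves text per page."""
    host = urlparse(seed).netloc
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    queue, seen = [seed], {seed}
    fetched = saved = 0
    while queue and fetched < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        parser = ArticleParser()
        parser.feed(html)
        if parser.paragraphs:
            # One text file per page, named by a hash of its URL.
            name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
            (out_dir / name).write_text("\n".join(parser.paragraphs), encoding="utf-8")
            saved += 1
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            # The "focus" here: follow only links on the seed's host.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

A call such as `crawl(seed_url, fetch_http, "corpus")` would write one text file per page that contains paragraph text. Passing the fetch function as a parameter keeps the crawl logic testable without network access. A production crawler would also need politeness delays, `robots.txt` handling, and site-specific extraction rules, none of which are shown here.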

Source journal

Journal of ICT Research and Applications (Computer Science, Information Systems)

CiteScore

1.60

Self-citation rate

0.00%

Articles published

Review time

24 weeks

Journal description: Journal of ICT Research and Applications welcomes full research articles in the area of Information and Communication Technology from the following subject areas: Information Theory, Signal Processing, Electronics, Computer Networks, Telecommunication, Wireless & Mobile Computing, Internet Technology, Multimedia, Software Engineering, Computer Science, Information Systems, and Knowledge Management. Authors are invited to submit articles that have not been published previously and are not under consideration elsewhere.