Periodic update and automatic extraction of web data for creating a Google Earth based tool

T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana
{"title":"Periodic update and automatic extraction of web data for creating a Google Earth based tool","authors":"T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana","doi":"10.1109/ICACSIS.2015.7415157","DOIUrl":null,"url":null,"abstract":"A lot of tropical disease cases that occurred in Indonesia are reported online in Indonesian news portals. Online news portals are now becoming great sources of information because online news articles are updated frequently. A rule-based, combined with machine learning algorithm, to identify the location of the cases has been developed. In this paper, a complete flow to routinely search, crawl, clean, classify, extract, and integrate the extracted entities into Google Earth is presented. The algorithm is started by searching for Indonesian news articles using a set of selected queries and Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. The articles that discuss about tropical disease cases (classified as positive) are further examined to extract the locution of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in an XML keyhole markup language notation to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into an XML keyhole markup language notation from 5 Web articles. 
In other words, it takes about 72.40 seconds to process a new page.","PeriodicalId":325539,"journal":{"name":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICACSIS.2015.7415157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Many tropical disease cases occurring in Indonesia are reported online in Indonesian news portals. Online news portals have become valuable sources of information because their articles are updated frequently. A rule-based method, combined with a machine learning algorithm, has been developed to identify the locations of the reported cases. In this paper, a complete pipeline to routinely search, crawl, clean, classify, and extract Web data, and to integrate the extracted entities into Google Earth, is presented. The algorithm starts by searching for Indonesian news articles using a set of selected queries and the Google Site Search API, and then crawling the results. After the articles are crawled, they are cleaned and classified. The articles that discuss tropical disease cases (classified as positive) are further examined to extract the location of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in Keyhole Markup Language (KML), an XML notation, to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into KML from 5 Web articles; in other words, about 72.40 seconds per new page.
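The final annotation step described above converts each extracted entity (location, date of occurrence, number of casualties) into KML so Google Earth can render it as a placemark. The paper does not give the authors' actual schema, so the following is a minimal sketch using Python's standard library, with hypothetical field values and a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def entity_to_kml(location, lon, lat, date, casualties):
    """Build a minimal KML document with one Placemark for an extracted case.

    A real pipeline would emit one Placemark per database row; this sketch
    handles a single entity for clarity.
    """
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    pm = ET.SubElement(doc, "Placemark")
    ET.SubElement(pm, "name").text = location
    ET.SubElement(pm, "description").text = (
        f"Date of occurrence: {date}; casualties: {casualties}"
    )
    point = ET.SubElement(pm, "Point")
    # KML coordinate order is longitude,latitude[,altitude]
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

# Hypothetical extracted entity, not from the paper's data
kml_text = entity_to_kml("Banda Aceh", 95.3222, 5.5483, "2015-10-01", 3)
```

The resulting string can be written to a `.kml` file and opened directly in Google Earth; the relational-database storage and the batch loop over classified articles are omitted here.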