Periodic update and automatic extraction of web data for creating a Google Earth based tool

T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana
{"title":"Periodic update and automatic extraction of web data for creating a Google Earth based tool","authors":"T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana","doi":"10.1109/ICACSIS.2015.7415157","DOIUrl":null,"url":null,"abstract":"A lot of tropical disease cases that occurred in Indonesia are reported online in Indonesian news portals. Online news portals are now becoming great sources of information because online news articles are updated frequently. A rule-based, combined with machine learning algorithm, to identify the location of the cases has been developed. In this paper, a complete flow to routinely search, crawl, clean, classify, extract, and integrate the extracted entities into Google Earth is presented. The algorithm is started by searching for Indonesian news articles using a set of selected queries and Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. The articles that discuss about tropical disease cases (classified as positive) are further examined to extract the locution of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in an XML keyhole markup language notation to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into an XML keyhole markup language notation from 5 Web articles. 
In other words, it takes about 72.40 seconds to process a new page.","PeriodicalId":325539,"journal":{"name":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICACSIS.2015.7415157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

Many tropical disease cases occurring in Indonesia are reported online in Indonesian news portals. Online news portals have become valuable sources of information because their articles are updated frequently. A rule-based method, combined with a machine learning algorithm, has been developed to identify the locations of the reported cases. In this paper, a complete pipeline to routinely search, crawl, clean, classify, and extract Web data, and to integrate the extracted entities into Google Earth, is presented. The algorithm starts by searching for Indonesian news articles using a set of selected queries and the Google Site Search API, and then crawling the results. After the articles are crawled, they are cleaned and classified. The articles that discuss tropical disease cases (classified as positive) are further examined to extract the location of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in Keyhole Markup Language (KML), an XML notation, to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into KML from 5 Web articles; in other words, about 72.40 seconds per new page.
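The final annotation step described above converts each extracted entity (location, date of occurrence, number of casualties) into KML so Google Earth can render it as a placemark. The paper does not give the authors' actual schema, so the following is a minimal sketch using Python's standard library, with hypothetical field values and a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def entity_to_kml(location, lon, lat, date, casualties):
    """Build a minimal KML document with one Placemark for an extracted case.

    A real pipeline would emit one Placemark per database row; this sketch
    handles a single entity for clarity.
    """
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    pm = ET.SubElement(doc, "Placemark")
    ET.SubElement(pm, "name").text = location
    ET.SubElement(pm, "description").text = (
        f"Date of occurrence: {date}; casualties: {casualties}"
    )
    point = ET.SubElement(pm, "Point")
    # KML coordinate order is longitude,latitude[,altitude]
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

# Hypothetical extracted entity, not from the paper's data
kml_text = entity_to_kml("Banda Aceh", 95.3222, 5.5483, "2015-10-01", 3)
```

The resulting string can be written to a `.kml` file and opened directly in Google Earth; the relational-database storage and the batch loop over classified articles are omitted here.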