{"title":"Periodic update and automatic extraction of web data for creating a Google Earth based tool","authors":"T. Abidin, M. Subianto, T. A. Gani, R. Ferdhiana","doi":"10.1109/ICACSIS.2015.7415157","DOIUrl":null,"url":null,"abstract":"A lot of tropical disease cases that occurred in Indonesia are reported online in Indonesian news portals. Online news portals are now becoming great sources of information because online news articles are updated frequently. A rule-based, combined with machine learning algorithm, to identify the location of the cases has been developed. In this paper, a complete flow to routinely search, crawl, clean, classify, extract, and integrate the extracted entities into Google Earth is presented. The algorithm is started by searching for Indonesian news articles using a set of selected queries and Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. The articles that discuss about tropical disease cases (classified as positive) are further examined to extract the locution of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in an XML keyhole markup language notation to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately 6 minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into an XML keyhole markup language notation from 5 Web articles. In other words, it takes about 72.40 seconds to process a new page.","PeriodicalId":325539,"journal":{"name":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICACSIS.2015.7415157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Many tropical disease cases occurring in Indonesia are reported in Indonesian online news portals. These portals have become valuable sources of information because online news articles are updated frequently. A rule-based approach, combined with a machine learning algorithm, has been developed to identify the locations of the reported cases. In this paper, a complete flow to routinely search, crawl, clean, classify, and extract Web data, and to integrate the extracted entities into Google Earth, is presented. The algorithm begins by searching for Indonesian news articles using a set of selected queries and the Google Site Search API, and then crawling them. After the articles are crawled, they are cleaned and classified. The articles that discuss tropical disease cases (classified as positive) are further examined to extract the location of the incidence and to determine the sentences containing the date of occurrence and the number of casualties. The extracted entities are then stored in a relational database and annotated in the XML-based Keyhole Markup Language (KML) to create a geographic visualization in Google Earth. The evaluation shows that it takes approximately six minutes to search, crawl, clean, classify, extract, and annotate the extracted entities into KML from five Web articles; in other words, about 72.40 seconds to process a new page.
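The abstract describes a search → crawl → clean → classify → extract → annotate pipeline. The sketch below is a minimal, hypothetical rendering of that flow in Python, not the authors' implementation: the query set, function names, regular expression, SQLite schema, and coordinates are all illustrative assumptions, and the paper's classifier and rule-based/ML extractor are replaced by stubs. Only the final KML step, built with the standard-library ElementTree module, reflects a concrete detail from the abstract (KML output for Google Earth).

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Example Indonesian disease queries; the paper's actual query set is not given.
QUERIES = ["demam berdarah", "malaria", "kasus chikungunya"]

def search_articles(query):
    """Stub for the search step (the paper uses the Google Site Search API)."""
    return []  # would return candidate article URLs

def crawl_and_clean(url):
    """Stub: fetch the page and strip HTML boilerplate, returning plain text."""
    return ""

def is_positive(text):
    """Stub classifier: does the article report tropical disease cases?"""
    return any(q in text.lower() for q in QUERIES)

def extract_entities(text):
    """Stub for the rule-based + ML extractor: location of the incidence,
    date of occurrence, and number of casualties."""
    m = re.search(r"(\d+)\s+(?:kasus|korban)", text)  # e.g. "120 kasus"
    if not m:
        return None
    return {"location": "UNKNOWN", "date": "UNKNOWN", "cases": int(m.group(1)),
            "lon": 95.32, "lat": 5.55}  # placeholder coordinates

def to_kml(records):
    """Annotate extracted records as KML Placemarks for Google Earth."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    for r in records:
        pm = ET.SubElement(doc, "Placemark")
        ET.SubElement(pm, "name").text = f"{r['location']} ({r['date']})"
        ET.SubElement(pm, "description").text = f"{r['cases']} reported cases"
        point = ET.SubElement(pm, "Point")
        # KML coordinates use longitude,latitude order.
        ET.SubElement(point, "coordinates").text = f"{r['lon']},{r['lat']}"
    return ET.tostring(kml, encoding="unicode")

def run_pipeline(db):
    """Run one periodic update: search, crawl, classify, extract, store, annotate."""
    records = []
    for query in QUERIES:
        for url in search_articles(query):
            text = crawl_and_clean(url)
            if not is_positive(text):
                continue  # skip articles not reporting disease cases
            rec = extract_entities(text)
            if rec is None:
                continue
            db.execute(
                "INSERT INTO cases(location, date, cases) VALUES (?, ?, ?)",
                (rec["location"], rec["date"], rec["cases"]))
            records.append(rec)
    return to_kml(records)

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")  # stand-in for the paper's relational database
    db.execute("CREATE TABLE cases(location TEXT, date TEXT, cases INTEGER)")
    print(run_pipeline(db))
```

The resulting KML string can be saved as a .kml file and opened in Google Earth, which renders each Placemark at its coordinates; scheduling run_pipeline (e.g., via cron) would provide the periodic update the title refers to.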