PoN: Open source solution for real-time data analysis

2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC). Pub Date: 2016-07-06. DOI: 10.1109/DIPDMWC.2016.7529409

Nikitha Johnsirani Venkatesan, Earl Kim, D. Shin


Citations: 1

Abstract



With rapid innovation and a growing Internet population, petabytes of information are generated every second. Processing and analysing data at this scale is tedious, and the volume of real-time data keeps growing. Nearly 80% of the data is unstructured, and analysing unstructured data in real time is very challenging. Traditional business intelligence (BI) tools perform best only with a pre-defined schema, but most real-time data are logs with no defined schema, and queries over such large datasets take a long time. During streaming of real-time data, much unwanted information is extracted from the data source, causing system overhead and increasing the cost of construction and maintenance. Every second, new data streams about what is going on in the world keep accumulating in the system. Gathering and processing these data is essential for preparing a vital report. In this paper, we propose Piece of News (PoN), an end-to-end solution that uses appropriate Hadoop components for real-time data analytics. Our aim is to extract health data from general news data so that outbreaks can be predicted in real time. Rather than collecting all the news, we filter only the important news based on a certain threshold, thus reducing cost. We compare historical data with real-time data, which allows prompt action because earlier outbreaks are already known from the previous data. Going one step further, dangerous outbreaks could even be detected before anyone else in the world notices them. In addition to real-time analytics using Hadoop components, we ran queries over the collected news dataset using Hive and Pig, and we present a comparison of their performance.
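The abstract does not say how the importance threshold is computed, so the following is only a minimal sketch of threshold-based filtering under stated assumptions. The keyword weights in `HEALTH_TERMS`, the `NewsItem` type, the `health_score` function, and the default threshold of 3 are all hypothetical and are not taken from the paper. The sketch runs in plain Python rather than on Hadoop.

```python
from dataclasses import dataclass

# Hypothetical keyword weights; the paper does not list its scoring terms.
HEALTH_TERMS = {"outbreak": 3, "virus": 2, "infection": 2, "hospital": 1, "fever": 1}


@dataclass
class NewsItem:
    title: str
    body: str


def health_score(item: NewsItem) -> int:
    """Sum the weights of health terms found in the item (case-insensitive)."""
    words = (item.title + " " + item.body).lower().split()
    # Strip surrounding punctuation so "cluster." and "cluster" match alike.
    return sum(HEALTH_TERMS.get(w.strip(".,;:!?\"'()"), 0) for w in words)


def filter_news(items, threshold=3):
    """Keep only the items whose score reaches the (assumed) threshold."""
    return [item for item in items if health_score(item) >= threshold]


if __name__ == "__main__":
    stream = [
        NewsItem("Virus outbreak reported", "Officials confirm a new infection cluster."),
        NewsItem("Stock markets rally", "Shares rose after the earnings report."),
        NewsItem("Hospital opens new wing", "The hospital added 40 beds."),
    ]
    for item in filter_news(stream):
        print(item.title)
```

Discarding low-scoring items before storage is what keeps downstream cost down in this sketch: only the filtered subset would need to be written out and queried later.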
