PoN: Open source solution for real-time data analysis

Nikitha Johnsirani Venkatesan, Earl Kim, D. Shin
2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC). Published 2016-07-06. DOI: 10.1109/DIPDMWC.2016.7529409 (https://doi.org/10.1109/DIPDMWC.2016.7529409)
Citations: 1

Abstract

With rapid innovation and a growing Internet population, petabytes of information are generated every second. Processing and analysing such enormous volumes of data is tedious, and the amount of real-time data keeps growing. Nearly 80% of this data is unstructured, and analysing unstructured data in real time is very challenging. Traditional business intelligence (BI) tools perform best only against a pre-defined schema, but most real-time data consists of logs that have no defined schema, and queries over such large datasets take a long time. While real-time data is streamed, much unwanted information is also extracted from the data source, which adds overhead to the system and increases the cost of construction and maintenance. Every second, new data streams about what is going on in the world accumulate in the system, and gathering and processing this data is essential for preparing a vital report. In this paper, we propose Piece of News (PoN), an end-to-end solution that uses the appropriate Hadoop components for real-time data analytics. Our aim is to extract health data from general news data so that outbreaks can be predicted as soon as they appear. Rather than collecting all the news, we keep only the important news, filtered by a certain threshold, which reduces cost. We compare historical data with real-time data, which allows prompt action because earlier outbreaks are already known from the previous data; going one step further, dangerous outbreaks could even be detected before anyone else in the world notices them. Besides performing real-time analytics with Hadoop components, we ran queries over the collected news dataset using Hive and Pig. Finally, we present a comparison of their performance.
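The filtering and comparison steps described in the abstract can be illustrated with a minimal sketch. It is a plain-Python stand-in, not the Hadoop, Hive, or Pig pipeline used in the paper. The keyword list, the scoring rule, the threshold value, the `factor` parameter, and all function names are assumptions made for illustration; the abstract does not specify how PoN scores or thresholds news items.

```python
import re
from collections import Counter

# Hypothetical keyword list; the abstract does not name the terms PoN uses.
HEALTH_TERMS = {"outbreak", "virus", "infection", "epidemic", "flu", "fever"}


def tokenize(text):
    """Lowercase a news item and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def health_score(text):
    """Count how many word occurrences in the item are health terms."""
    return sum(1 for word in tokenize(text) if word in HEALTH_TERMS)


def filter_news(items, threshold=2):
    """Keep only items whose score reaches the (assumed) threshold."""
    return [item for item in items if health_score(item) >= threshold]


def term_rates(items):
    """Average number of mentions per item for each health term."""
    totals = Counter()
    for item in items:
        totals.update(w for w in tokenize(item) if w in HEALTH_TERMS)
    n = max(len(items), 1)  # avoid division by zero on an empty batch
    return {term: totals[term] / n for term in HEALTH_TERMS}


def rising_terms(historical, realtime, factor=2.0):
    """Terms mentioned at least `factor` times more often per item
    in the real-time batch than in the historical batch."""
    past = term_rates(historical)
    now = term_rates(realtime)
    return sorted(
        t for t in HEALTH_TERMS if now[t] > 0 and now[t] >= factor * past[t]
    )


# Made-up sample data, for illustration only.
news = [
    "Flu outbreak reported in the northern province",
    "Local team wins the championship",
    "New virus infection cases rise as outbreak spreads",
    "Stock markets recover after a fever of selling",
]
historical = [
    "Seasonal flu cases remain low",
    "Flu vaccine campaign begins",
]

important = filter_news(news, threshold=2)
alerts = rising_terms(historical, important)
```

Here `important` holds only the items that mention enough health terms, and `alerts` lists the terms that appear more often in the filtered real-time batch than in the historical one. In a Hadoop deployment the same two steps would presumably run as streaming and batch jobs rather than in-memory list operations.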