Scalable hybrid stream and Hadoop network analysis system

Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering. Pub Date: 2014-03-22. DOI: 10.1145/2568088.2568103

V. Bumgardner, V. Marek


Citations: 27

Abstract

Collections of network traces have long been used in network traffic analysis. Flow analysis can be used in network anomaly discovery, intrusion detection, and, more generally, the discovery of actionable events on the network. The data collected during processing may also be used for prediction and avoidance of traffic congestion, network capacity planning, and the development of software-defined networking rules. As network flow rates increase and new network technologies are introduced on existing hardware platforms, many organizations find themselves either technically or financially unable to generate, collect, and/or analyze network flow data. The continued rapid growth of network trace data requires new methods of scalable data collection and analysis. We report on our deployment of a system, designed and implemented at the University of Kentucky, that supports analysis of network traffic across the enterprise. Our system addresses problems of scale in existing systems by using distributed computing methodologies, and it is based on a combination of stream and batch processing techniques. In addition to collection, stream processing with Storm is used to enrich the data stream with ephemeral environment data. The enriched stream data is then used for event detection and near real-time flow analysis by an in-line complex event processor. Batch processing is performed by the Hadoop MapReduce framework on data stored in HBase BigTable storage. In benchmarks on our 10-node cluster, using actual network data, we were able to stream process over 315k flows/sec. In batch analysis, we were able to process over 2.6M flows/sec with a storage compression ratio of 6.7:1.
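The three stages named in the abstract (stream enrichment, in-line event detection, and batch aggregation) can be illustrated with a minimal single-process sketch. This is not the authors' implementation and uses none of the Storm, HBase, or MapReduce APIs. The `Flow` fields, the use of a lease table (IP address to host name) as the "ephemeral environment data", and the byte-threshold rule are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class Flow:
    """A minimal flow record; the field set is an assumption."""
    src_ip: str
    dst_ip: str
    byte_count: int


@dataclass(frozen=True)
class EnrichedFlow:
    """A flow plus environment data that was valid when the flow was seen."""
    flow: Flow
    src_host: Optional[str]  # None when no lease is known for src_ip


def enrich(flows: Iterable[Flow], leases: Dict[str, str]) -> Iterator[EnrichedFlow]:
    """Stream stage: attach the (hypothetical) lease lookup to each flow."""
    for flow in flows:
        yield EnrichedFlow(flow, leases.get(flow.src_ip))


def source_key(item: EnrichedFlow) -> str:
    """Prefer the enriched host name, fall back to the raw source IP."""
    return item.src_host or item.flow.src_ip


def detect_heavy_sources(stream: Iterable[EnrichedFlow], byte_limit: int) -> List[str]:
    """Event stage: report a source once, when its running total first exceeds byte_limit."""
    totals: Dict[str, int] = defaultdict(int)
    alerted = set()
    events: List[str] = []
    for item in stream:
        key = source_key(item)
        totals[key] += item.flow.byte_count
        if totals[key] > byte_limit and key not in alerted:
            alerted.add(key)
            events.append(key)
    return events


def batch_bytes_per_source(stored: Iterable[EnrichedFlow]) -> Dict[str, int]:
    """Batch stage: map each stored flow to (source, bytes), then reduce by summing."""
    totals: Dict[str, int] = defaultdict(int)
    for item in stored:
        totals[source_key(item)] += item.flow.byte_count
    return dict(totals)
```

In a distributed deployment each function would correspond to a separately scaled component; the sketch only shows how enriched records feed both the near real-time path and the later batch path.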


