An Autonomous Labeling Pipeline for Intrusion Detection on Enterprise Networks

Ravi K U Rakesh, Boda Ye, D. Roden, Catherine Beazley, Karan Gadiya, Brendan Abraham, Donald E. Brown, M. Veeraraghavan
2019 Systems and Information Engineering Design Symposium (SIEDS), published 2019-04-01
DOI: 10.1109/SIEDS.2019.8735629 (https://doi.org/10.1109/SIEDS.2019.8735629)
Citations: 1

Abstract

The volume of cyberattacks has grown exponentially over the last half-decade and shows no signs of slowing. Attacks are also evolving rapidly and becoming increasingly sophisticated. Cybersecurity companies and academics alike have turned to machine learning to build models that learn data-driven rules for threat detection. However, these methods require a substantial amount of training data, and many enterprises lack the infrastructure to label their own network traffic for supervised learning. The labeling problem is further complicated by the frequent reassignment of IP addresses to new hosts. In this paper, we lay a foundation for an autonomous traffic labeling pipeline that incorporates three different sources of ground truth and requires minimal manual intervention. We apply the labeling pipeline to network traffic data acquired from the University of Virginia. We process the network traffic with Zeek, a popular network monitoring framework that provides aggregated statistics about the packets exchanged between a source and destination over a certain time interval. The labeling pipeline also synthesizes data from a network of honeypots compiled by the Duke STINGAR project, a series of nine blacklists, and a whitelist called Cisco Umbrella. We show, using cluster, port, and IP-location analyses, that a labeling methodology that ensembles the different data sources is better than one that uses any individual source alone. The labeling methodology proposed in the paper will aid enterprise network administrators in building robust intrusion detection systems.
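
The abstract does not spell out how the three ground-truth sources are combined, so the following is only a minimal sketch of one plausible ensembling rule, not the authors' method. Every name in it (`Sighting`, `label_connection`, `max_age`, `min_blacklists`) is hypothetical, as are the assumptions that the whitelist is available as a set of IP addresses, that a honeypot sighting only counts within a fixed time window (to guard against IP reassignment), and that agreement between at least two blacklists is required.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sighting:
    """One honeypot observation of an attacking IP (hypothetical record type)."""
    ip: str
    seen_at: datetime


def label_connection(ip, ts, honeypot_sightings, blacklists, whitelist_ips,
                     max_age=timedelta(days=1), min_blacklists=2):
    """Return 'malicious', 'benign', or 'unknown' for a remote IP seen at time ts.

    Assumed rule (not taken from the paper): malicious evidence wins over
    the whitelist, and honeypot evidence expires after max_age because
    IP addresses may be reassigned to new hosts.
    """
    # Honeypot evidence counts only if it is close in time to the connection.
    recent_honeypot = any(
        s.ip == ip and abs(ts - s.seen_at) <= max_age
        for s in honeypot_sightings
    )
    # Count how many of the blacklists contain this IP.
    blacklist_votes = sum(ip in bl for bl in blacklists)

    if recent_honeypot or blacklist_votes >= min_blacklists:
        return "malicious"
    if ip in whitelist_ips:
        return "benign"
    return "unknown"
```

In this sketch an IP that appears on a single blacklist and nowhere else stays `unknown`, which keeps the malicious class conservative; lowering `min_blacklists` to 1 would trade precision for coverage.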