Ravi K U Rakesh, Boda Ye, D. Roden, Catherine Beazley, Karan Gadiya, Brendan Abraham, Donald E. Brown, M. Veeraraghavan
2019 Systems and Information Engineering Design Symposium (SIEDS), published 2019-04-01. DOI: 10.1109/SIEDS.2019.8735629 (https://doi.org/10.1109/SIEDS.2019.8735629)
An Autonomous Labeling Pipeline for Intrusion Detection on Enterprise Networks
The volume of cyberattacks has grown exponentially over the last half-decade and shows no sign of slowing. Attacks are also evolving rapidly and becoming increasingly sophisticated. Cybersecurity companies and academics alike have turned to machine learning to build models that learn data-driven rules for threat detection. However, these methods require a substantial amount of training data, and many enterprises lack the infrastructure to label their own network traffic for supervised learning. The labeling problem is further complicated by the frequent reassignment of IP addresses to new hosts. In this paper, we lay a foundation for an autonomous traffic labeling pipeline that incorporates three different sources of ground truth and requires minimal manual intervention. We apply the labeling pipeline to network traffic data acquired from the University of Virginia. We process the network traffic with Zeek, a popular network monitoring framework that provides aggregated statistics about the packets exchanged between a source and a destination over a given time interval. The labeling pipeline also synthesizes data from a network of honeypots compiled by the Duke STINGAR project, a series of nine blacklists, and a whitelist called Cisco Umbrella. We show, using cluster, port, and IP-location analyses, that a labeling methodology that combines the different data sources in an ensemble is better than one that uses any single source alone. The labeling methodology proposed in the paper will aid enterprise network administrators in building robust intrusion detection systems.
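The idea of combining the three ground-truth sources can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the sources are combined, so the `Sighting` record, the `label_connection` function, the precedence rule (malicious evidence outranks the whitelist), and the `max_age` freshness window are all assumptions made for illustration. The freshness window is one hypothetical way to account for IP addresses being reassigned to new hosts.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Sighting:
    """One ground-truth observation of an IP address (hypothetical record)."""
    ip: str
    source: str     # "honeypot", "blacklist", or "whitelist"
    seen_at: float  # Unix timestamp of the observation


def label_connection(
    src_ip: str,
    ts: float,
    sightings: Iterable[Sighting],
    max_age: float = 86_400.0,  # assumed freshness window, in seconds
) -> str:
    """Label one connection record by combining all three sources.

    Only sightings within `max_age` seconds of the connection time count,
    because an IP address may have been reassigned to a different host.
    """
    recent_sources = {
        s.source
        for s in sightings
        if s.ip == src_ip and abs(ts - s.seen_at) <= max_age
    }
    # Assumed precedence: any recent malicious evidence outranks the whitelist.
    if recent_sources & {"honeypot", "blacklist"}:
        return "malicious"
    if "whitelist" in recent_sources:
        return "benign"
    return "unknown"


# Made-up observations using documentation-only IP ranges.
sightings = [
    Sighting("203.0.113.5", "honeypot", 1_000.0),
    Sighting("198.51.100.7", "whitelist", 1_000.0),
    Sighting("203.0.113.9", "blacklist", 0.0),
]

labels = [
    label_connection("203.0.113.5", 1_500.0, sightings),
    label_connection("198.51.100.7", 1_500.0, sightings),
    label_connection("203.0.113.9", 200_000.0, sightings),  # stale sighting
    label_connection("192.0.2.1", 1_500.0, sightings),      # never observed
]
print(labels)
```

In this sketch a stale blacklist entry yields `unknown` rather than `malicious`, which reflects the reassignment concern raised in the abstract; a real pipeline would need to choose the window and the precedence rule from its own data.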