在Hadoop框架下对TCPdump数据集高效应用K-means聚类的合适参数进行了分析

Jakrarin Therdphapiyanak, K. Piromsopa
{"title":"在Hadoop框架下对TCPdump数据集高效应用K-means聚类的合适参数进行了分析","authors":"Jakrarin Therdphapiyanak, K. Piromsopa","doi":"10.1109/ECTICON.2013.6559650","DOIUrl":null,"url":null,"abstract":"In this paper, we determined the appropriate number of clusters and the proper amount of entries for applying K-means clustering to TCPdump data set using Apache Mahout/Hadoop framework. We aim at finding suitable configuration for efficiently analyzing large data set in limited amount of time. Our implementation applied Hadoop for large-scale log analysis with data set from KDD'99 competition as test data. With the distributed system framework, we can analyze a whole data set of KDD'99 by first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection work more accurately. For the Kmeans algorithm, a key challenge is to set the appropriate number of the initial cluster (K). Moreover, we discuss whether the number of entries in log files affects the accuracy and detection rate of the system or not. Therefore, our implementation and experimental results describe the appropriate number of cluster and the proper amount of entries in log files. Finally, we show the result of our experiments with accuracy rate and number of initial cluster (K) graph, ROC curve and detection rate and false alarm rate table.","PeriodicalId":273802,"journal":{"name":"2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework\",\"authors\":\"Jakrarin Therdphapiyanak, K. Piromsopa\",\"doi\":\"10.1109/ECTICON.2013.6559650\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we determined the appropriate number of clusters and the proper amount of entries for applying K-means clustering to TCPdump data set using Apache Mahout/Hadoop framework. We aim at finding suitable configuration for efficiently analyzing large data set in limited amount of time. Our implementation applied Hadoop for large-scale log analysis with data set from KDD'99 competition as test data. With the distributed system framework, we can analyze a whole data set of KDD'99 by first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection work more accurately. For the Kmeans algorithm, a key challenge is to set the appropriate number of the initial cluster (K). Moreover, we discuss whether the number of entries in log files affects the accuracy and detection rate of the system or not. Therefore, our implementation and experimental results describe the appropriate number of cluster and the proper amount of entries in log files. Finally, we show the result of our experiments with accuracy rate and number of initial cluster (K) graph, ROC curve and detection rate and false alarm rate table.\",\"PeriodicalId\":273802,\"journal\":{\"name\":\"2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ECTICON.2013.6559650\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ECTICON.2013.6559650","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

摘要

在本文中,我们使用Apache Mahout/Hadoop框架确定了对TCPdump数据集应用K-means聚类的适当数量的集群和适当数量的条目。我们的目标是找到合适的配置,以便在有限的时间内有效地分析大型数据集。我们的实现使用Hadoop进行大规模日志分析,以KDD'99大赛的数据集作为测试数据。在分布式系统框架下,我们可以通过首先应用我们的预处理来分析KDD'99的整个数据集。此外,我们使用异常检测模型进行日志分析。一个关键的挑战是使异常检测工作更准确。对于Kmeans算法,一个关键的挑战是如何设置合适的初始簇数(K)。此外,我们还讨论了日志文件中的条目数是否会影响系统的准确率和检测率。因此,我们的实现和实验结果描述了日志文件中适当数量的集群和适当数量的条目。最后给出了实验结果,包括准确率和初始聚类数(K)图、ROC曲线以及检测率和虚警率表。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
In this paper, we determined the appropriate number of clusters and the proper amount of entries for applying K-means clustering to TCPdump data set using Apache Mahout/Hadoop framework. We aim at finding suitable configuration for efficiently analyzing large data set in limited amount of time. Our implementation applied Hadoop for large-scale log analysis with data set from KDD'99 competition as test data. With the distributed system framework, we can analyze a whole data set of KDD'99 by first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection work more accurately. For the Kmeans algorithm, a key challenge is to set the appropriate number of the initial cluster (K). Moreover, we discuss whether the number of entries in log files affects the accuracy and detection rate of the system or not. Therefore, our implementation and experimental results describe the appropriate number of cluster and the proper amount of entries in log files. Finally, we show the result of our experiments with accuracy rate and number of initial cluster (K) graph, ROC curve and detection rate and false alarm rate table.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
CPW-fed dual wideband using stub split-ring rectangular slot antenna A comparative study on different techniques for Thai part-of-speech tagging Bismuth doped ZnO films as anti-reflection coatings for solar cells Multi-channel Collection Tree Protocol for Wireless Sensor Networks Multi-robot coordination using switching of methods for deriving equilibrium in game theory
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1