Apache Spark Based Analytics of Squid Proxy Logs

D. Mishra, Salim Pathan, C. Murthy
Published in: 2018 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), 2018-12-01
DOI: https://doi.org/10.1109/ANTS.2018.8710044
Citations: 4

Abstract

The Internet is now integral to how organizations operate. Internet traffic must be monitored closely to detect threats and malicious activity, which can damage an organization's reputation and lead to data loss. One way to do this is to monitor the logs of critical applications such as proxy servers, which contain crucial information about Internet activity. Log data is often huge and keeps growing, and forensic analysis of an event requires historical data as well as current data. This makes efficient, fast storage and retrieval a major problem. Traditional RDBMS technologies fail in such situations, but big data technologies such as Apache Hadoop and Apache Spark have made the task feasible. In this paper, we propose a Spark-based system for analyzing Squid proxy logs. Using this system, we generate statistics such as the top domains accessed and the top users in order to study traffic behavior within an organization and to detect malicious activity. We further study how the proposed system's performance varies with increasing data volume and with Spark parameters such as the number of executors, the number of executor cores, and executor memory. From our experimental study, we conclude that log analysis with Spark is extremely fast, with no significant performance variation observed as data volume increases. The challenging task, however, is selecting Spark parameters that give optimal performance.
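The "top domains accessed" and "top users" statistics mentioned in the abstract can be illustrated with a small sketch. This is not the authors' system: it is plain Python rather than Spark, and it assumes the logs use Squid's default native access.log layout (time, elapsed, client, action/status, bytes, method, URL, user, hierarchy/peer, content type), which the abstract does not state. The sample lines, the `parse_line` and `top_n` helpers, and the fallback from a missing user name to the client address are all made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up sample lines, assumed to follow Squid's default native log layout:
# time elapsed client action/status bytes method URL user hierarchy/peer type
SAMPLE_LOG = """\
1543622400.101    120 10.0.0.5 TCP_MISS/200 5120 GET http://example.com/index.html alice HIER_DIRECT/93.184.216.34 text/html
1543622401.250     80 10.0.0.6 TCP_HIT/200 2048 GET http://example.com/logo.png bob HIER_NONE/- image/png
1543622402.007    300 10.0.0.5 TCP_TUNNEL/200 9000 CONNECT mail.example.org:443 alice HIER_DIRECT/203.0.113.7 -
1543622403.900     45 10.0.0.7 TCP_DENIED/403 350 GET http://blocked.example.net/ - HIER_NONE/- text/html
"""


def parse_line(line):
    """Return a dict with the user and domain of one log line, or None if malformed."""
    fields = line.split()
    if len(fields) < 10:
        return None
    client, method, url, user = fields[2], fields[5], fields[6], fields[7]
    if method == "CONNECT":
        # CONNECT requests log "host:port" instead of a full URL.
        domain = url.rsplit(":", 1)[0]
    else:
        domain = urlsplit(url).hostname or ""
    # Assumption: when no user name was logged ("-"), identify by client address.
    who = user if user != "-" else client
    return {"user": who, "domain": domain.lower()}


def top_n(records, key, n):
    """Count records by the given key and return the n most frequent values."""
    return Counter(r[key] for r in records).most_common(n)


records = [r for r in map(parse_line, SAMPLE_LOG.splitlines()) if r is not None]
print(top_n(records, "domain", 1))
print(top_n(records, "user", 1))
```

In a Spark job the same idea would presumably be expressed as a parse step followed by a group-and-count over the distributed log data; the counting logic itself is unchanged.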