Apache Spark Based Analytics of Squid Proxy Logs

2018 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS). Pub Date: 2018-12-01. DOI: 10.1109/ANTS.2018.8710044

D. Mishra, Salim Pathan, C. Murthy


Citations: 4



Abstract

The Internet is now integral to how an organization operates. It is vital to monitor Internet traffic closely in order to detect threats and malicious activity, which may not only damage an organization's reputation but also lead to data loss. One way to achieve this is to monitor the logs of critical applications such as proxy servers, which contain crucial information about Internet activity. Log data is often huge and ever growing. Moreover, forensic analysis of an event requires historical data as well as current data. This makes efficient, fast storage and retrieval of the data a major problem. Traditional RDBMS technologies fail in such situations, but big data technologies such as Apache Hadoop and Apache Spark have made the task feasible. In this paper, we propose a Spark-based system for analyzing Squid proxy logs. Using this system, we generate statistics such as the top domains accessed and the top users in order to study traffic behavior within the organization and detect malicious activity. We further study how the proposed system's performance varies with increasing data volume and with Spark parameters such as the number of executors, the number of executor cores, and executor memory. From our experimental study we conclude that log analysis with Spark is extremely fast, with no significant performance variation observed as data volume increases. The challenging task, however, is selecting Spark parameters that give optimal performance.
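To make the kind of statistics named in the abstract concrete, the following is a minimal plain-Python sketch of a "top domains" and "top users" count over Squid access-log lines. It is not the authors' system and does not use Spark: it stands in for the per-record parsing and the group-and-count step that a Spark job would distribute across executors. It assumes Squid's default native `access.log` layout (timestamp, elapsed time, client address, result code, bytes, method, URL, user, hierarchy code, content type). The sample lines, the function names `parse_line`, `top_n`, and `spark_submit_args`, and the example values are all hypothetical. The last helper only assembles the standard `spark-submit` flags for the three parameters the abstract mentions; it makes no claim about which values the paper found best.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample lines in Squid's native access.log layout (assumed format).
SAMPLE_LOG = [
    "1543622400.123    215 10.0.0.5 TCP_MISS/200 5120 GET "
    "http://example.com/index.html alice HIER_DIRECT/93.184.216.34 text/html",
    "1543622401.456     98 10.0.0.6 TCP_HIT/200 1024 GET "
    "http://example.com/logo.png bob HIER_NONE/- image/png",
    "1543622402.789    340 10.0.0.5 TCP_TUNNEL/200 8192 CONNECT "
    "example.org:443 alice HIER_DIRECT/203.0.113.7 -",
    "malformed line",
]


def parse_line(line):
    """Parse one native-format line; return None if it has too few fields."""
    fields = line.split()
    if len(fields) < 10:
        return None
    method, url, user = fields[5], fields[6], fields[7]
    if method == "CONNECT":
        # CONNECT requests log "host:port" rather than a full URL.
        host = url.rsplit(":", 1)[0]
    else:
        host = urlsplit(url).hostname or ""
    return {"client": fields[2], "user": user, "domain": host.lower()}


def top_n(lines, key, n=3):
    """Count occurrences of `key` over parseable lines; return the n most common."""
    records = (parse_line(line) for line in lines)
    counts = Counter(
        r[key] for r in records
        if r is not None and r[key] not in ("", "-")  # "-" marks a missing value
    )
    return counts.most_common(n)


def spark_submit_args(num_executors, executor_cores, executor_memory):
    """Build spark-submit flags for the three tuning parameters in the abstract.

    --num-executors applies to cluster managers such as YARN; the values
    passed in are the caller's choice, not recommendations from the paper.
    """
    return [
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
    ]


if __name__ == "__main__":
    print(top_n(SAMPLE_LOG, "domain"))  # [('example.com', 2), ('example.org', 1)]
    print(top_n(SAMPLE_LOG, "user"))    # [('alice', 2), ('bob', 1)]
    print(spark_submit_args(4, 2, "4g"))
```

In a Spark version of this sketch, `parse_line` would typically run as a map over the input lines and the `Counter` would be replaced by a distributed group-by-and-count, which is where the executor settings start to affect run time.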

