Apache Spark Based Analytics of Squid Proxy Logs
D. Mishra, Salim Pathan, C. Murthy
2018 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), 2018-12-01
DOI: 10.1109/ANTS.2018.8710044 (https://doi.org/10.1109/ANTS.2018.8710044)

The Internet is now integral to the working of any organization. It is vital to monitor Internet traffic closely in order to detect threats and malicious activities, which may not only damage an organization’s reputation but also lead to data loss. One way to achieve this is to monitor the logs of critical applications such as a proxy server, which contain crucial information about Internet activity. Log data is often huge and keeps growing. Moreover, forensic analysis of an event requires historical data as well as current data. This makes efficient, fast storage and retrieval of data a major problem. Traditional RDBMS technologies fail in such situations, but big data technologies such as Apache Hadoop and Apache Spark have made the task feasible. In this paper, we propose a Spark-based system for the analysis of Squid proxy logs. Using this system, we generate statistics such as top domains accessed and top users in order to study traffic behavior within the organization and detect malicious activity. We further study how the proposed system’s performance varies with increasing data volume and with Spark parameters such as the number of executors, the number of executor cores, and executor memory. From our experimental study we conclude that log analysis with Spark is extremely fast, with no significant performance variation observed as data volume increases. The challenging task, however, is selecting Spark parameters for optimal performance.
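The kind of statistic mentioned above (top domains accessed, top users) can be illustrated with a small sketch. This is not the authors' implementation: it is plain single-machine Python standing in for the per-line parsing and grouped counting that a Spark job would distribute. It assumes Squid's default "native" access.log layout (time, elapsed, client, code/status, bytes, method, URL, user, hierarchy/peer, content type), which the abstract does not specify. The sample lines and the names `parse_line` and `top_n` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical sample lines, assuming Squid's default "native" log layout:
# time elapsed client code/status bytes method URL user hierarchy/peer type
SAMPLE_LOG = """\
1543622400.101 120 10.0.0.5 TCP_MISS/200 5120 GET http://example.com/index.html alice DIRECT/93.184.216.34 text/html
1543622401.250 80 10.0.0.6 TCP_HIT/200 2048 GET http://example.com/logo.png bob NONE/- image/png
1543622402.400 300 10.0.0.5 TCP_TUNNEL/200 9000 CONNECT example.org:443 alice DIRECT/93.184.216.35 -
1543622403.000 15 10.0.0.7 TCP_DENIED/403 350 GET http://blocked.test/ - NONE/- text/html
malformed line
"""


def parse_line(line):
    """Parse one log line into a dict, or return None if it is malformed."""
    parts = line.split()
    if len(parts) < 10:
        return None
    try:
        size = int(parts[4])
    except ValueError:
        return None
    method, url = parts[5], parts[6]
    if method == "CONNECT":
        # CONNECT requests log "host:port" instead of a full URL.
        domain = url.rsplit(":", 1)[0]
    else:
        domain = urlsplit(url).hostname or ""
    return {
        "client": parts[2],
        "user": parts[7],
        "domain": domain.lower(),
        "bytes": size,
    }


def top_n(records, key, n=10):
    """Return the n most frequent values of `key`, skipping empty or '-' values."""
    counts = Counter(r[key] for r in records if r[key] and r[key] != "-")
    return counts.most_common(n)


# Parse every line and drop the ones that could not be parsed.
records = [r for r in map(parse_line, SAMPLE_LOG.splitlines()) if r is not None]

print("Top domains:", top_n(records, "domain", 3))
print("Top users:", top_n(records, "user", 3))
```

In a Spark job the same two steps would typically be a per-line parse followed by a grouped count and a sort. The three tuning parameters named in the abstract correspond to the Spark configuration properties `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`.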