Mateusz Gniewkowski, H. Maciejewski, T. Surmacz, Wiktor Walentynowicz
{"title":"Sec2vec: Anomaly Detection in HTTP Traffic and Malicious URLs","authors":"Mateusz Gniewkowski, H. Maciejewski, T. Surmacz, Wiktor Walentynowicz","doi":"10.1145/3555776.3577663","DOIUrl":null,"url":null,"abstract":"In this paper, we show how methods known from Natural Language Processing (NLP) can be used to detect anomalies in HTTP requests and malicious URLs. Most of the current solutions focusing on a similar problem are either rule-based or trained using manually selected features. Modern NLP methods, however, have great potential in capturing a deep understanding of samples and therefore improving the classification results. Other methods, which rely on a similar idea, often ignore the interpretability of the results, which is so important in machine learning. We are trying to fill this gap. In addition, we show to what extent the proposed solutions are resistant to concept drift. In our work, we compare three different vectorization methods: simple BoW, fastText, and the current state-of-the-art language model RoBERTa. The obtained vectors are later used in the classification task. In order to explain our results, we utilize the SHAP method. We evaluate the feasibility of our methods on four different datasets: CSIC2010, UNSW-NB15, MALICIOUSURL, and ISCX-URL2016. The first two are related to HTTP traffic, the other two contain malicious URLs. The results we show are comparable to others or better, and most importantly - interpretable.","PeriodicalId":42971,"journal":{"name":"Applied Computing Review","volume":null,"pages":null},"PeriodicalIF":0.4000,"publicationDate":"2023-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Computing Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555776.3577663","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 1
Abstract
In this paper, we show how methods known from Natural Language Processing (NLP) can be used to detect anomalies in HTTP requests and malicious URLs. Most of the current solutions focusing on a similar problem are either rule-based or trained using manually selected features. Modern NLP methods, however, have great potential in capturing a deep understanding of samples and therefore improving the classification results. Other methods, which rely on a similar idea, often ignore the interpretability of the results, which is so important in machine learning. We are trying to fill this gap. In addition, we show to what extent the proposed solutions are resistant to concept drift. In our work, we compare three different vectorization methods: simple BoW, fastText, and the current state-of-the-art language model RoBERTa. The obtained vectors are later used in the classification task. In order to explain our results, we utilize the SHAP method. We evaluate the feasibility of our methods on four different datasets: CSIC2010, UNSW-NB15, MALICIOUSURL, and ISCX-URL2016. The first two are related to HTTP traffic, the other two contain malicious URLs. The results we show are comparable to others or better, and most importantly - interpretable.