Tran Thao Phuong, A. Yamada, Kosuke Murakami, J. Urakawa, Y. Sawaya, A. Kubota
{"title":"Classification of Landing and Distribution Domains Using Whois’ Text Mining","authors":"Tran Thao Phuong, A. Yamada, Kosuke Murakami, J. Urakawa, Y. Sawaya, A. Kubota","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.213","DOIUrl":null,"url":null,"abstract":"Detection of drive-by-download attack has gained a focus in security research since the attack has turned into the most popular and serious threat to web infrastructure. The attack exploits vulnerabilities in web browsers and their extensions for unnoticeably downloading malicious software. Often, the victim is sent through a long chain of redirection operations in order to take down the offending pages. Concretely, the attack is triggered when a user visits a benign webpage that is compromised by the attacker (called landing page) and is inserted some malicious code inside. The user is then automatically redirected to an actual page that installs malware on the user's computer (called distribution page) without his/her consent or knowledge. While there is a large body of works targeting on detection of drive-by download attack, there is little attention on the redirection which is a crucial characteristic of the attack. In this paper, for the first time, we propose an approach to the classification of landing and distribution domains which are important components forming the head and tail of a redirection chain in the attack. The methodology in our approach is to use machine learning for text mining on the registered information of the domains called whois. We intensively implemented our approach with six popular supervised learning algorithms, compared the results and concluded that Linear-based Support Vector Machine and CART algorithm-based Decision Tree are the best models for our dataset which respectively give 98.55% and 99.28% of accuracy, 97.78% and 98.95% of F1 score, 98.35% and 99.45% of average precision.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
Detection of drive-by-download attack has gained a focus in security research since the attack has turned into the most popular and serious threat to web infrastructure. The attack exploits vulnerabilities in web browsers and their extensions for unnoticeably downloading malicious software. Often, the victim is sent through a long chain of redirection operations in order to take down the offending pages. Concretely, the attack is triggered when a user visits a benign webpage that is compromised by the attacker (called landing page) and is inserted some malicious code inside. The user is then automatically redirected to an actual page that installs malware on the user's computer (called distribution page) without his/her consent or knowledge. While there is a large body of works targeting on detection of drive-by download attack, there is little attention on the redirection which is a crucial characteristic of the attack. In this paper, for the first time, we propose an approach to the classification of landing and distribution domains which are important components forming the head and tail of a redirection chain in the attack. The methodology in our approach is to use machine learning for text mining on the registered information of the domains called whois. We intensively implemented our approach with six popular supervised learning algorithms, compared the results and concluded that Linear-based Support Vector Machine and CART algorithm-based Decision Tree are the best models for our dataset which respectively give 98.55% and 99.28% of accuracy, 97.78% and 98.95% of F1 score, 98.35% and 99.45% of average precision.