Phishing website detection using Latent Dirichlet Allocation and AdaBoost

2012 IEEE International Conference on Intelligence and Security Informatics. Pub Date: 2012-06-11. DOI: 10.1109/ISI.2012.6284100

Venkatesh Ramanathan, H. Wechsler


Citations: 30

Abstract



One of the ways criminals steal identities in cyberspace is phishing. Attackers host phishing websites that resemble legitimate websites and entice users to click on hyperlinks that direct them to these fake sites. Attackers use the fake sites to capture personal information such as logins, passwords and social security numbers from innocent victims, which they later use to commit crimes. We propose here a robust methodology to detect phishing websites that employs a topic modeling technique, Latent Dirichlet Allocation, for semantic analysis and AdaBoost for classification. The methodology is a content-driven approach that is device independent and language neutral. The website content served to mobile and desktop clients is collected by an intelligent web crawler. Website content that is not in English is translated to English using Google's language translator. A topic model is built from the translated content of desktop and mobile clients. The phishing website classifier is built using (i) the topic distribution probabilities found by Latent Dirichlet Allocation as features and (ii) the AdaBoost voting technique. Experiments were conducted on a large public corpus of website data containing 47500 phishing websites and 52500 good websites. Results show that our method achieves an F-measure of 99%.
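The classification stage described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-site topic proportions have already been inferred by Latent Dirichlet Allocation, uses one-feature threshold stumps as the weak learner (the abstract does not say which weak learner was used), and runs on a made-up toy set of six sites with three hypothetical topics. The names `train_adaboost`, `predict`, and `f_measure` are chosen here for illustration only.

```python
import math


def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over threshold stumps; labels must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for pol in (1, -1):
                    preds = [pol if row[f] >= thr else -pol for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, preds)
        err, f, thr, pol, preds = best
        # Clip the error so the logarithm below stays finite.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        # Raise the weight of misclassified sites, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all stumps; returns +1 (phishing) or -1 (good)."""
    score = sum(a * (pol if row[f] >= thr else -pol)
                for a, f, thr, pol in ensemble)
    return 1 if score >= 0 else -1


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the +1 (phishing) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical topic proportions per site (each row sums to 1).
# These values are invented for illustration, not taken from the paper.
X = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.10, 0.30],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.15, 0.70, 0.15],
]
y = [1, 1, 1, -1, -1, -1]  # +1 = phishing, -1 = good

ensemble = train_adaboost(X, y)
predictions = [predict(ensemble, row) for row in X]
print("F-measure on the toy training set:", f_measure(y, predictions))
```

Using topic proportions rather than raw words as features keeps the classifier input small and fixed in size regardless of page length. The toy score above says nothing about the 99% F-measure reported on the full corpus.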
