HAC-T and Fast Search for Similarity in Security

2020 International Conference on Omni-layer Intelligent Systems (COINS). Pub Date: 2020-08-01. DOI: 10.1109/COINS49042.2020.9191381

Jonathan J. Oliver, Muqeet Ali, Josiah Hagen


Citations: 7

Abstract




Similarity digests have gained popularity in many security applications, such as blacklisting/whitelisting and finding similar variants of malware. TLSH has been shown to be particularly good at hunting similar malware and is more resistant to evasion than other similarity digests such as ssdeep and sdhash. Searching and clustering are fundamental tools that help security analysts and security operations center (SOC) operators hunt and analyze malware. Current approaches to clustering malware are not scalable enough to keep up with the vast amount of malware and goodware available in the wild. In this paper, we present techniques for fast search and clustering of TLSH hash digests, which can help analysts inspect large amounts of malware/goodware. Our approach builds on fast nearest-neighbor search techniques to construct a tree-based index that performs fast search over TLSH hash digests. The tree-based index is used in our threshold-based Hierarchical Agglomerative Clustering (HAC-T) algorithm, which is able to cluster digests in a scalable manner. Our clustering technique can cluster digests in O(n log n) time on average. We performed an empirical evaluation comparing our approach with many standard and recent clustering techniques. We demonstrate that our approach is much more scalable while still producing good cluster quality. We measured cluster quality using purity on 10 million samples obtained from VirusTotal. Using labels from five major anti-virus vendors (Kaspersky, Microsoft, Symantec, Sophos, and McAfee), we obtained high purity scores in the range 0.97 to 0.98, which demonstrates the effectiveness of the proposed method.
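
To make the two measurable ideas in the abstract concrete (grouping digests under a distance threshold, then scoring the grouping with purity), here is a minimal Python sketch. It is not the authors' HAC-T implementation: the abstract does not spell out the linkage rule or the tree index, so this sketch assumes single-linkage grouping and uses a brute-force O(n^2) pair scan in place of the tree-based nearest-neighbor search. The `hamming` function is a toy stand-in for the TLSH distance, and the names `threshold_cluster`, `purity`, the sample strings, and the family labels are all hypothetical.

```python
from collections import Counter, defaultdict


def hamming(x, y):
    """Toy distance (assumption): count differing positions. Not the TLSH distance."""
    if len(x) != len(y):
        raise ValueError("inputs must have equal length")
    return sum(a != b for a, b in zip(x, y))


def threshold_cluster(items, distance, threshold):
    """Single-linkage grouping: any pair within `threshold` ends up in one cluster.

    Brute-force pair scan, O(n^2) distance calls. A tree-based index, as the
    abstract describes, would replace this scan to reach better scaling.
    Returns one cluster id per item (the index of the cluster's root item).
    """
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i
    return [find(i) for i in range(len(items))]


def purity(cluster_ids, labels):
    """Fraction of samples carrying the majority label of their own cluster."""
    if len(cluster_ids) != len(labels):
        raise ValueError("cluster_ids and labels must have the same length")
    if not labels:
        raise ValueError("purity is undefined for an empty sample set")
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id][label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical data: short strings standing in for digests, made-up family labels.
digests = ["aaaa", "aaab", "zzzz", "zzzy", "aabb"]
families = ["famA", "famA", "famB", "famB", "famC"]

cluster_ids = threshold_cluster(digests, hamming, threshold=1)
print(cluster_ids)                    # [0, 0, 2, 2, 0]
print(purity(cluster_ids, families))  # 0.8
```

In this toy run, "aabb" is chained into the first cluster through "aaab", so that cluster holds two famA samples and one famC sample, and purity drops to 4/5. A purity in the 0.97 to 0.98 range, as reported in the abstract, means that almost every sample shares the majority vendor label of its cluster.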
