Finding common ground among experts' opinions on data clustering: With applications in malware analysis

2014 IEEE 30th International Conference on Data Engineering Pub Date : 2014-05-19 DOI:10.1109/ICDE.2014.6816636

Guanhua Yan

{"title":"Finding common ground among experts' opinions on data clustering: With applications in malware analysis","authors":"Guanhua Yan","doi":"10.1109/ICDE.2014.6816636","DOIUrl":null,"url":null,"abstract":"Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together.We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the amount of data objects chosen to be clustered.","PeriodicalId":159130,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 30th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2014.6816636","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Data clustering is a basic technique for knowledge discovery and data mining. As the volume of data grows significantly, data clustering becomes computationally prohibitive and resource demanding, and sometimes it is necessary to outsource these tasks to third party experts who specialize in data clustering. The goal of this work is to develop techniques that find common ground among experts' opinions on data clustering, which may be biased due to the features or algorithms used in clustering. Our work differs from the large body of existing approaches to consensus clustering, as we do not require all data objects be grouped into clusters. Rather, our work is motivated by real-world applications that demand high confidence in how data objects - if they are selected - are grouped together.We formulate the problem rigorously and show that it is NP-complete. We further develop a lightweight technique based on finding a maximum independent set in a 3-uniform hypergraph to select data objects that do not form conflicts among experts' opinions. We apply our proposed method to a real-world malware dataset with hundreds of thousands of instances to find malware clusters based on how multiple major AV (Anti-Virus) software classify these samples. Our work offers a new direction for consensus clustering by striking a balance between the clustering quality and the amount of data objects chosen to be clustered.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

在数据聚类的专家意见中找到共同点:在恶意软件分析中的应用

数据聚类是知识发现和数据挖掘的一项基本技术。随着数据量的显著增长，数据聚类的计算能力和资源需求变得令人望而却步，有时需要将这些任务外包给专门从事数据聚类的第三方专家。这项工作的目标是开发技术，在专家对数据聚类的意见中找到共同点，这些意见可能由于聚类中使用的特征或算法而有偏见。我们的工作不同于大量现有的共识聚类方法，因为我们不需要将所有数据对象分组到聚类中。更确切地说，我们的工作是由现实世界的应用程序驱动的，这些应用程序要求对数据对象(如果它们是被选择的)如何分组有很高的信心。我们严格地公式化了这个问题，并证明了它是np完全的。我们进一步开发了一种基于在三均匀超图中寻找最大独立集的轻量级技术，以选择不形成专家意见冲突的数据对象。我们将我们提出的方法应用于具有数十万实例的真实恶意软件数据集，根据多个主要反病毒软件对这些样本的分类方式来查找恶意软件集群。我们的工作通过在聚类质量和选择聚类的数据对象数量之间取得平衡，为共识聚类提供了一个新的方向。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2014 IEEE 30th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Managing uncertainty in spatial and spatio-temporal data Locality-sensitive operators for parallel main-memory database clusters KnowLife: A knowledge graph for health and life sciences We can learn your #hashtags: Connecting tweets to explicit topics A demonstration of MNTG - A web-based road network traffic generator