Data Analysis Using Representation Theory and Clustering Algorithms

WSEAS TRANSACTIONS ON COMPUTERS | Pub Date: 2020-12-18 | DOI: 10.37394/23205.2020.19.38

Suboh Alkhushayni, T. Choi, Du’a Alzaleq


Citations: 6

Abstract

This work aims to expand knowledge in the area of data analysis through both persistent homology and representations of directed graphs. Specifically, we examined how homology cluster groups can be analyzed using agglomerative hierarchical clustering algorithms and methods. Additionally, the Wine data, which is available in RStudio, was analyzed using various clustering algorithms such as Hierarchical Clustering, K-Means Clustering, and PAM Clustering. The goal of the analysis was to find out which clustering method is appropriate for a given numerical data set. By testing the data, we sought the optimal clustering algorithm among agglomerative hierarchical clustering and these three methods: K-Means, PAM, and Random Forest. By comparing each model's accuracy value with cultivar coefficients, we concluded that K-Means methods are the most helpful when working with numerical variables. On the other hand, PAM clustering and Gower with random forest are the most beneficial approaches when working with categorical variables. With proper analysis, all of these tests can determine the optimal number of clusters for a given data set. With these results, we can apply our method to several industrial areas such as clinical, business, and others. For example, in a clinical setting, patients can be grouped by shared disease, required therapy, and other characteristics. Similarly, in business, one can expect to obtain several clustered groups based on marginal profit, marginal cost, or other economic indicators.
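The comparison described in the abstract, clustering numerical data with several methods and scoring each against known class labels, can be illustrated with a minimal sketch. This is not the authors' code. It is written in Python rather than R, and it uses a small made-up two-dimensional data set in place of the Wine data. The function names (`farthest_first`, `kmeans`, `kmedoids`, `accuracy`) are hypothetical. The k-medoids routine is a simplified alternating variant, not the full PAM swap procedure, and the seeding rule is an assumption chosen to keep the run deterministic.

```python
import itertools
import math

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at point 0,
    then repeatedly add the point farthest from all seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        nxt = max(range(len(points)),
                  key=lambda i: min(math.dist(points[i], points[s]) for s in seeds))
        seeds.append(nxt)
    return seeds

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, k, iters=50):
    """Lloyd-style K-Means: alternate nearest-center assignment and mean update."""
    centers = [points[i] for i in farthest_first(points, k)]
    labels = assign(points, centers)
    for _ in range(iters):
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def kmedoids(points, k, iters=50):
    """Simplified k-medoids: each center must be an actual data point.
    This alternating update is NOT the full PAM swap search."""
    medoids = farthest_first(points, k)
    labels = assign(points, [points[m] for m in medoids])
    for _ in range(iters):
        for c in range(k):
            members = [i for i, l in enumerate(labels) if l == c]
            if members:
                medoids[c] = min(
                    members,
                    key=lambda i: sum(math.dist(points[i], points[j]) for j in members))
        new_labels = assign(points, [points[m] for m in medoids])
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def accuracy(labels, truth, k):
    """Fraction of points whose cluster matches the known class, taking the
    best one-to-one relabeling of cluster ids (cluster ids are arbitrary)."""
    best = 0
    for perm in itertools.permutations(range(k)):
        hits = sum(perm[l] == t for l, t in zip(labels, truth))
        best = max(best, hits)
    return best / len(truth)

# Toy stand-in for a labeled numerical data set (not the Wine data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

km_acc = accuracy(kmeans(points, 3), truth, 3)
pam_acc = accuracy(kmedoids(points, 3), truth, 3)
print("K-Means accuracy:", km_acc)
print("k-medoids accuracy:", pam_acc)
```

On well-separated toy data such as this, both methods recover the known groups, so the sketch only demonstrates the scoring procedure. It says nothing about which method performs better on the Wine data.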

