Comparative Analysis of Improved Dirichlet Process Mixture Model

Malaysian Journal of Fundamental and Applied Sciences (IF 0.8, Q3, Multidisciplinary Sciences) | Pub Date: 2023-12-04 | DOI: 10.11113/mjfas.v19n6.3062
Lili Wu, P. Fam, Majid Khan Majahar Ali, Ying Tian, Mohd. Tahir Ismail, Siti Zulaikha Mohd Jamaludin

Abstract

Advances in information technology mean that large amounts of data are generated every day in fields such as engineering, healthcare, finance, anomaly detection, image recognition, and artificial intelligence. Data at this scale is difficult to analyze accurately and to classify appropriately. Traditional clustering methods require the number of clusters to be specified in advance and are mostly distance-based, so they cannot effectively account for correlations between different indicators in high-dimensional, multi-source data. Moreover, the number of clusters cannot adjust automatically when new data is generated. To improve cluster analysis of high-dimensional, multi-source data in a big data environment, this study uses non-parametric mixture models based on distribution clustering, which do not require the number of clusters to be specified and can update automatically with the data. By combining Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbour Embedding (t-SNE) with the non-parametric Bayesian Dirichlet Process Mixture Model (DPMM), a Bayesian non-parametric PCA model (PCA-DPMM) and a Bayesian non-parametric t-SNE model (TSNE-DPMM) are proposed. The Chinese restaurant process of the DPMM is used for sampling by introducing a finite normal mixture distribution. Clustering results on the iris dataset are compared and analyzed. The accuracy of DPMM and TSNE-DPMM reaches 0.97, while PCA-DPMM achieves a maximum accuracy of only 0.94. Across different numbers of iterations, the accuracy of TSNE-DPMM ranges from 0.92 to 0.97, that of DPMM from 0.66 to 0.97, and that of PCA-DPMM from 0.73 to 0.94. The proposed TSNE-DPMM therefore maintains accuracy while giving more stable clustering results. Future research can explore improving the model by incorporating deep learning algorithms, among others, and applying TSNE-DPMM to data analysis in other fields. These efforts should help address the challenges of analyzing high-dimensional, multi-source data in a big data environment and of extracting valuable information from it.
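
The Chinese restaurant process mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only draws cluster assignments from the process itself, without the normal mixture likelihood or any dimensionality reduction. The function name sample_crp, the concentration parameter alpha, and the seed are assumptions introduced here for illustration; the 150 customers simply match the number of samples in the iris dataset.

```python
import random


def sample_crp(n_customers, alpha, seed=0):
    """Draw one seating arrangement from a Chinese restaurant process.

    Hypothetical helper for illustration only; alpha (> 0) is the
    concentration parameter, which the abstract does not specify.
    """
    rng = random.Random(seed)
    assignments = []   # table (cluster) index chosen by each customer
    table_sizes = []   # number of customers currently at each table
    for _ in range(n_customers):
        # An existing table k is chosen with weight equal to its size;
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = sample_crp(150, alpha=1.0)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of tables is not fixed in advance and tends to grow with both alpha and the number of customers, which is the property that lets a DPMM avoid specifying the number of clusters beforehand.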