Ryan Barron (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Maksim E. Eren (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Manish Bhattarai (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Ismael Boureima (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Cynthia Matuszek (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Boian S. Alexandrov (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA)
arXiv - CS - Performance · Journal Article · Published 2024-07-26 · Citations: 0 · DOI: arxiv-2407.19125 (https://doi.org/arxiv-2407.19125)
Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection
In several Machine Learning (ML) clustering and dimensionality reduction
approaches, such as non-negative matrix factorization (NMF), RESCAL, and
K-Means clustering, users must select a hyper-parameter k to define the number
of clusters or components that yield an ideal separation of samples or clean
clusters. This selection, while difficult, is crucial to avoid overfitting or
underfitting the data. Several ML applications use scoring methods (e.g.,
Silhouette and Davies-Bouldin scores) to evaluate the cluster pattern
stability for a specific k. The score is calculated for different trials over a
range of k, and the ideal k is heuristically selected as the value before the
model starts overfitting, indicated by a drop or increase in the score
resembling an elbow curve plot. While the grid-search method can be used to
accurately find a good k value, visiting a range of k can become time-consuming
and computationally resource-intensive. In this paper, we introduce the Binary
Bleed method based on binary search, which significantly reduces the k search
space for these grid-search ML algorithms by truncating the target k values
from the search space using a heuristic with thresholding over the scores.
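The truncation idea can be sketched in a few lines. The following is a minimal, serial illustration under stated assumptions, not the authors' implementation: `score_fn` stands in for an expensive model fit (e.g., an NMFk or K-Means run at a given k), and the threshold value is hypothetical.

```python
def binary_bleed(ks, score_fn, threshold):
    """Visit candidate k values in binary-search order; whenever a score
    clears the threshold, prune ("bleed off") every smaller k so it is
    never fitted. Returns the best-scoring visited k and all scores."""
    ks = sorted(ks)
    pruned = set()   # k values truncated from the search space
    scores = {}      # k -> score, only for ks that were actually fitted

    def visit(lo, hi):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        k = ks[mid]
        if k not in pruned:
            s = score_fn(k)          # stand-in for an expensive model fit
            scores[k] = s
            if s >= threshold:
                # the ideal k is at or above this k: drop all smaller ones
                pruned.update(ks[:mid])
        visit(lo, mid - 1)
        visit(mid + 1, hi)

    visit(0, len(ks) - 1)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Toy score curve peaking at k = 6 (illustrative values only).
toy = {2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.8, 7: 0.6, 8: 0.3, 9: 0.2, 10: 0.1}
best_k, scores = binary_bleed(toy, toy.get, threshold=0.55)
# only 5 of the 9 candidates are ever fitted; best_k is 6
```

Because pruning removes whole prefixes of the candidate list, the savings grow with how early a threshold-clearing k is visited; the worst case degenerates to the full grid search.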
Binary Bleed is designed to work with single-node serial, single-node
multi-processing, and distributed computing resources. In our experiments, we
demonstrate the search-space reduction Binary Bleed achieves over a naive
sequential search for the ideal k, and its accuracy in identifying the correct
k for NMFk, K-Means, pyDNMFk, and pyDRESCALk with Silhouette and Davies-Bouldin
scores. We make our implementation of Binary Bleed for the NMF algorithm
available on GitHub.
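The multi-processing mode described above can be approximated with a worker pool sharing a pruned set: workers score candidate k values concurrently, and any score that clears the threshold bleeds off every smaller candidate that has not yet started. This is a toy sketch only — `TOY_SCORES` and `THRESHOLD` are hypothetical stand-ins, and threads stand in for the paper's processes or distributed nodes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

TOY_SCORES = {2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.8, 7: 0.6, 8: 0.3, 9: 0.2, 10: 0.1}
THRESHOLD = 0.55

pruned, lock = set(), threading.Lock()

def fit_and_score(k):
    with lock:
        if k in pruned:          # truncated before this worker started
            return k, None
    s = TOY_SCORES[k]            # stand-in for fitting a model at this k
    with lock:
        if s >= THRESHOLD:       # bleed off every smaller candidate
            pruned.update(j for j in TOY_SCORES if j < k)
    return k, s

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(fit_and_score, sorted(TOY_SCORES)))

scored = {k: s for k, s in results.items() if s is not None}
best_k = max(scored, key=scored.get)
```

Which candidates get pruned depends on worker timing, so the set of fitted ks is nondeterministic; any k whose score clears the threshold before a larger clearing k finishes will still be scored, which is why a final max over the surviving scores is taken.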