Decision of the Optimal Rank of a Nonnegative Matrix Factorization Model for Gene Expression Data Sets Utilizing the Unit Invariant Knee Method: Development and Evaluation of the Elbow Method for Rank Selection.

JMIR Bioinformatics and Biotechnology. Pub Date: 2023-06-06. DOI: 10.2196/43665

Emine Guven



Abstract

Background: There is a great need for computational approaches that analyze and exploit the information contained in gene expression data. Recent applications of nonnegative matrix factorization (NMF) in computational biology have shown that it can extract essential details from large volumes of data, in particular gene expression microarrays. A common problem in NMF is choosing the proper rank (r), that is, the number of factors in the reduced representation, but there is no agreement on which technique is most appropriate for this purpose. Consequently, various techniques have been suggested for selecting the optimal factorization rank (r).

Objective: This work proposes a new metric for rank selection based on the elbow method and systematically compares it against the cophenetic metric.

Methods: To determine the optimal rank (r), this study applied the unit invariant knee (UIK) method to NMF of gene expression data sets. The UIK method relies on an extremum distance estimator, which is used to identify inflection and knee points. With a gene expression data set as the target matrix, the proposed method finds the first inflection point of the residual sum of squares curve produced by the NMF algorithms under consideration.
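The idea of locating a knee on the residual sum of squares (RSS) curve can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the NMF fit uses plain multiplicative updates written in NumPy, and the knee is found with a simple chord-distance heuristic (the point farthest from the line joining the curve's endpoints) rather than the full UIK procedure with its extremum distance estimator. The function names `nmf_rss`, `knee_by_chord_distance`, and `rank_by_rss_knee`, and the toy matrix, are hypothetical.

```python
import numpy as np


def nmf_rss(V, r, n_iter=200, seed=0, eps=1e-9):
    """Fit V ~ W @ H (W, H >= 0) by multiplicative updates; return the RSS."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the Frobenius-norm objective.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return float(np.sum((V - W @ H) ** 2))


def knee_by_chord_distance(x, y):
    """Return the x whose y lies farthest from the chord joining the endpoints.

    The vertical offset from the chord is proportional to the perpendicular
    distance, so the chosen point does not change when y is rescaled.
    """
    xs = np.asarray(x, dtype=float)
    ys = np.asarray(y, dtype=float)
    chord = ys[0] + (ys[-1] - ys[0]) * (xs - xs[0]) / (xs[-1] - xs[0])
    return x[int(np.argmax(np.abs(ys - chord)))]


def rank_by_rss_knee(V, ranks):
    """Compute the RSS for each candidate rank and pick the knee of the curve."""
    ranks = list(ranks)
    rss = [nmf_rss(V, r) for r in ranks]
    return knee_by_chord_distance(ranks, rss), rss


# Hypothetical toy data: a nonnegative 40 x 30 matrix built from 3 factors.
rng = np.random.default_rng(1)
V = rng.random((40, 3)) @ rng.random((3, 30))
best_rank, rss = rank_by_rss_knee(V, range(1, 9))
print("selected rank:", best_rank)
print("RSS per rank:", [round(v, 3) for v in rss])
```

Because the selection depends only on the shape of the RSS curve, no rank has to be supplied in advance and no plot has to be inspected by eye, which matches the motivation given for the knee-based approach.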

Results: The UIK computation was carried out on gene expression data from acute lymphoblastic leukemia and acute myeloid leukemia samples, and the resulting NMF factorizations were compared across different algorithms. The proposed UIK method is easy to perform, fast, needs no a priori rank value as input, and does not require initial parameters that significantly influence the model's behavior.

Conclusions: This study demonstrates that the elbow method gives credible rank predictions for gene expression data and precise estimates for simulated mutational processes data with known dimensions. The proposed UIK method is faster than conventional methods, including metrics that use the consensus matrix as a criterion for rank selection, and it achieves significantly better computational efficiency without requiring visual inspection of the curves. Finally, the suggested elbow-based rank tuning method for gene expression data is arguably theoretically superior to the cophenetic measure.
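For comparison, the consensus-matrix criterion mentioned above is commonly computed as a cophenetic correlation coefficient. The sketch below is one common way to do this with SciPy and is an assumption about the general technique, not code from the paper: `consensus_matrix` and `cophenetic_coefficient` are hypothetical helper names, the cluster labels are made-up toy values, and average linkage is an illustrative choice.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform


def consensus_matrix(label_runs):
    """Average the connectivity matrices of several clustering runs.

    label_runs has shape (n_runs, n_samples); entry (i, j) of the result is
    the fraction of runs in which samples i and j share a cluster.
    """
    runs = np.asarray(label_runs)
    conn = (runs[:, :, None] == runs[:, None, :]).astype(float)
    return conn.mean(axis=0)


def cophenetic_coefficient(consensus):
    """Cophenetic correlation of the consensus matrix (distance = 1 - consensus)."""
    dist = 1.0 - consensus
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    Z = linkage(condensed, method="average")
    coef, _ = cophenet(Z, condensed)
    return float(coef)


# Toy labels from three hypothetical NMF runs on four samples.
stable_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
noisy_runs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 0]]
print("stable:", cophenetic_coefficient(consensus_matrix(stable_runs)))
print("noisy:", cophenetic_coefficient(consensus_matrix(noisy_runs)))
```

This criterion needs many NMF runs per candidate rank to build the consensus matrix, which is the extra cost a knee-based selection on a single RSS curve avoids.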


