Decision of the Optimal Rank of a Nonnegative Matrix Factorization Model for Gene Expression Data Sets Utilizing the Unit Invariant Knee Method: Development and Evaluation of the Elbow Method for Rank Selection.

Emine Guven
JMIR Bioinformatics and Biotechnology. Journal article, published 2023-06-06.
DOI: 10.2196/43665 (https://doi.org/10.2196/43665)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11135234/pdf/

Abstract

Background: There is a great need for computational approaches that analyze and exploit the information contained in gene expression data. Recent applications of nonnegative matrix factorization (NMF) in computational biology have shown that it can extract essential information from large amounts of data, in particular from gene expression microarrays. A common problem in NMF is choosing the appropriate number of factors, that is, the rank (r) of the reduced representation, and there is no agreement on which technique is most appropriate for this purpose. Consequently, various techniques have been suggested for selecting the optimal factorization rank (r).
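
For reference, the rank selection problem can be written in standard NMF notation. The symbols V, W, H, n, and m do not appear in the abstract and are introduced here only for illustration; the residual sum of squares is shown as one common measure of fit whose behavior over r is examined when choosing the rank.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{n \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{n \times r}, \quad
H \in \mathbb{R}_{\ge 0}^{r \times m}, \quad
r < \min(n, m)

\mathrm{RSS}(r) = \lVert V - W H \rVert_F^2
               = \sum_{i=1}^{n} \sum_{j=1}^{m} \bigl( V_{ij} - (W H)_{ij} \bigr)^2
```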

Objective: This work proposes a new metric for rank selection based on the elbow method and compares it systematically against the cophenetic metric.

Methods: To determine the optimal rank (r), this study applied the unit invariant knee (UIK) method to NMF of gene expression data sets. The UIK method relies on an extremum distance estimator, which is used to estimate the inflection point and identify a knee point. Accordingly, the proposed method finds the first inflection point of the curve of the residual sum of squares obtained from the NMF algorithms, with the gene expression data set as the target matrix.
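
The idea of reading a rank off the residual sum of squares curve without visual inspection can be illustrated with a small sketch. This is not the UIK method or its extremum distance estimator, and it is not the author's implementation; it is a simpler, widely used elbow heuristic that picks the point lying farthest below the chord joining the first and last points of the curve. The function name `knee_by_chord_distance` and the RSS values are hypothetical and chosen only for illustration.

```python
def knee_by_chord_distance(ranks, rss):
    """Return the rank whose RSS lies farthest below the chord
    joining the first and last (rank, RSS) points.

    Assumption: a simple elbow heuristic used as a stand-in; it is
    NOT the UIK / extremum distance estimator described in the text.
    """
    if len(ranks) != len(rss) or len(ranks) < 3:
        raise ValueError("need at least three (rank, RSS) pairs of equal length")

    x0, y0 = ranks[0], rss[0]
    x1, y1 = ranks[-1], rss[-1]
    slope = (y1 - y0) / (x1 - x0)

    best_rank, best_gap = ranks[0], 0.0
    for x, y in zip(ranks, rss):
        chord_y = y0 + slope * (x - x0)  # height of the chord at this rank
        gap = chord_y - y                # positive when the curve is below the chord
        if gap > best_gap:
            best_rank, best_gap = x, gap
    return best_rank


# Illustrative (made-up) RSS values for candidate ranks 2..7.
ranks = [2, 3, 4, 5, 6, 7]
rss = [100.0, 60.0, 30.0, 26.0, 23.0, 21.0]
selected = knee_by_chord_distance(ranks, rss)
print(selected)
```

Because the vertical gap to the chord scales uniformly when the RSS values are multiplied by a constant, the selected rank in this sketch does not depend on the units of the RSS, which is the same practical property the knee-based approach aims for.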

Results: The UIK computation was carried out on gene expression data from acute lymphoblastic leukemia and acute myeloid leukemia samples, and the NMF results obtained with different algorithms were compared. The proposed UIK method is easy to perform, fast, requires no a priori rank value as input, and does not require initial parameters that significantly influence the model's functionality.

Conclusions: This study demonstrates that the elbow method provides a credible rank prediction for gene expression data and precisely estimates the rank of simulated mutational processes data with known dimensions. The proposed UIK method is faster and significantly more computationally efficient than conventional methods, including metrics that use the consensus matrix as a criterion for rank selection, and it requires no visual inspection of the curves. Finally, the suggested rank tuning method based on the elbow method for gene expression data is arguably theoretically superior to the cophenetic measure.
