Statistical Applications in Genetics and Molecular Biology: Latest Articles (Page 3)

A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2022-0057

Lavkush Sharma, Akshay Deepak, Ashish Ranjan, Gopalakrishnan Krishnasamy

Proteins are the building blocks of all living things, and their functions must be determined if the molecular mechanisms of life are to be understood. CNNs are good at capturing short-term relationships, whereas GRUs and LSTMs can capture long-term dependencies; combining the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful information and build representations of proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention model with protein language model embeddings that combines the output of the CNN with the output of the BiGRU-Attention to predict protein functions. We evaluated the proposed hybrid model on human and yeast datasets. Compared with the state-of-the-art model SDN2GO, it improves the Fmax value on the human dataset by 1.9 % for cellular component, 3.8 % for molecular function and 0.6 % for biological process prediction, and on the yeast dataset by 2.4 %, 5.2 % and 1.2 % for the same three tasks, respectively.
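
The Fmax values quoted in this abstract are protein-centric maximum F-measures. As a rough illustration of how such a score is commonly computed (a minimal sketch, not the authors' evaluation code; the function name `fmax`, the input layout and the threshold grid are assumptions made here), precision and recall are averaged over proteins at each score threshold and the best harmonic mean is kept:

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true:  list of sets, the true GO terms of each protein (hypothetical layout)
    y_score: list of dicts mapping GO term -> predicted score in [0, 1]
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 0.99
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            true_pos = len(predicted & truth)
            # Precision is averaged only over proteins with at least one prediction
            if predicted:
                precisions.append(true_pos / len(predicted))
            # Recall is averaged over all proteins (0 if a protein has no true terms)
            recalls.append(true_pos / len(truth) if truth else 0.0)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Evaluation protocols differ in details such as the threshold grid and the handling of proteins without annotations, so published numbers should be compared only under the same protocol.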

Citations: 0

CAT PETR: a graphical user interface for differential analysis of phosphorylation and expression data

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1101/2023.01.04.522665

Keegan Flanagan, S. Pelech, Y. Av‐Gay, K. Dao Duc

Antibody microarray data provide a powerful, high-throughput tool for monitoring global changes in cellular response to perturbation or genetic manipulation. However, while collecting such data has become increasingly accessible, a lack of dedicated computational tools has limited their analysis. Here we present CAT PETR, a user-friendly web application for the differential analysis of expression and phosphorylation data collected via antibody microarrays. Our application addresses the limitations of other GUI-based tools by providing various data input options and visualizations. To illustrate its capabilities on real data, we show that CAT PETR both replicates previous findings and reveals additional insights through its advanced visualization and statistical options.

Citations: 0

Accurate and fast small p-value estimation for permutation tests in high-throughput genomic data analysis with the cross-entropy method.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2021-0067

Yang Shi, Weiping Shi, Mengqiao Wang, Ji-Hyun Lee, Huining Kang, Hui Jiang

Permutation tests are widely used for statistical hypothesis testing when the sampling distribution of the test statistic under the null hypothesis is analytically intractable or unreliable due to finite sample sizes. One critical challenge in applying permutation tests in genomic studies is that an enormous number of permutations is often needed to obtain reliable estimates of very small p-values, leading to intensive computational effort. To address this issue, we develop algorithms for the accurate and efficient estimation of small p-values in permutation tests for paired and independent two-group genomic data. Our approaches leverage a novel framework that parameterizes the permutation sample spaces of these two types of data using the Bernoulli and conditional Bernoulli distributions, respectively, combined with the cross-entropy method. We demonstrate the performance of the proposed algorithms on two simulated datasets and two real-world gene expression datasets generated by microarray and RNA-Seq technologies, and compare them with existing methods such as crude permutations and SAMC. The results show that our approaches can achieve computational efficiency gains of orders of magnitude in estimating small p-values. Our approaches offer promising solutions for improving the computational efficiency of existing permutation test procedures and for developing new permutation-based testing methods in genomic data analysis.
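
To make the computational problem concrete, the following is a minimal sketch of the crude permutation baseline for paired data, not the cross-entropy algorithm proposed in the paper. Each paired difference keeps or flips its sign with probability 0.5, which corresponds to the independent Bernoulli view of the paired permutation space. The function name, the choice of the absolute sum as test statistic and the add-one estimator are assumptions made for illustration.

```python
import random


def paired_permutation_pvalue(diffs, n_perm=10000, seed=0):
    """Crude Monte Carlo two-sided p-value for paired differences.

    diffs: list of within-pair differences (hypothetical input layout).
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Flip the sign of each difference independently with probability 0.5
        stat = abs(sum(d if rng.random() < 0.5 else -d for d in diffs))
        if stat >= observed:
            hits += 1
    # Add-one estimator: the estimate can never fall below 1 / (n_perm + 1)
    return (hits + 1) / (n_perm + 1)
```

Because the estimate is bounded below by 1 / (n_perm + 1), resolving a p-value near 1e-8 with this baseline needs on the order of 1e8 or more permutations per test, which is the cost that motivates more efficient estimators.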

Citations: 1

A Bayesian model to identify multiple expression patterns with simultaneous FDR control for a multi-factor RNA-seq experiment.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2023-01-01 DOI: 10.1515/sagmb-2022-0025

Yuanyuan Bian, Chong He, Jing Qiu

It is often of research interest to identify genes that satisfy a particular expression pattern across different conditions such as tissues or genotypes. One common practice is to perform differential expression analysis for each condition separately and then take the intersection of the differentially expressed (DE) or non-DE genes under each condition to obtain genes that satisfy a particular pattern. Such a method can lead to many false positives, especially when the desired gene expression pattern involves equivalent expression under one condition. In this paper, we apply a Bayesian partition model to identify genes of all desired patterns while simultaneously controlling their false discovery rates (FDRs). Our simulation studies show that the common practice fails to control group-specific FDRs for patterns involving equivalent expression, whereas the proposed Bayesian method simultaneously controls group-specific FDRs in all settings studied. In addition, in settings where the common practice does keep the FDR under control, namely when identifying patterns involving only DE genes, the proposed method is more powerful. Our simulation studies also show that identifying patterns involving equivalent expression is an inherently more challenging problem than identifying patterns involving only differential expression. Therefore, larger sample sizes are required to reach the same target power for the former type of pattern than for the latter.
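
The "common practice" criticized in this abstract can be sketched in a few lines. This is an illustrative reading of the intersection approach, not code from the paper; the function name, argument layout and gene and condition labels are hypothetical.

```python
def intersect_pattern(all_genes, de_by_condition, pattern):
    """Select genes by intersecting per-condition DE / non-DE calls.

    all_genes:       iterable of every tested gene
    de_by_condition: dict condition -> set of genes called DE in that condition
    pattern:         dict condition -> True (must be DE) or False (must be non-DE)
    """
    universe = set(all_genes)
    selected = set(universe)
    for condition, wants_de in pattern.items():
        de = de_by_condition[condition]
        # "Non-DE" here only means "not called DE", i.e. a failure to reject
        selected &= de if wants_de else (universe - de)
    return selected


genes = {"g1", "g2", "g3", "g4"}
calls = {"tissueA": {"g1", "g2"}, "tissueB": {"g2", "g3"}}
# Genes DE in tissue A and "equivalently expressed" in tissue B
print(intersect_pattern(genes, calls, {"tissueA": True, "tissueB": False}))
```

One plausible source of the false positives mentioned above is visible in the comment: a gene that merely fails to reach significance in one condition is treated as equivalently expressed there, although no evidence of equivalence has been established.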

Citations: 0

pwrBRIDGE: a user-friendly web application for power and sample size estimation in batch-confounded microarray studies with dependent samples.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-10-10 eCollection Date: 2022-01-01 DOI: 10.1515/sagmb-2022-0003

Qing Xia, Jeffrey A Thompson, Devin C Koestler

Batch effect Reduction of mIcroarray data with Dependent samples usinG Empirical Bayes (BRIDGE) is a recently developed statistical method for batch effect correction in batch-confounded microarray studies with dependent samples. The key component of the BRIDGE methodology is the use of samples run as technical replicates in two or more batches, called "bridging samples", to inform batch effect correction/attenuation. While previously published results indicate a relationship between the number of bridging samples, M, and the statistical power of downstream testing on the batch-corrected data, there is as yet no formal statistical framework or user-friendly software for estimating the M needed to achieve a specific statistical power for hypothesis tests conducted on the batch-corrected data. To fill this gap, we developed pwrBRIDGE, a simulation-based approach to estimate the bridging sample size, M, in batch-confounded longitudinal microarray studies. To illustrate the use of pwrBRIDGE, we consider a hypothetical longitudinal batch-confounded study whose goal is to identify genes in human blood associated with progression from amnestic mild cognitive impairment (aMCI) to Alzheimer's disease (AD) after a 5-year follow-up. pwrBRIDGE helps researchers design and plan batch-confounded microarray studies with dependent samples to avoid over- or under-powered studies.
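
The general idea of simulation-based power and sample size estimation can be sketched as follows. This is a generic illustration only: it does not implement BRIDGE or pwrBRIDGE, and the data model (normally distributed paired differences with known standard deviation, tested with a two-sided z-test) and all names and default values are assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist


def simulated_power(n, delta, sigma=1.0, alpha=0.05, n_sim=2000, seed=0):
    """Fraction of simulated studies of size n that reject H0: mean difference = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        # Simulate n paired differences with true mean delta
        diffs = [rng.gauss(delta, sigma) for _ in range(n)]
        z = (sum(diffs) / n) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


def smallest_n(delta, target=0.8, n_max=200, **kwargs):
    """Smallest sample size whose simulated power reaches the target, or None."""
    for n in range(2, n_max + 1):
        if simulated_power(n, delta, **kwargs) >= target:
            return n
    return None
```

A call such as `smallest_n(0.5, target=0.8)` then scans sample sizes until the simulated power reaches 80 %. A tool for bridging samples would replace the toy data model with one that simulates batch effects and the batch-correction step.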

Citations: 0

Distinct characteristics of correlation analysis at the single-cell and the population level.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-08-02 eCollection Date: 2022-01-01 DOI: 10.1515/sagmb-2022-0015

Guoyu Wu, Yuchao Li

Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interest for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap between the co-expressed genes identified in single-cell level investigations and those identified in population level investigations. However, the nature of the relationship between correlations at the single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and those at the population level, and to bridge the gap between them. By developing formulations that link correlations at the single-cell and the population level, we illustrated that aggregated correlations could be stronger than, weaker than or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. Besides, our data indicated that the aggregated correlation is more likely to be stronger than the corresponding individual correlation, and that it was rare to find gene pairs strongly correlated exclusively at the single-cell level. Through a bottom-up approach to modelling interactions between molecules in a signaling cascade or in multi-regulator-controlled gene expression, we surprisingly found that the existence of an interaction between two components could not be excluded simply on the basis of their low correlation coefficients, suggesting that connectivity within biological networks derived solely from correlation analysis should be reconsidered. We also investigated the impact of technical random measurement errors on the correlation coefficients at the single-cell level and the population level. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated from single-cell level data might differ from those calculated from population level data. Depending on the specific question being asked, proper sampling and normalization procedures should be carried out before any conclusions are drawn.
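
A small deterministic toy example shows how correlations computed on aggregated values can differ from those computed on individual cells. It is not the formulation developed in the paper; the data, the variable names and the use of per-population means as the "aggregated" measurement are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Three populations of three cells each; within a population x and y move in
# opposite directions, while the population means of x and y move together.
populations = [[(10 * k + d, 10 * k - d) for d in (-1, 0, 1)] for k in (1, 2, 3)]

# Correlation among the cells of each population
within = [pearson([x for x, _ in cells], [y for _, y in cells]) for cells in populations]

# Correlation over all single cells pooled together
all_cells = [cell for cells in populations for cell in cells]
pooled = pearson([x for x, _ in all_cells], [y for _, y in all_cells])

# Correlation of the aggregated (mean) values across populations
mean_x = [sum(x for x, _ in cells) / len(cells) for cells in populations]
mean_y = [sum(y for _, y in cells) / len(cells) for cells in populations]
aggregated = pearson(mean_x, mean_y)

print(within, pooled, aggregated)
```

In this constructed case the within-population correlation is -1 while the correlation of the aggregated values is +1, which illustrates why a correlation observed at one level need not carry over to the other.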

Citations: 0

Use of SVM-based ensemble feature selection method for gene expression data analysis.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-07-14 DOI: 10.1515/sagmb-2022-0002

Shizhi Zhang, Mingjin Zhang

Gene selection is one of the key steps in gene expression data analysis. An SVM-based ensemble feature selection method is proposed in this paper. Firstly, the method builds many subsets by Monte Carlo sampling. Secondly, it ranks all the features on each subset and integrates the rankings to obtain a final ranking list. Finally, the optimum feature set is determined by a backward feature elimination strategy. The method is applied to four public datasets (Leukemia, Prostate, Colorectal, and SMK_CAN), yielding 7, 10, 13, and 32 selected features, respectively. The AUC values obtained on independent test sets are 0.9867, 0.9796, 0.9571, and 0.9575, respectively. These results indicate that the features selected by the proposed method can improve sample classification accuracy, and that the method is thus effective for gene selection from gene expression data.
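
The first two steps of this procedure (Monte Carlo subsets, then per-subset ranking and rank integration) can be sketched with the standard library alone. This is a hedged illustration, not the authors' implementation: a simple absolute difference of class means stands in for the SVM-based ranking, ranks are integrated by summing them, the backward elimination step is omitted, and all names and defaults are hypothetical.

```python
import random


def mean_diff_scores(X, y):
    """Stand-in feature score: |mean of class 1 - mean of class 0| per feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def ensemble_ranking(X, y, n_subsets=50, frac=0.8, seed=0):
    """Return feature indices ordered from most to least important."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    subset_size = max(2, int(frac * n_samples))
    rank_sums = [0] * n_features
    for _ in range(n_subsets):
        # Monte Carlo step: draw a random subset of samples without replacement
        idx = rng.sample(range(n_samples), subset_size)
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip subsets that contain only one class
        scores = mean_diff_scores([X[i] for i in idx], ys)
        # Rank features on this subset (rank 0 = best) and accumulate the ranks
        order = sorted(range(n_features), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            rank_sums[j] += rank
    # Integration step: a lower summed rank means a more important feature
    return sorted(range(n_features), key=lambda j: rank_sums[j])
```

To move closer to the described method, `mean_diff_scores` would be replaced by a ranking derived from a trained SVM, and the final list would be pruned by backward elimination against a validation metric.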

Citations: 0

A robust association test with multiple genetic variants and covariates.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-06-06 DOI: 10.1515/sagmb-2021-0029

Jen-Yu Lee, Pao-Sheng Shen, Kuang-Fu Cheng

Owing to advances in genome sequencing techniques, great strides have been made in exome sequencing, making association studies between disease and genetic variants feasible. Several powerful and well-known association tests have been proposed to test the association between a group of genes and the disease of interest. However, challenges remain; in particular, many factors can affect testing power, e.g., the sample size, the number of causal and non-causal variants, and the direction of the effect of causal variants. Recently, a powerful test, called T_REM, was derived based on a random effects model. T_REM has the advantage of being less sensitive to the inclusion of non-causal rare variants or low-effect common variants, or to the presence of missing genotypes. However, the testing power of T_REM can be low when a portion of the causal variants have effects in opposite directions. To address this drawback, we propose a novel test, called T_ROB, which keeps the advantages of T_REM and is more robust than T_REM in that it retains adequate power when variants have opposite directions of effect. Simulation results show that T_ROB has a stable type I error rate and outperforms T_REM when the proportion of risk variants decreases to a certain level, and that its advantage over T_REM increases as the proportion decreases. Furthermore, T_ROB outperforms several other competing tests in most scenarios. The proposed methodology is illustrated using the Shanghai Breast Cancer Study.

Citations: 0

Challenges for machine learning in RNA-protein interaction prediction.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-05-02 DOI: 10.1515/sagmb-2021-0087

Viplove Arora, Guido Sanguinetti

RNA-protein interactions have long been recognised as crucial regulators of gene expression. Recently, the development of scalable experimental techniques to measure these interactions has revolutionised the field, leading to the production of large-scale datasets that offer both opportunities and challenges for machine learning techniques. In this brief note, we discuss some of the major stumbling blocks to the use of machine learning in computational RNA biology, focusing specifically on the problem of predicting RNA-protein interactions from next-generation sequencing data.

Citations: 1

Estimation of the covariance structure from SNP allele frequencies

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2022-01-01 DOI: 10.1515/sagmb-2022-0005

J. van Waaij, Zilong Li, C. Wiuf

We propose two new statistics, $\hat{V}$ and $\hat{S}$, to disentangle the population history of related populations from SNP frequency data. If the populations are related by a tree, we show by theoretical means as well as by simulation that the new statistics are able to identify the root of a tree correctly, in contrast to standard statistics, such as the observed matrix of $F_2$-statistics (distances between pairs of populations). The statistic $\hat{V}$ is obtained by averaging over all SNPs (similar to standard statistics). Its expectation is the true covariance matrix of the observed population SNP frequencies, offset by a matrix with identical entries. In contrast, the statistic $\hat{S}$ is put in a Bayesian context and is obtained by averaging over pairs of SNPs, such that each SNP is only used once. It thus makes use of the joint distribution of pairs of SNPs. In addition, we provide a number of novel mathematical results about old and new statistics, and their mutual relationship.
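
For orientation, the standard F2 matrix that the abstract contrasts with the new statistics can be sketched as the average squared difference in allele frequency between two populations. This is a minimal, uncorrected version written for illustration; it omits the sample-size bias corrections used in practice, it does not implement the statistics proposed in the paper (whose definitions are not given here), and the function name and input layout are assumptions.

```python
def f2_matrix(freqs):
    """Plain F2 distances between populations.

    freqs[i][s] is the allele frequency of SNP s in population i
    (hypothetical layout; every population has the same SNPs in the same order).
    """
    n_pops = len(freqs)
    n_snps = len(freqs[0])
    f2 = [[0.0] * n_pops for _ in range(n_pops)]
    for i in range(n_pops):
        for j in range(i + 1, n_pops):
            # Average over SNPs of the squared frequency difference
            d = sum((freqs[i][s] - freqs[j][s]) ** 2 for s in range(n_snps)) / n_snps
            f2[i][j] = f2[j][i] = d
    return f2
```

The resulting matrix is symmetric with a zero diagonal and depends only on pairwise differences between populations, which is consistent with the abstract's remark that such distance-based statistics cannot by themselves locate the root of the tree.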

Citations: 1