Statistical Applications in Genetics and Molecular Biology: Latest Articles (Page 6)

AdaReg: data adaptive robust estimation in linear regression with application in GTEx gene expressions

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-12-10 DOI: 10.1101/869362

Meng Wang, Lihua Jiang, M. Snyder

The Genotype-Tissue Expression (GTEx) project provides a valuable resource of large-scale gene expression data across multiple tissue types. In the presence of technical noise and unknown or unmeasured factors, robustly estimating the major tissue effect is challenging. Moreover, different genes exhibit heterogeneous expression across tissue types. We therefore need a robust method that adapts to this heterogeneity to improve estimation of the tissue effect. We follow the robust estimation approach based on the γ-density-power weight in Fujisawa, H. and Eguchi, S. (2008), Robust parameter estimation with a small bias against heavy contamination, J. Multivariate Anal. 99: 2053–2081, and Windham, M.P. (1995), Robustifying model fitting, J. Roy. Stat. Soc. B: 599–609, where γ is the exponent of the density weight and controls the balance between bias and variance. As far as we know, our work is the first to propose a procedure for tuning γ to balance the bias-variance trade-off under mixture models. We constructed a robust likelihood criterion based on weighted densities in a mixture model of a Gaussian population distribution mixed with an unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded in the robust estimation. We provide a heuristic analysis of the selection criterion and find, in simulation studies under a series of settings, that the average trend of our practical selection criterion across values of γ captures the minimizing γ about as well as the trend of the mean squared error (MSE), which cannot be estimated in practice. Our data-adaptive robustifying procedure for linear regression (AdaReg) showed a significant advantage over the fixed-γ procedure and other robust methods, both in simulation studies and in a real-data application estimating the tissue effect of heart samples from the GTEx project. The paper closes with a discussion of the method's limitations and future work.
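To make the role of γ concrete, the following is a minimal sketch of density-power-weighted estimation for a fixed γ in the simplest case, a Gaussian mean and variance. It is not the AdaReg procedure: it has no regression design and no data-adaptive γ selection, and the function name `gamma_weighted_gaussian`, the median/MAD starting values, and the stopping rule are assumptions made for illustration. Each observation is weighted by its current Gaussian density raised to the power γ, so points far from the bulk receive almost no weight; γ = 0 reduces to the ordinary maximum likelihood estimates.

```python
import math
import statistics


def gamma_weighted_gaussian(x, gamma, n_iter=500, tol=1e-12):
    """Fixed-gamma density-power-weighted estimate of a Gaussian mean and variance.

    Illustrative sketch only: names, starting values and stopping rule are
    assumptions, not taken from the AdaReg paper.
    """
    x = [float(v) for v in x]
    # Robust starting values: median and squared, rescaled MAD.
    mu = statistics.median(x)
    mad = statistics.median([abs(v - mu) for v in x])
    sigma2 = (1.4826 * mad) ** 2
    if sigma2 <= 0.0:
        raise ValueError("the data need a positive MAD to start the iteration")
    w = [1.0 / len(x)] * len(x)
    for _ in range(n_iter):
        # Weights proportional to N(x_i; mu, sigma2) ** gamma; the normalising
        # constant of the density cancels when the weights are rescaled.
        logw = [-0.5 * gamma * (v - mu) ** 2 / sigma2 for v in x]
        top = max(logw)
        w = [math.exp(lw - top) for lw in logw]
        total = sum(w)
        w = [wi / total for wi in w]
        mu_new = sum(wi * v for wi, v in zip(w, x))
        # The factor (1 + gamma) undoes the shrinkage of the weighted variance.
        sigma2_new = (1.0 + gamma) * sum(
            wi * (v - mu_new) ** 2 for wi, v in zip(w, x)
        )
        done = abs(mu_new - mu) + abs(sigma2_new - sigma2) < tol
        mu, sigma2 = mu_new, sigma2_new
        if done:
            break
    return mu, sigma2, w
```

A larger γ downweights outliers more aggressively at the cost of higher variance on clean data, which is the bias-variance trade-off that the abstract's γ-selection procedure is designed to tune.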

Citations: 3

A Bayesian framework for identifying consistent patterns of microbial abundance between body sites.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-11-08 DOI: 10.1515/sagmb-2019-0027

Richard Meier, Jeffrey A Thompson, Mei Chung, Naisi Zhao, Karl T Kelsey, Dominique S Michaud, Devin C Koestler

Recent studies have found that the microbiomes of both the gut and the mouth are associated with diseases of the gut, including cancer. If resident microbes were found to exhibit consistent patterns between the mouth and gut, disease status could potentially be assessed non-invasively by profiling oral samples. Currently, no generally applicable method exists to test for such associations. Here we present a Bayesian framework to identify microbes that exhibit consistent patterns between body sites with respect to a phenotypic variable. For a given operational taxonomic unit (OTU), a Bayesian regression model is used to obtain Markov chain Monte Carlo estimates of abundance among strata, calculate a correlation statistic, and conduct a formal test based on its posterior distribution. Extensive simulation studies demonstrate the overall viability of the approach and show which factors affect its performance. Applying our method to a dataset containing oral and gut microbiome samples from 77 pancreatic cancer patients revealed several OTUs exhibiting consistent patterns between gut and mouth with respect to disease subtype. Our method is well powered for modest sample sizes and moderate strength of association, and it can be flexibly extended to other research settings using any currently established Bayesian analysis program.

Citations: 4

Bi-level feature selection in high dimensional AFT models with applications to a genomic study

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-09-17 DOI: 10.1515/sagmb-2019-0016

Hailin Huang, Jizi Shangguan, Peifeng Ruan, Hua Liang

We propose a new bi-level feature selection method for high-dimensional accelerated failure time (AFT) models by reformulating them as a single-index model. The method yields solutions that are sparse at both the group level and the individual feature level, and it comes with an algorithm that is computationally efficient and easy to implement. We analyze a genomic dataset as an illustration and present a simulation study of the finite-sample performance of the proposed method.

Citations: 2

Clustering methods for single-cell RNA-sequencing expression data: performance evaluation with varying sample sizes and cell compositions

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-08-14 DOI: 10.1515/sagmb-2019-0004

A. Suner

A number of specialized clustering methods have been developed for the accurate analysis of single-cell RNA-sequencing (scRNA-seq) expression data, and several reports have documented the performance of these methods under different conditions. To date, however, no study has systematically evaluated the performance of these clustering methods while taking into account the sample size and cell composition of a given scRNA-seq dataset. Herein, a comprehensive performance evaluation of 11 selected scRNA-seq clustering methods was performed using synthetic datasets with known sample sizes and numbers of subpopulations, as well as varying levels of transcriptome complexity. The results indicate that the overall performance of the clustering methods under study is highly dependent on the sample size and complexity of the scRNA-seq dataset. In most cases, clustering performance improved as the number of cells in a given expression dataset increased. The findings also highlight the importance of sample size for the successful detection of rare cell subpopulations with an appropriate clustering tool.

Citations: 4

Inference of finite mixture models and the effect of binning

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-07-26 DOI: 10.1515/sagmb-2018-0035

Eva-Maria Geissen, J. Hasenauer, N. Radde

Finite mixture models are widely used in the life sciences for data analysis. Yet the calibration of these models to data is still challenging, as the optimization problems are often ill-posed. This holds for censored and uncensored data, and is caused by symmetries and other types of non-identifiability. Here, we discuss the problem of parameter estimation and model selection for finite mixture models from a theoretical perspective. We provide a review of the existing literature and illustrate the ill-posedness of the calibration problem for mixtures of uniform distributions and mixtures of normal distributions. Furthermore, we assess the effect of interval censoring on this estimation problem. Interestingly, we find that a proper treatment of censoring can facilitate the estimation of the number of mixture components compared to inference from uncensored data, a result that is surprising at first glance. The aim of the manuscript is to raise awareness of challenges in the calibration of finite mixture models and to provide an overview of available techniques.
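For orientation, the two estimation problems contrasted in the abstract can be written side by side. The notation here is assumed for illustration and is not taken from the paper: a mixture of K components with weights w_k (non-negative, summing to one), component densities f_k and distribution functions F_k, observations x_1, …, x_n, and, for binned (interval-censored) data, bin edges b_0 < b_1 < … < b_J with n_j observations falling in the j-th bin.

```latex
\ell_{\text{unc}}(\theta)
  = \sum_{i=1}^{n} \log \sum_{k=1}^{K} w_k \, f_k(x_i \mid \theta_k),
\qquad
\ell_{\text{bin}}(\theta)
  = \sum_{j=1}^{J} n_j \log \sum_{k=1}^{K} w_k
    \left[ F_k(b_j \mid \theta_k) - F_k(b_{j-1} \mid \theta_k) \right].
```

Both log-likelihoods are unchanged when the component labels are permuted, which is one of the symmetries behind the non-identifiability mentioned in the abstract.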

Citations: 1

A novel individualized drug repositioning approach for predicting personalized candidate drugs for type 1 diabetes mellitus.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-07-09 DOI: 10.1515/sagmb-2018-0052

Hong Zheng

The high cost and high failure rate of drug development motivate the use of drug repositioning in drug discovery. Existing drug repositioning techniques mainly focus on discovering candidate drugs for a type of disease and are not suitable for predicting candidate drugs for an individual sample. Type 1 diabetes mellitus (T1DM) is a disorder of glucose homeostasis caused by autoimmune destruction of the pancreatic β-cells. Here, we present a novel single-sample drug repositioning approach for predicting personalized candidate drugs for T1DM. Our method identifies drug-disease associations by measuring the similarity between the individualized pathway aberrance induced by the disease and that induced by various drugs, using a Kolmogorov-Smirnov weighted enrichment score algorithm. Using this method, we predicted several potential candidate drugs for T1DM. Some of them have been reported for the treatment of diabetes mellitus, and some that are currently indicated for other diseases might be repurposed to treat T1DM. This study conducts drug discovery by detecting the functional connections between disease and drug action on a personalized basis. Our framework provides a rational way to carry out systematic personalized drug discovery for complex diseases and contributes to the future application of customized therapeutic decisions.

Citations: 1

A penalized regression approach for DNA copy number study using the sequencing data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-05-30 DOI: 10.1515/sagmb-2018-0001

Jaeeun Lee, Jie Chen

Modeling high-throughput next generation sequencing (NGS) data from experiments that profile tumor and control samples for the study of DNA copy number variants (CNVs) remains a challenge in various ways. In this application work, we provide an efficient method for detecting multiple CNVs using NGS reads ratio data. The method is based on a multiple statistical change-point model fitted with a penalized regression approach, the 1d fused LASSO, which is designed for ordered data in a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously and efficiently. For tuning parameter selection, we propose a new modified Bayesian information criterion, called JMIC, and compare it with three different Bayesian information criteria used in the literature. Simulation results show that JMIC performs better for tuning parameter selection than the other three criteria. We applied our approach to sequencing data of the reads ratio between the breast tumor cell line HCC1954 and its matched normal cell line BL 1954, and the results are in line with those reported in the literature.
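As a reminder of what the penalized change-point formulation looks like, a common form of the 1d fused LASSO is sketched below. The notation is assumed for illustration: y_i is the (possibly log-transformed) reads ratio at ordered genomic position i, β_i is its piecewise-constant mean, and λ ≥ 0 is the tuning parameter. The paper's exact objective may differ, for example by including an additional sparsity penalty on β.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta \in \mathbb{R}^{n}}
    \; \frac{1}{2} \sum_{i=1}^{n} (y_i - \beta_i)^2
    + \lambda \sum_{i=1}^{n-1} \lvert \beta_{i+1} - \beta_i \rvert,
\qquad
\widehat{\mathcal{C}}(\lambda)
  = \{\, i : \hat{\beta}_{i+1}(\lambda) \neq \hat{\beta}_{i}(\lambda) \,\}.
```

The estimated change-point set, and hence both the number and the locations of candidate CNV boundaries, is read off from where the fitted means jump. Larger λ yields fewer jumps, which is why a criterion for choosing λ along the solution path is needed.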

Citations: 3

Truncated rank correlation (TRC) as a robust measure of test-retest reliability in mass spectrometry data.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-05-30 DOI: 10.1515/sagmb-2018-0056

Johan Lim, Donghyeon Yu, Hsun-Chih Kuo, Hyungwon Choi, Scott Walmsley

In mass spectrometry (MS) experiments, thousands of peaks are detected in the space of mass-to-charge ratio and chromatographic retention time, each associated with an abundance measurement. However, a large proportion of the peaks consists of experimental noise, and low-abundance compounds are typically masked by noise peaks, compromising the quality of the data. In this paper, we propose a new measure of similarity between a pair of MS experiments, called truncated rank correlation (TRC). To provide a robust metric of similarity in noisy high-dimensional data, TRC uses truncated top ranks (or top m-ranks) for calculating correlation. A comprehensive numerical study suggests that TRC outperforms traditional sample correlation and Kendall's τ. We apply TRC to measuring the test-retest reliability of two MS experiments: a biological replicate analysis of the metabolome in HEK293 cells and metabolomic profiling of benign prostate hyperplasia (BPH) patients. An R package, trc, implementing the proposed TRC and related functions is available at https://sites.google.com/site/dhyeonyu/software.
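The idea of correlating only the top ranks can be illustrated with a small sketch. This is one plausible reading of "truncated top ranks", not the paper's definition and not the API of the `trc` R package: each abundance vector is ranked from largest to smallest, every rank beyond m is collapsed to the common value m + 1, and the Pearson correlation of the truncated ranks is returned. The function names and the tie handling are assumptions made for illustration.

```python
import math


def truncated_ranks(values, m):
    """Rank from largest (rank 1) to smallest; ranks beyond m become m + 1.

    Ties are broken by input order. Illustrative assumption, not the paper's rule.
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = min(r, m + 1)
    return ranks


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_a * var_b)


def truncated_rank_correlation(x, y, m):
    """Correlation of top-m truncated ranks between two experiments (sketch)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(truncated_ranks(x, m), truncated_ranks(y, m))
```

Under this reading, disagreement among low-abundance peaks, where noise dominates, does not lower the score, because all of those peaks share the same truncated rank.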

Citations: 0

Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies.

IF 0.9, Zone 4 (Mathematics), Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-05-11 DOI: 10.1515/sagmb-2018-0039

Yulan Liang, Adam Kelemen, Arpad Kelemen

Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of the limited sample, various biological factors such as environmental confounders, and the inherent experimental and technical noise, compounded by the inadequacy of statistical tools, can lead to misinterpretation of results and, subsequently, to very different biological conclusions. In this paper, we investigate biomarker reproducibility issues potentially caused by differences between statistical methods with varied distributional assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compare the markers identified by statistical single-feature selection approaches with those identified by machine learning wrapper methods. The results reveal marked differences between the protein markers selected by different methods, with potential selection biases and false discoveries, which may be due to small effects, different distributional assumptions, and the use of p value type criteria versus prediction accuracy. Alternative solutions and other related issues are discussed in support of the reproducibility of findings for clinically actionable outcomes.
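Because the choice of p value criterion is one of the sources of disagreement the abstract points to, it helps to see how a false discovery rate adjustment reshapes a list of raw p values. The sketch below assumes the Benjamini-Hochberg step-up adjustment; the abstract does not say which FDR procedure the authors used, and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order.

    Assumption: the abstract does not name its FDR procedure; BH is used
    here purely as an illustration.
    """
    n = len(pvals)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A marker can pass a raw p value cut-off and still fail the adjusted one, so two analyses of the same data that differ only in this criterion can report different marker lists.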

Citations: 3

Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions.

IF 0.9 Zone 4 (Mathematics) Q3 Mathematics

Statistical Applications in Genetics and Molecular Biology

Pub Date : 2019-05-01 DOI: 10.1515/sagmb-2018-0042

Luis M de Campos, Andrés Cano, Javier G Castellano, Serafín Moral

Gene Regulatory Networks (GRNs) are regarded as the most adequate instrument for gaining clear insight into and understanding of cellular systems. One of the most successful techniques for reconstructing GRNs from gene expression data is Bayesian networks (BNs), which have proven to be an ideal approach for heterogeneous data integration in the learning process. So far, however, prior knowledge has been incorporated by using prior beliefs or by using networks as a starting point in the search process. In this work, we consider the use of different kinds of structural restrictions within algorithms for learning BNs from gene expression data. These restrictions codify prior knowledge in such a way that a BN must satisfy them. One aim of this work is therefore to provide a detailed review of the use of prior knowledge and gene expression data to infer GRNs with BNs, but the major purpose of this paper is to investigate whether structural learning algorithms for BNs from expression data can achieve better outcomes by exploiting this prior knowledge through structural restrictions. The experimental study shows that this new way of incorporating prior knowledge leads to better reverse-engineered networks.
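To make the idea of structural restrictions concrete, here is a minimal sketch of how a search step might filter candidate arcs against prior knowledge. The abstract does not list the restriction types or the search algorithm, so the required-arc and forbidden-arc sets, the function names, and the gene names are all assumptions made for illustration, not the authors' implementation.

```python
def is_acyclic(nodes, arcs):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    for _, child in arcs:
        indegree[child] += 1
    frontier = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while frontier:
        node = frontier.pop()
        visited += 1
        for parent, child in arcs:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return visited == len(nodes)


def allowed_additions(nodes, arcs, forbidden):
    """Arcs a search step may add without violating the restrictions."""
    candidates = []
    for parent in nodes:
        for child in nodes:
            arc = (parent, child)
            if parent == child or arc in arcs or arc in forbidden:
                continue
            # A Bayesian network must remain a directed acyclic graph.
            if is_acyclic(nodes, arcs | {arc}):
                candidates.append(arc)
    return sorted(candidates)


genes = ["g1", "g2", "g3"]   # hypothetical gene names
required = {("g1", "g2")}    # assumed prior knowledge: g1 regulates g2
forbidden = {("g3", "g1")}   # assumed prior knowledge: g3 does not regulate g1
start = set(required)        # the search starts from the required arcs
moves = allowed_additions(genes, start, forbidden)
print(moves)
```

In this sketch the restrictions shrink the search space before any scoring happens: the forbidden arc is never proposed, and the required arc is present from the start, so a score-based search would only evaluate structures consistent with the prior knowledge.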

Citations: 6