Latest articles from Statistical Applications in Genetics and Molecular Biology

Bi-level feature selection in high dimensional AFT models with applications to a genomic study
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-09-17, DOI: 10.1515/sagmb-2019-0016
Hailin Huang, Jizi Shangguan, Peifeng Ruan, Hua Liang
We propose a new bi-level feature selection method for high-dimensional accelerated failure time (AFT) models by reformulating them as a single-index model. The method yields solutions that are sparse at both the group and individual feature levels, and it comes with an algorithm that is computationally efficient and easy to implement. We analyze a genomic dataset as an illustration and present a simulation study of the finite-sample performance of the proposed method.
Citations: 2
Clustering methods for single-cell RNA-sequencing expression data: performance evaluation with varying sample sizes and cell compositions
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-08-14, DOI: 10.1515/sagmb-2019-0004
A. Suner
A number of specialized clustering methods have been developed for the accurate analysis of single-cell RNA-sequencing (scRNA-seq) expression data, and several reports have documented the performance of these methods under different conditions. However, to date, no study has systematically evaluated clustering performance while taking into account the sample size and cell composition of a given scRNA-seq dataset. Herein, a comprehensive performance evaluation of 11 selected scRNA-seq clustering methods was performed using synthetic datasets with known sample sizes and numbers of subpopulations, as well as varying levels of transcriptome complexity. The results indicate that the overall performance of the clustering methods under study is highly dependent on the sample size and complexity of the scRNA-seq dataset. In most cases, clustering performance improved as the number of cells in a given expression dataset increased. The findings also highlight the importance of sample size for the successful detection of rare cell subpopulations with an appropriate clustering tool.
Citations: 4
Inference of finite mixture models and the effect of binning
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-07-26, DOI: 10.1515/sagmb-2018-0035
Eva-Maria Geissen, J. Hasenauer, N. Radde
Finite mixture models are widely used in the life sciences for data analysis. Yet the calibration of these models to data is still challenging, as the optimization problems are often ill-posed. This holds for censored and uncensored data, and is caused by symmetries and other types of non-identifiability. Here, we discuss the problem of parameter estimation and model selection for finite mixture models from a theoretical perspective. We review the existing literature and illustrate the ill-posedness of the calibration problem for mixtures of uniform distributions and mixtures of normal distributions. Furthermore, we assess the effect of interval censoring on this estimation problem. Interestingly, we find that a proper treatment of censoring can make it easier to estimate the number of mixture components than inference from uncensored data does, a result that is surprising at first glance. The aim of the manuscript is to raise awareness of the challenges in calibrating finite mixture models and to provide an overview of available techniques.
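The two ideas in this abstract, interval censoring (binning) and non-identifiability through symmetries, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the choice of normal components, and the multinomial form of the binned likelihood are assumptions made here for illustration.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, weights, mus, sigmas):
    """CDF of a finite normal mixture (weights are assumed to sum to 1)."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


def binned_log_likelihood(edges, counts, weights, mus, sigmas):
    """Log-likelihood of bin counts, up to an additive constant.

    edges holds len(counts) + 1 sorted bin boundaries; bin i covers
    [edges[i], edges[i + 1]). Each observation is only known to fall in a
    bin, so it contributes the log of the mixture probability of that bin.
    """
    total = 0.0
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        p = mixture_cdf(hi, weights, mus, sigmas) - mixture_cdf(lo, weights, mus, sigmas)
        if n > 0:
            total += n * math.log(max(p, 1e-300))  # guard against log(0)
    return total


# Hypothetical binned data: four bins, the outer two unbounded.
edges = [float("-inf"), -1.0, 0.0, 1.0, float("inf")]
counts = [10, 40, 40, 10]

# Swapping the component labels leaves the likelihood unchanged: one of the
# symmetries that makes the calibration problem non-identifiable.
ll_a = binned_log_likelihood(edges, counts, [0.3, 0.7], [-1.0, 2.0], [1.0, 1.0])
ll_b = binned_log_likelihood(edges, counts, [0.7, 0.3], [2.0, -1.0], [1.0, 1.0])
```

Any optimizer applied to this objective therefore has at least as many equivalent optima as there are permutations of the component labels.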
Citations: 1
A novel individualized drug repositioning approach for predicting personalized candidate drugs for type 1 diabetes mellitus
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-07-09, DOI: 10.1515/sagmb-2018-0052
Hong Zheng

The high cost and high failure rate of drug development motivate the use of drug repositioning in drug discovery. Existing drug repositioning techniques mainly focus on discovering candidate drugs for a type of disease, and are not suitable for predicting candidate drugs for an individual sample. Type 1 diabetes mellitus (T1DM) is a disorder of glucose homeostasis caused by autoimmune destruction of the pancreatic β-cells. Here, we present a novel single-sample drug repositioning approach for predicting personalized candidate drugs for T1DM. Our method infers drug-disease associations by measuring the similarity between the individualized pathway aberrance induced by the disease and that induced by various drugs, using a Kolmogorov-Smirnov weighted Enrichment Score algorithm. Using this method, we predicted several potential candidate drugs for T1DM. Some of them have been reported for the treatment of diabetes mellitus, and some with a current indication for other diseases might be repurposed to treat T1DM. This study conducts drug discovery by detecting the functional connections between disease and drug action on a personalized basis. Our framework provides a rational way to carry out systematic personalized drug discovery for complex diseases and contributes to the future application of custom therapeutic decisions.
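The abstract names a Kolmogorov-Smirnov weighted Enrichment Score but does not spell out its form. The Python sketch below assumes the familiar GSEA-style running-sum statistic; the function name, the exponent `p`, and the toy pathway names are hypothetical and not taken from the paper.

```python
def weighted_enrichment_score(ranked_items, scores, item_set, p=1.0):
    """GSEA-style weighted Kolmogorov-Smirnov enrichment score (a sketch).

    ranked_items: items sorted by decreasing score.
    scores:       scores aligned with ranked_items.
    item_set:     the set whose enrichment at the top or bottom is tested.
    Returns the running-sum deviation from zero with the largest magnitude.
    """
    hits = [item in item_set for item in ranked_items]
    n_miss = len(ranked_items) - sum(hits)
    hit_norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if hit_norm == 0 or n_miss == 0:
        return 0.0  # degenerate case: nothing to contrast
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up by the weighted score on a hit, step down uniformly on a miss.
        running += abs(s) ** p / hit_norm if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy example with made-up pathway names and aberrance scores.
pathways = ["pw_a", "pw_b", "pw_c", "pw_d"]
aberrance = [4.0, 3.0, 2.0, 1.0]
es_top = weighted_enrichment_score(pathways, aberrance, {"pw_a", "pw_b"})
es_bottom = weighted_enrichment_score(pathways, aberrance, {"pw_c", "pw_d"})
```

A set concentrated at the top of the ranking yields a score near +1 and one concentrated at the bottom a score near -1, which is what makes the statistic usable as a similarity measure between disease-induced and drug-induced rankings.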

Citations: 1
A penalized regression approach for DNA copy number study using the sequencing data
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-05-30, DOI: 10.1515/sagmb-2018-0001
Jaeeun Lee, Jie Chen

Modeling high-throughput next generation sequencing (NGS) data from experiments that profile tumor and control samples to study DNA copy number variants (CNVs) remains challenging in several respects. In this application work, we provide an efficient method for detecting multiple CNVs using NGS reads ratio data. The method is based on a multiple change-point model fitted with a penalized regression approach, the 1d fused LASSO, which is designed for ordered data with a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously and efficiently. For tuning parameter selection, we propose a new modified Bayesian information criterion, called JMIC, and compare it with three different Bayesian information criteria used in the literature. Simulation results show that JMIC performs better for tuning parameter selection than the other three criteria. We applied our approach to sequencing data on the reads ratio between the breast tumor cell line HCC1954 and its matched normal cell line BL 1954, and the results are in line with those reported in the literature.
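The 1d fused LASSO mentioned above is usually written as minimizing 0.5 * sum_i (y_i - beta_i)^2 + lambda * sum_i |beta_{i+1} - beta_i|. The Python sketch below only evaluates that objective and reads change points off a candidate fit; it is not a solver, it does not implement JMIC, and the names `y`, `beta`, and `lam` are assumptions made here.

```python
def fused_lasso_objective(y, beta, lam):
    """Value of the 1d fused LASSO objective for a candidate fit beta.

    0.5 * sum_i (y_i - beta_i)^2 + lam * sum_i |beta_{i+1} - beta_i|
    """
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    penalty = lam * sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return fit + penalty


def change_points(beta, tol=1e-8):
    """Indices where the piecewise-constant fit jumps (segment starts)."""
    return [i for i in range(1, len(beta)) if abs(beta[i] - beta[i - 1]) > tol]


# Made-up log reads-ratio values with one copy-number shift at index 3.
y = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0]
segmented = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

obj_segmented = fused_lasso_objective(y, segmented, lam=0.5)
obj_raw = fused_lasso_objective(y, y, lam=0.5)  # interpolating fit
boundaries = change_points(segmented)
```

With the penalty switched on, the segmented fit scores better than the fit that interpolates the noisy data, and the jumps of the fitted vector give the number and locations of the candidate CNV boundaries at the same time.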

Citations: 3
Truncated rank correlation (TRC) as a robust measure of test-retest reliability in mass spectrometry data
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-05-30, DOI: 10.1515/sagmb-2018-0056
Johan Lim, Donghyeon Yu, Hsun-Chih Kuo, Hyungwon Choi, Scott Walmsley

In mass spectrometry (MS) experiments, thousands of peaks are detected in the space of mass-to-charge ratio and chromatographic retention time, each associated with an abundance measurement. However, a large proportion of the peaks consists of experimental noise, and low-abundance compounds are typically masked by noise peaks, compromising the quality of the data. In this paper, we propose a new measure of similarity between a pair of MS experiments, called truncated rank correlation (TRC). To provide a robust metric of similarity in noisy high-dimensional data, TRC uses truncated top ranks (or top m-ranks) for calculating correlation. A comprehensive numerical study suggests that TRC outperforms traditional sample correlation and Kendall's τ. We apply TRC to measuring the test-retest reliability of two MS experiments: a biological replicate analysis of the metabolome in HEK293 cells and metabolomic profiling of benign prostate hyperplasia (BPH) patients. An R package, trc, implementing TRC and related functions is available at https://sites.google.com/site/dhyeonyu/software.

Citations: 0
Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-05-11, DOI: 10.1515/sagmb-2018-0039
Yulan Liang, Adam Kelemen, Arpad Kelemen

Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of limited samples, biological factors such as environmental confounders, and inherent experimental and technical noise, compounded by inadequate statistical tools, can lead to misinterpretation of results and, subsequently, very different biological conclusions. In this paper, we investigate biomarker reproducibility issues potentially caused by differences between statistical methods with varied distributional assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compare the markers identified by statistical single-feature selection approaches with those from machine learning wrapper methods. The results reveal marked differences among the protein markers selected by the various methods, with potential selection biases and false discoveries, which may be due to small effects, different distributional assumptions, and p value type criteria versus prediction accuracies. Alternative solutions and other related issues are discussed in support of the reproducibility of findings for clinically actionable outcomes.
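The abstract does not say how its False Discovery Rate p values are computed. As a point of reference, the Python sketch below shows the standard Benjamini-Hochberg adjustment; treating it as the procedure behind the abstract's FDR values is an assumption, and the example p values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four proteins.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
selected = [i for i, q in enumerate(adj) if q <= 0.03]
```

A marker list chosen by thresholding raw p values can differ from one chosen by thresholding adjusted values, which is one simple way selection criteria alone can change which proteins are reported.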

Citations: 3
Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-05-01, DOI: 10.1515/sagmb-2018-0042
Luis M de Campos, Andrés Cano, Javier G Castellano, Serafín Moral

Gene Regulatory Networks (GRNs) are known as the most adequate instrument for providing clear insight into, and understanding of, cellular systems. One of the most successful techniques for reconstructing GRNs from gene expression data is Bayesian networks (BNs), which have proven to be an ideal approach for heterogeneous data integration in the learning process. So far, however, prior knowledge has been incorporated either through prior beliefs or by using networks as a starting point in the search process. In this work, we consider the use of different kinds of structural restrictions within algorithms for learning BNs from gene expression data. These restrictions codify prior knowledge in such a way that a BN must satisfy them. One aim of this work is therefore to review in detail the use of prior knowledge and gene expression data for inferring GRNs with BNs, but the major purpose is to investigate whether structural learning algorithms for BNs can achieve better outcomes from expression data by exploiting this prior knowledge through structural restrictions. The experimental study shows that this new way of incorporating prior knowledge leads to better reverse-engineered networks.
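The abstract does not list the kinds of structural restrictions it uses. The Python sketch below assumes three plausible ones (arcs that must exist, arcs that must be absent, and a partial node ordering) and checks whether a candidate network satisfies them; the function name and the gene labels are hypothetical.

```python
def satisfies_restrictions(edges, required=(), forbidden=(), order=()):
    """Check a candidate directed network against structural restrictions.

    edges:     iterable of (parent, child) arcs in the candidate network.
    required:  arcs that prior knowledge says must be present.
    forbidden: arcs that prior knowledge says must be absent.
    order:     nodes listed so that an arc may only point from an earlier
               node to a later one (nodes not listed are unconstrained).
    """
    edge_set = set(edges)
    if any(arc not in edge_set for arc in required):
        return False
    if any(arc in edge_set for arc in forbidden):
        return False
    position = {node: k for k, node in enumerate(order)}
    for parent, child in edge_set:
        if parent in position and child in position:
            if position[parent] > position[child]:
                return False  # arc runs against the prescribed ordering
    return True


# Made-up example: a transcription factor regulating a short cascade.
candidate = [("TF1", "G1"), ("G1", "G2")]
ok = satisfies_restrictions(candidate, required=[("TF1", "G1")])
blocked = satisfies_restrictions(candidate, forbidden=[("G1", "G2")])
```

In a score-based search, a check like this could be applied to each proposed arc addition, deletion, or reversal so that only networks consistent with the prior knowledge are ever scored.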

Citations: 6
Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships
IF 0.9, Zone 4 (Mathematics), Q3 Mathematics, Pub Date: 2019-04-08, DOI: 10.1515/sagmb-2018-0030
Gail Gong, Wei Wang, Chih-Lin Hsieh, David J Van Den Berg, Christopher Haiman, Ingrid Oakley-Girvan, Alice S Whittemore

Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. Such evaluation, however, requires multi-locus association tests with good power regardless of the variant and trait characteristics. Moreover, since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, using combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods can be implemented using open-source code available for public use as the R package GATARS (Genetic Association Tests for Arbitrarily Related Subjects), https://gailg.github.io/gatars/.

全基因组测序能够评估性状之间的关联以及基因和途径中变异的组合。但这样的评价需要多基因座关联检验,且检验的效力较好,而不考虑变异和性状特征。而且,由于分析家庭可能比分析不相关的个体产生更多的力量,我们需要适用于相关和不相关个体的多位点测试。在这里,我们描述了这样的测试,并介绍了SKAT-X,这是一种新的测试统计量,它使用从相关或不相关的受试者获得的全基因组数据来优化手头特定数据的功率。仿真结果表明:a)无论变异特性和性状特性如何,SKAT-X都具有良好的性能;b)对于二元性状,分析受影响的亲属比分析不相关的个体更有效,这与先前的单位点测试结果一致。我们将这些方法应用于肿瘤抑制基因BRCA2中罕见的未分类错义变异,并应用于多种族队列(MEC)中来自前列腺癌家族和不相关前列腺癌病例和对照的综合数据。这些方法可以使用开放源代码作为r包GATARS(任意相关主题的遗传关联测试)来实现。
{"title":"Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships.","authors":"Gail Gong,&nbsp;Wei Wang,&nbsp;Chih-Lin Hsieh,&nbsp;David J Van Den Berg,&nbsp;Christopher Haiman,&nbsp;Ingrid Oakley-Girvan,&nbsp;Alice S Whittemore","doi":"10.1515/sagmb-2018-0030","DOIUrl":"https://doi.org/10.1515/sagmb-2018-0030","url":null,"abstract":"<p><p>Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. But such evaluation requires multi-locus association tests with good power, regardless of the variant and trait characteristics. And since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, as applied to combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). 
The methods can be implemented using open-source code for public use as the R-package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) <https://gailg.github.io/gatars/>.</p>","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"18 3","pages":""},"PeriodicalIF":0.9,"publicationDate":"2019-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2018-0030","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37127926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A multivariate linear model for investigating the association between gene-module co-expression and a continuous covariate.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Volume 18, Issue 2 Pub Date: 2019-03-15 DOI: 10.1515/sagmb-2018-0008
Trishanta Padayachee, Tatsiana Khamiakova, Ziv Shkedy, Perttu Salo, Markus Perola, Tomasz Burzykowski

A way to enhance our understanding of the development and progression of complex diseases is to investigate the influence of cellular environments on gene co-expression (i.e. gene-pair correlations). Often, changes in gene co-expression are investigated across two or more biological conditions defined by categorizing a continuous covariate. However, the selection of arbitrary cut-off points may have an influence on the results of an analysis. To address this issue, we use a general linear model (GLM) for correlated data to study the relationship between gene-module co-expression and a covariate like metabolite concentration. The GLM specifies the gene-pair correlations as a function of the continuous covariate. The use of the GLM allows for investigating different (linear and non-linear) patterns of co-expression. Furthermore, the modeling approach offers a formal framework for testing hypotheses about possible patterns of co-expression. In our paper, a simulation study is used to assess the performance of the GLM. The performance is compared with that of a previously proposed GLM that utilizes categorized covariates. The versatility of the model is illustrated by using a real-life example. We discuss the theoretical issues related to the construction of the test statistics and the computational challenges related to fitting of the proposed model.
Citations: 0