Statistical Applications in Genetics and Molecular Biology: Latest Articles

Inference of finite mixture models and the effect of binning
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-07-26 DOI: 10.1515/sagmb-2018-0035
Eva-Maria Geissen, J. Hasenauer, N. Radde
Finite mixture models are widely used in the life sciences for data analysis. Yet calibrating these models to data remains challenging because the optimization problems are often ill-posed. This holds for both censored and uncensored data and is caused by symmetries and other types of non-identifiability. Here, we discuss parameter estimation and model selection for finite mixture models from a theoretical perspective. We review the existing literature and illustrate the ill-posedness of the calibration problem for mixtures of uniform distributions and mixtures of normal distributions. Furthermore, we assess the effect of interval censoring on this estimation problem. Interestingly, we find that a proper treatment of censoring can facilitate the estimation of the number of mixture components compared to inference from uncensored data, a result that is surprising at first glance. The aim of the manuscript is to raise awareness of challenges in the calibration of finite mixture models and to provide an overview of available techniques.
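To make the notion of interval censoring (binning) concrete, the following is a minimal sketch of a binned-data log-likelihood for a mixture of normal distributions. It is not taken from the paper: the function names, the bin edges, the counts, and the parameter values are all hypothetical, and the multinomial normalizing constant is omitted because it does not depend on the parameters.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probabilities(edges, weights, means, sds):
    """Probability mass the mixture assigns to each bin [edges[k], edges[k+1])."""
    def mixture_cdf(x):
        return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, means, sds))
    return [mixture_cdf(hi) - mixture_cdf(lo) for lo, hi in zip(edges[:-1], edges[1:])]

def binned_log_likelihood(edges, counts, weights, means, sds):
    """Log-likelihood of bin counts, up to an additive constant.

    Each observation is only known to fall into a bin, so it contributes
    the log of the bin probability rather than a density value.
    """
    probs = bin_probabilities(edges, weights, means, sds)
    # Guard against log(0) for bins with vanishing probability mass.
    return sum(n * math.log(max(p, 1e-300)) for n, p in zip(counts, probs) if n > 0)

# Hypothetical binned data: five bins covering the whole real line.
edges = [-math.inf, 0.0, 1.0, 2.0, 3.0, math.inf]
counts = [12, 30, 25, 21, 12]

ll = binned_log_likelihood(edges, counts, [0.4, 0.6], [0.5, 2.5], [0.5, 0.7])
# Same mixture with the two components listed in the opposite order.
ll_swapped = binned_log_likelihood(edges, counts, [0.6, 0.4], [2.5, 0.5], [0.7, 0.5])
print(ll, ll_swapped)
```

Swapping the component labels leaves the likelihood unchanged. This label-switching symmetry is one simple example of the kind of non-identifiability the abstract refers to.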
Citations: 1
A novel individualized drug repositioning approach for predicting personalized candidate drugs for type 1 diabetes mellitus.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-07-09 DOI: 10.1515/sagmb-2018-0052
Hong Zheng

The high cost and high failure rate of drug development motivate the use of drug repositioning in drug discovery. Existing drug repositioning techniques mainly focus on discovering candidate drugs for a disease as a whole and are not suitable for predicting candidate drugs for an individual sample. Type 1 diabetes mellitus (T1DM) is a disorder of glucose homeostasis caused by autoimmune destruction of pancreatic β-cells. Here, we present a novel single-sample drug repositioning approach for predicting personalized candidate drugs for T1DM. Our method infers drug-disease associations by measuring the similarity between the individualized pathway aberrance induced by the disease and that induced by various drugs, using a Kolmogorov-Smirnov weighted enrichment score algorithm. Using this method, we predicted several potential candidate drugs for T1DM. Some of them have been reported for the treatment of diabetes mellitus, and some with a current indication for other diseases might be repurposed to treat T1DM. This study conducts drug discovery by detecting the functional connections between disease and drug action on a personalized basis. Our framework provides a rational approach to systematic personalized drug discovery for complex diseases and contributes to the future application of customized therapeutic decisions.
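As an illustration of what a Kolmogorov-Smirnov weighted enrichment score computes, here is a minimal sketch of the widely used GSEA-style running-sum statistic. Whether the paper uses exactly this variant is an assumption; the function name, the exponent `p`, and the toy pathway names are hypothetical.

```python
def weighted_ks_enrichment_score(ranked_items, scores, item_set, p=1.0):
    """GSEA-style weighted Kolmogorov-Smirnov enrichment score.

    ranked_items: items sorted by decreasing score (for example pathways or genes).
    scores:       the score of each item, in the same order.
    item_set:     the set whose enrichment at the top or bottom is tested.
    p:            weighting exponent; p = 0 gives the unweighted KS statistic.
    """
    hits = [item in item_set for item in ranked_items]
    n_miss = len(hits) - sum(hits)
    hit_weight_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if n_miss == 0 or hit_weight_total == 0:
        raise ValueError("need at least one miss and one hit with non-zero weight")

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / hit_weight_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                    # step down on a miss
        if abs(running) > abs(best):
            best = running                             # keep the largest deviation from zero
    return best

# Hypothetical ranked list of four pathways with aberrance scores.
ranked = ["pw_a", "pw_b", "pw_c", "pw_d"]
aberrance = [4.0, 3.0, 2.0, 1.0]
es_top = weighted_ks_enrichment_score(ranked, aberrance, {"pw_a", "pw_b"})
es_bottom = weighted_ks_enrichment_score(ranked, aberrance, {"pw_c", "pw_d"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking yields a positive score and a set concentrated at the bottom yields a negative one, which is what allows disease-induced and drug-induced signatures to be compared by sign and magnitude.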

Citations: 1
A penalized regression approach for DNA copy number study using the sequencing data.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-05-30 DOI: 10.1515/sagmb-2018-0001
Jaeeun Lee, Jie Chen

Modeling high-throughput next-generation sequencing (NGS) data from experiments that profile tumor and control samples to study DNA copy number variants (CNVs) remains a challenge in various ways. In this application work, we provide an efficient method for detecting multiple CNVs using NGS read-ratio data. The method is based on a multiple statistical change-point model fitted with a penalized regression approach, the 1d fused LASSO, which is designed for ordered data in a one-dimensional structure. In addition, since the path algorithm traces the solution as a function of a tuning parameter, the number and locations of potential CNV region boundaries can be estimated simultaneously and efficiently. For tuning parameter selection, we propose a new modified Bayesian information criterion, called JMIC, and compare it with three different Bayesian information criteria used in the literature. Simulation results show that JMIC performs better for tuning parameter selection than the other three criteria. We applied our approach to read-ratio sequencing data from the breast tumor cell line HCC1954 and its matched normal cell line BL 1954, and the results are in line with those reported in the literature.
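For orientation, the one-dimensional fused LASSO is commonly written as the optimization problem below. This is the standard textbook form and is given here as an assumption, since the abstract does not state the exact objective. The notation is illustrative: $y_i$ is the read-ratio observation at ordered position $i$, $\beta_i$ is the fitted piecewise-constant level, and $\lambda \ge 0$ is the tuning parameter.

```latex
\hat{\beta}(\lambda)
  = \arg\min_{\beta \in \mathbb{R}^{n}}
    \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \beta_i \right)^2
    + \lambda \sum_{i=1}^{n-1} \left| \beta_{i+1} - \beta_i \right|
```

Positions $i$ with $\hat{\beta}_{i+1}(\lambda) \neq \hat{\beta}_i(\lambda)$ are the estimated change points. A larger $\lambda$ fuses more neighbouring levels and so yields fewer candidate CNV boundaries, which is why the choice of $\lambda$ determines both the number and the locations of the boundaries.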

Citations: 3
Truncated rank correlation (TRC) as a robust measure of test-retest reliability in mass spectrometry data.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-05-30 DOI: 10.1515/sagmb-2018-0056
Johan Lim, Donghyeon Yu, Hsun-Chih Kuo, Hyungwon Choi, Scott Walmsley

In mass spectrometry (MS) experiments, thousands of peaks are detected in the space of mass-to-charge ratio and chromatographic retention time, each associated with an abundance measurement. However, a large proportion of the peaks consists of experimental noise, and low-abundance compounds are typically masked by noise peaks, compromising the quality of the data. In this paper, we propose a new measure of similarity between a pair of MS experiments, called truncated rank correlation (TRC). To provide a robust metric of similarity in noisy high-dimensional data, TRC uses truncated top ranks (or top m-ranks) for calculating correlation. A comprehensive numerical study suggests that TRC outperforms traditional sample correlation and Kendall's τ. We apply TRC to measuring test-retest reliability in two MS experiments: a biological replicate analysis of the metabolome in HEK293 cells and metabolomic profiling of benign prostate hyperplasia (BPH) patients. An R package, trc, implementing the proposed TRC and related functions is available at https://sites.google.com/site/dhyeonyu/software.
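The idea of correlating only the top m ranks can be sketched as follows. This is one plausible reading of "truncated top ranks" and not the authors' definition or the API of the trc package: here every peak ranked below m is assigned the common rank m + 1, and a Pearson correlation is computed on the truncated ranks. All function names and the toy abundances are hypothetical.

```python
import math

def descending_ranks(values):
    """Rank 1 goes to the largest value; ties are broken by position (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def truncated_ranks(values, m):
    """Keep the top m ranks and collapse all lower ranks to m + 1."""
    return [min(rank, m + 1) for rank in descending_ranks(values)]

def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def truncated_rank_correlation(x, y, m):
    """Correlation of the two experiments after truncating ranks at m."""
    return pearson(truncated_ranks(x, m), truncated_ranks(y, m))

# Hypothetical abundances: three real peaks followed by three noise peaks.
run_1 = [10.0, 9.0, 8.0, 0.3, 0.1, 0.2]
run_2 = [20.0, 18.0, 16.0, 0.1, 0.3, 0.2]
trc_top3 = truncated_rank_correlation(run_1, run_2, m=3)
full_rank_corr = truncated_rank_correlation(run_1, run_2, m=len(run_1))
print(trc_top3, full_rank_corr)
```

In this toy example the two runs agree perfectly on the three strong peaks, so the truncated version reports full agreement, while the untruncated rank correlation is pulled down by the reshuffled noise peaks.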

Citations: 0
Reproducibility of biomarker identifications from mass spectrometry proteomic data in cancer studies.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-05-11 DOI: 10.1515/sagmb-2018-0039
Yulan Liang, Adam Kelemen, Arpad Kelemen

Reproducibility of disease signatures and clinical biomarkers in multi-omics disease analysis has been a key challenge due to a multitude of factors. The heterogeneity of the limited sample, various biological factors such as environmental confounders, and inherent experimental and technical noise, compounded by the inadequacy of statistical tools, can lead to misinterpretation of results and subsequently to very different biological conclusions. In this paper, we investigate biomarker reproducibility issues potentially caused by differences between statistical methods with varied distribution assumptions or marker selection criteria, using mass spectrometry proteomic ovarian tumor data. We examine the relationship between effect sizes, p values, Cauchy p values, False Discovery Rate p values, and the rank fractions of identified proteins out of thousands in the limited heterogeneous sample. We compare the markers identified by statistical single-feature selection approaches with those identified by machine learning wrapper methods. The results reveal marked differences between the protein markers selected by the various methods, with potential selection biases and false discoveries, which may be due to small effects, different distribution assumptions, and p value type criteria versus prediction accuracies. Alternative solutions and other related issues are discussed in support of the reproducibility of findings for clinically actionable outcomes.
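To show how raw p values and False Discovery Rate adjusted p values can lead to different marker lists, here is a minimal sketch of the Benjamini-Hochberg adjustment. The abstract does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption, and the p values and the 0.05 threshold below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the adjusted values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p values for six proteins.
raw_p = [0.001, 0.012, 0.030, 0.040, 0.045, 0.600]
adj_p = benjamini_hochberg(raw_p)

selected_raw = [i for i, p in enumerate(raw_p) if p < 0.05]
selected_fdr = [i for i, p in enumerate(adj_p) if p < 0.05]
print(selected_raw, selected_fdr)
```

With these toy values, five proteins pass a raw threshold of 0.05 but only two survive the adjustment, a small-scale version of the criterion-dependent marker lists the abstract describes.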

Citations: 3
Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-05-01 DOI: 10.1515/sagmb-2018-0042
Luis M de Campos, Andrés Cano, Javier G Castellano, Serafín Moral

Gene regulatory networks (GRNs) are regarded as the most adequate instrument for gaining clear insight into and understanding of cellular systems. One of the most successful techniques for reconstructing GRNs from gene expression data is the Bayesian network (BN), which has proven to be an ideal approach for integrating heterogeneous data in the learning process. So far, however, prior knowledge has been incorporated either through prior beliefs or by using networks as a starting point in the search process. In this work, we consider the use of different kinds of structural restrictions within algorithms for learning BNs from gene expression data. These restrictions codify prior knowledge in such a way that a BN must satisfy them. One aim of this work is therefore to give a detailed review of the use of prior knowledge and gene expression data to infer GRNs with BNs, but the major purpose of this paper is to investigate whether structural learning algorithms for BNs from expression data can achieve better outcomes by exploiting this prior knowledge through structural restrictions. The experimental study shows that this new way of incorporating prior knowledge leads to better reverse-engineered networks.
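A structural restriction can be thought of as a hard constraint that a candidate network must pass before it is scored. The sketch below checks two kinds of restriction, required arcs and forbidden arcs, plus acyclicity. The abstract does not list the restriction types used in the paper, so these two are assumptions, and the function names and gene names are hypothetical.

```python
def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)

def is_admissible(nodes, edges, required=(), forbidden=()):
    """True if the candidate is a DAG that contains every required arc
    and none of the forbidden arcs."""
    edge_set = set(edges)
    has_required = set(required) <= edge_set
    has_forbidden = bool(set(forbidden) & edge_set)
    return is_acyclic(nodes, edges) and has_required and not has_forbidden

# Hypothetical prior knowledge: geneA regulates geneB; geneC never regulates geneA.
genes = ["geneA", "geneB", "geneC"]
required_arcs = [("geneA", "geneB")]
forbidden_arcs = [("geneC", "geneA")]

candidate_ok = [("geneA", "geneB"), ("geneB", "geneC")]
candidate_bad = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneA")]
print(is_admissible(genes, candidate_ok, required_arcs, forbidden_arcs))
print(is_admissible(genes, candidate_bad, required_arcs, forbidden_arcs))
```

A search procedure such as hill climbing could call a check like this on each proposed arc addition, deletion, or reversal and discard moves that leave the admissible set, which is how the restrictions shrink the search space.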

Citations: 6
Data-adaptive multi-locus association testing in subjects with arbitrary genealogical relationships.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-04-08 DOI: 10.1515/sagmb-2018-0030
Gail Gong, Wei Wang, Chih-Lin Hsieh, David J Van Den Berg, Christopher Haiman, Ingrid Oakley-Girvan, Alice S Whittemore

Genome-wide sequencing enables evaluation of associations between traits and combinations of variants in genes and pathways. But such evaluation requires multi-locus association tests with good power, regardless of the variant and trait characteristics. And since analyzing families may yield more power than analyzing unrelated individuals, we need multi-locus tests applicable to both related and unrelated individuals. Here we describe such tests, and we introduce SKAT-X, a new test statistic that uses genome-wide data obtained from related or unrelated subjects to optimize power for the specific data at hand. Simulations show that: a) SKAT-X performs well regardless of variant and trait characteristics; and b) for binary traits, analyzing affected relatives brings more power than analyzing unrelated individuals, consistent with previous findings for single-locus tests. We illustrate the methods by application to rare unclassified missense variants in the tumor suppressor gene BRCA2, using combined data from prostate cancer families and unrelated prostate cancer cases and controls in the Multi-ethnic Cohort (MEC). The methods are available as open-source code in the R package GATARS (Genetic Association Tests for Arbitrarily Related Subjects) at https://gailg.github.io/gatars/.

Citations: 1
A multivariate linear model for investigating the association between gene-module co-expression and a continuous covariate.
IF 0.9 Zone 4 (Mathematics) Q3 Mathematics Pub Date: 2019-03-15 DOI: 10.1515/sagmb-2018-0008
Trishanta Padayachee, Tatsiana Khamiakova, Ziv Shkedy, Perttu Salo, Markus Perola, Tomasz Burzykowski

A way to enhance our understanding of the development and progression of complex diseases is to investigate the influence of cellular environments on gene co-expression (i.e. gene-pair correlations). Often, changes in gene co-expression are investigated across two or more biological conditions defined by categorizing a continuous covariate. However, the selection of arbitrary cut-off points may have an influence on the results of an analysis. To address this issue, we use a general linear model (GLM) for correlated data to study the relationship between gene-module co-expression and a covariate like metabolite concentration. The GLM specifies the gene-pair correlations as a function of the continuous covariate. The use of the GLM allows for investigating different (linear and non-linear) patterns of co-expression. Furthermore, the modeling approach offers a formal framework for testing hypotheses about possible patterns of co-expression. In our paper, a simulation study is used to assess the performance of the GLM. The performance is compared with that of a previously proposed GLM that utilizes categorized covariates. The versatility of the model is illustrated by using a real-life example. We discuss the theoretical issues related to the construction of the test statistics and the computational challenges related to fitting of the proposed model.

Citations: 0
netprioR: a probabilistic model for integrative hit prioritisation of genetic screens.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2019-03-06 DOI: 10.1515/sagmb-2018-0033
Fabian Schmich, Jack Kuipers, Gunter Merdes, Niko Beerenwinkel

In the post-genomic era of big data in biology, computational approaches to integrate multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene-gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.

Citations: 1
Discrete Wavelet Packet Transform Based Discriminant Analysis for Whole Genome Sequences.
IF 0.9 Zone 4 Mathematics Q3 Mathematics Pub Date: 2019-02-15 DOI: 10.1515/sagmb-2018-0045
Hsin-Hsiung Huang, Senthil Balaji Girimurugan
In recent years, alignment-free methods have been widely applied in comparing genome sequences, as these methods compute efficiently and provide desirable phylogenetic analysis results. These methods have been successfully combined with hierarchical clustering methods for finding phylogenetic trees. However, it may not be suitable to apply these alignment-free methods directly to existing statistical classification methods, because an appropriate statistical classification theory for integrating with the alignment-free representation methods is still lacking. In this article, we propose a discriminant analysis method which uses the discrete wavelet packet transform to classify whole genome sequences. The proposed alignment-free representation statistics of features follow a joint normal distribution asymptotically. The data analysis results indicate that the proposed method provides satisfactory classification results in real time.
Citations: 1