Pub Date: 2022-07-01. Epub Date: 2021-05-20. DOI: 10.1002/wics.1558
Dorothy Ellis, Dongyuan Wu, Susmita Datta
Due to the development of next-generation RNA sequencing (NGS) technologies, there has been tremendous progress in research on the role of genomics, transcriptomics and epigenomics in complex biological systems. However, scientists have realized that data obtained using earlier technology, frequently called 'bulk RNA-seq' data, provide information averaged across all the cells present in a tissue. The more recently developed single-cell RNA sequencing (scRNA-seq) technology provides transcriptomic information at single-cell resolution. Nevertheless, these high-resolution data have their own complexities and demand novel statistical methods to yield effective and accurate results on complex biological systems. In this review, we cover many such recently developed statistical methods, both for researchers wanting to pursue statistical and computational scRNA-seq research and for scientists wanting to learn about existing methods and free software tools available for their own data. This review is certainly not exhaustive due to page limitations. We have tried to cover the popular methods, starting from quality control, moving to the downstream analysis of finding differentially expressed genes, and concluding with a brief description of network analysis.
Title: SAREV: A review on statistical analytics of single-cell RNA sequencing data. Journal: Wiley Interdisciplinary Reviews-Computational Statistics.
F. Llorente, Luca Martino, E. Curbelo, J. Lopez-Santiago, D. Delgado
The application of Bayesian inference for the purpose of model selection is very popular nowadays. In this framework, models are compared through their marginal likelihoods, or their quotients, called Bayes factors. However, marginal likelihoods depend on the choice of prior. For model selection, even diffuse priors can actually be very informative, unlike in the parameter estimation problem. Furthermore, when the prior is improper, the marginal likelihood of the corresponding model is undetermined. In this work, we discuss the issue of prior sensitivity of the marginal likelihood and its role in model selection. We also comment on the use of uninformative priors, which are very common choices in practice. Several practical suggestions are discussed, and many possible solutions proposed in the literature for designing objective priors for model selection are described. Some of them also allow the use of improper priors. The connection between the marginal likelihood approach and the well-known information criteria is also presented. We describe the main issues and possible solutions through illustrative numerical examples, and also provide some related code. One of the examples involves a real-world application to exoplanet detection.
Title: On the safe use of prior densities for Bayesian model selection. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-06-10. DOI: 10.1002/wics.1595
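The prior sensitivity described in this abstract can be illustrated with a toy example that is not taken from the paper. Assume data y_1, ..., y_n with known unit variance, a null model M0 with mean 0, and an alternative M1 whose mean has a N(0, tau^2) prior. Under these assumptions the Bayes factor depends on the data only through the sample mean, and the names `bayes_factor_01`, `ybar`, and `tau` are illustrative choices. A minimal sketch:

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var`, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_01(ybar, n, tau):
    """Bayes factor of M0 (mean = 0) against M1 (mean ~ N(0, tau^2)).

    With unit observation variance, the sample mean ybar is N(0, 1/n)
    under M0 and N(0, tau^2 + 1/n) marginally under M1; every other
    factor of the two marginal likelihoods cancels in the ratio.
    """
    return normal_pdf(ybar, 1.0 / n) / normal_pdf(ybar, tau ** 2 + 1.0 / n)


# Same (hypothetical) data summary, increasingly diffuse priors on the mean.
ybar, n = 0.3, 50
taus = [1.0, 10.0, 100.0, 1000.0]
bfs = [bayes_factor_01(ybar, n, tau) for tau in taus]

for tau, bf in zip(taus, bfs):
    print(f"tau = {tau:7.1f}   BF_01 = {bf:10.3f}")
```

With the data held fixed, widening the prior steadily shifts the Bayes factor toward the simpler model, which is one way to see why a diffuse prior is not "uninformative" for model selection.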
The explosive growth in text data emphasizes the need for new, computationally efficient methods, with credible theoretical support, tailored to analyzing such large-scale data. Most of this vast amount of unstructured data is not classified, so unsupervised learning techniques prove useful in this field. Document clustering has proven to be an efficient tool for organizing textual documents, and it has been widely applied in areas ranging from information retrieval to topic modeling. Before introducing document clustering algorithms, the principal steps of the whole process, including the mathematical representation of documents and the preprocessing phase, are discussed. Then, the main clustering algorithms used for text data are critically analyzed, considering prototype-based, graph-based, hierarchical, and model-based approaches.
Title: Document clustering. Authors: Irene Cozzolino, M. Ferraro. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-06-08. DOI: 10.1002/wics.1588
Dionne Swift, Kellen Cresswell, Robert Johnson, Spiro C. Stilianoudakis, Xingtao Wei
The recent development of cost-effective high-throughput DNA sequencing technologies has greatly expanded microbiome research. However, it has been well documented that observed microbiome data suffer from compositionality, sparsity, and high variability, all of which pose serious challenges for analysis. Over the last decade, there has been considerable interest in statistical and computational methods to tackle these challenges. The type of inference sought guides the selection of appropriate statistical methods, since only a few methods allow inference on absolute abundance while most allow inference on relative abundance. An overview of recent methods for differential abundance analysis and normalization of microbiome data is presented, focusing on methods that are accessible but have not been widely covered in previous literature. In detailed descriptions of each method, we discuss its assumptions and whether and how it addresses the challenges of microbiome data. These methods are compared based on accuracy metrics in real and simulated settings. The goal is to provide a comprehensive but non-exhaustive set of potential and easily accessible tools for differential abundance analysis and normalization of microbiome data.
Title: A review of normalization and differential abundance methods for microbiome counts data. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-05-18. DOI: 10.1002/wics.1586
Optimal transport (OT) methods seek a transformation map (or plan) between two probability measures, such that the transformation has the minimum transportation cost. Such a minimum transport cost, with a certain power transform, is called the Wasserstein distance. Recently, OT methods have drawn great attention in statistics, machine learning, and computer science, especially in deep generative neural networks. Despite their broad applications, estimating high-dimensional Wasserstein distances is a well-known challenge owing to the curse of dimensionality. Several cutting-edge projection-based techniques tackle high-dimensional OT problems. Three major approaches are introduced: the slicing approach, the iterative projection approach, and the projection robust OT approach. Open challenges are discussed at the end of the review.
Title: Projection-based techniques for high-dimensional optimal transport problems. Authors: Jingyi Zhang, Ping Ma, Wenxuan Zhong, Cheng Meng. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-05-13. DOI: 10.1002/wics.1587
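The slicing approach named in this abstract can be sketched as follows. This is a generic Monte Carlo sliced Wasserstein estimate, not code from the review, and it assumes two point clouds of equal size with uniform weights, so that each one-dimensional problem is solved by sorting. The function names and the defaults (`n_proj=200`, `p=2`, `seed=0`) are illustrative assumptions.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def wasserstein_1d_pow(xs, ys, p=2):
    """p-th power of the Wasserstein-p distance between two equal-size
    empirical measures on the line: match sorted points pairwise."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)


def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Average the 1-D Wasserstein cost over random projections, then take the 1/p power."""
    if len(X) != len(Y):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        theta = random_direction(dim, rng)
        proj_x = [sum(t * c for t, c in zip(theta, x)) for x in X]
        proj_y = [sum(t * c for t, c in zip(theta, y)) for y in Y]
        total += wasserstein_1d_pow(proj_x, proj_y, p)
    return (total / n_proj) ** (1.0 / p)


# Synthetic 5-dimensional cloud and a copy shifted by 3 along the first axis.
rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
Y = [[x[0] + 3.0] + x[1:] for x in X]

sw_same = sliced_wasserstein(X, X)
sw_shift = sliced_wasserstein(X, Y)
print(sw_same, sw_shift)
```

Each projection reduces the problem to one dimension, where sorting solves the transport problem exactly. That is what keeps the cost low in high dimensions, at the price of Monte Carlo error from the random directions.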
Xiaoyu Zhang, Zhenwei Zhou, Hanfei Xu, Ching-Ti Liu
Integrative analysis of multi-omics data has drawn much attention from the scientific community, owing to technological advances that have generated various kinds of omics data. Leveraging these multi-omics data potentially provides a more comprehensive view of disease mechanisms and biological processes. Integrative multi-omics clustering is an unsupervised integrative method used specifically to find coherent groups of samples or features by utilizing information across multi-omics data. It aims to better stratify diseases and to suggest biological mechanisms and potential targeted therapies for them. However, applying integrative multi-omics clustering is both statistically and computationally challenging, for reasons such as high dimensionality and heterogeneity. In this review, we group integrative multi-omics clustering methods into three general categories, concatenated clustering, clustering of clusters, and interactive clustering, based on when and how the multi-omics data are processed for clustering. We further classify the methods under each category into different approaches based on the main statistical strategy used during clustering. In addition, we provide recommended practices tailored to four real-life scenarios to help researchers choose among integrative multi-omics clustering methods in future studies.
Title: Integrative clustering methods for multi-omics data. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-05-01. DOI: 10.1002/wics.1553
P. Craigmile, Radu Herbei, Geoffrey Liu, Grant Schneider
Many scientific fields have experienced growth in the use of stochastic differential equations (SDEs), also known as diffusion processes, to model scientific phenomena over time. SDEs can simultaneously capture the known deterministic dynamics of underlying variables of interest (e.g., ocean flow, chemical and physical characteristics of a body of water, presence, absence, and spread of a disease), while enabling a modeler to capture the unknown random dynamics in a stochastic setting. We focus on reviewing a wide range of statistical inference methods for likelihood‐based frequentist and Bayesian parametric inference based on discretely‐sampled diffusions. Exact parametric inference is not usually possible because the transition density is not available in closed form. Thus, we review the literature on approximate numerical methods (e.g., Euler, Milstein, local linearization, and Aït‐Sahalia) and simulation‐based approaches (e.g., data augmentation and exact sampling) that are used to carry out parametric statistical inference on SDE processes. We close with a brief discussion of other methods of inference for SDEs and more complex SDE processes such as spatio‐temporal SDEs.
Title: Statistical inference for stochastic differential equations. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-04-27. DOI: 10.1002/wics.1585
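The Euler approximation mentioned in this abstract replaces the unknown transition density with a Gaussian whose mean and variance come from a single Euler step. As a hedged illustration that is not taken from the review, the sketch below assumes an Ornstein-Uhlenbeck process dX = -theta X dt + sigma dW with known sigma, simulates it on a regular grid, and maximizes the Euler pseudo-likelihood in theta. The model, the parameter values, and all function names are assumptions made for the example.

```python
import math
import random


def simulate_ou(theta, sigma, dt, n_steps, x0=0.0, seed=1):
    """Simulate dX = -theta * X dt + sigma dW on a regular grid, using the
    exact Gaussian transition (available in closed form for this model)."""
    rng = random.Random(seed)
    decay = math.exp(-theta * dt)
    sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * theta))
    path = [x0]
    for _ in range(n_steps):
        path.append(path[-1] * decay + sd * rng.gauss(0.0, 1.0))
    return path


def euler_loglik(theta, sigma, path, dt):
    """Euler pseudo-log-likelihood: X_{t+dt} | X_t is approximated by
    N(X_t - theta * X_t * dt, sigma^2 * dt)."""
    var = sigma * sigma * dt
    total = 0.0
    for x_now, x_next in zip(path, path[1:]):
        mean = x_now - theta * x_now * dt
        total += -0.5 * math.log(2.0 * math.pi * var) - (x_next - mean) ** 2 / (2.0 * var)
    return total


def euler_mle_theta(path, dt):
    """Closed-form maximizer of the Euler pseudo-likelihood in theta
    (the pseudo-log-likelihood is quadratic in theta for this drift)."""
    num = sum(x * (y - x) for x, y in zip(path, path[1:]))
    den = sum(x * x for x in path[:-1])
    return -num / (dt * den)


theta_true, sigma, dt = 1.0, 0.5, 0.01
path = simulate_ou(theta_true, sigma, dt, n_steps=50_000)
theta_hat = euler_mle_theta(path, dt)
print(f"true theta = {theta_true}, Euler estimate = {theta_hat:.3f}")
```

The Euler estimate carries a discretization bias that shrinks as the sampling interval decreases. Reducing that bias is one motivation for higher-order approximations and simulation-based approaches of the kind the abstract lists.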
This review looks at function minimization and nonlinear least squares, possibly bound-constrained, using R. These tools derive from the more general context of numerical optimization and mathematical programming. The review highlights how R developers have tried to make such tools easier to apply for users unfamiliar with optimization. Some limitations of the methods and their implementations are mentioned to provide perspective.
Title: Function minimization and nonlinear least squares in R. Authors: J. Nash. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-03-24. DOI: 10.1002/wics.1580
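To make the nonlinear least squares problem concrete, here is a small, unconstrained sketch written in Python rather than R, so it does not reproduce any R function or any code from the review. It assumes a two-parameter model y = a * exp(-b * t), synthetic noise-free data, and a basic Levenberg-Marquardt damping rule. The function names, starting values, and damping factors are illustrative assumptions.

```python
import math


def residuals(a, b, ts, ys):
    """Model-minus-data residuals for y = a * exp(-b * t)."""
    return [a * math.exp(-b * t) - y for t, y in zip(ts, ys)]


def fit_exponential(ts, ys, a=1.0, b=0.3, lam=1e-2, n_iter=100):
    """Minimize the sum of squared residuals with damped Gauss-Newton steps
    (a basic Levenberg-Marquardt scheme for two parameters)."""
    r = residuals(a, b, ts, ys)
    cost = sum(v * v for v in r)
    for _ in range(n_iter):
        # Jacobian rows: (d residual / d a, d residual / d b).
        jac = [(math.exp(-b * t), -a * t * math.exp(-b * t)) for t in ts]
        g0 = sum(j[0] * v for j, v in zip(jac, r))
        g1 = sum(j[1] * v for j, v in zip(jac, r))
        h00 = sum(j[0] * j[0] for j in jac) + lam
        h01 = sum(j[0] * j[1] for j in jac)
        h11 = sum(j[1] * j[1] for j in jac) + lam
        # Solve the 2x2 system (J^T J + lam * I) step = -J^T r directly.
        det = h00 * h11 - h01 * h01
        da = (-g0 * h11 + g1 * h01) / det
        db = (g0 * h01 - g1 * h00) / det
        new_r = residuals(a + da, b + db, ts, ys)
        new_cost = sum(v * v for v in new_r)
        if new_cost < cost:
            # Accept the step and trust the quadratic model more.
            a, b, r, cost = a + da, b + db, new_r, new_cost
            lam *= 0.1
        else:
            # Reject the step and increase the damping.
            lam *= 10.0
    return a, b, cost


# Synthetic, noise-free data generated from a = 2.0, b = 0.7.
ts = [0.5 * i for i in range(10)]
ys = [2.0 * math.exp(-0.7 * t) for t in ts]
a_hat, b_hat, final_cost = fit_exponential(ts, ys)
print(a_hat, b_hat, final_cost)
```

The accept/reject rule keeps the cost from increasing, which makes the iteration less sensitive to a poor starting point than an undamped Gauss-Newton step would be. Bound constraints, which the abstract mentions, are not handled in this sketch.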
In this study, we explore the use of echelon analysis and its software, EcheScan, for spatial lattice data. EcheScan is a web application for echelon analysis, developed in the R language with Shiny Server and used through a web browser. The echelon technique analyzes the topological structure of spatial lattice data. The echelon tree provides a dendrogram representation. Regional features, such as the hierarchical structure of spatial data and hotspot clusters, are shown in an echelon dendrogram. In addition, we introduce the concept of an echelon in terms of the values and neighbors of lattice data. We also explain the use of EcheScan for one- and two-dimensional regular lattice data. Furthermore, coronavirus disease 2019 death data for the 50 US states are analyzed with EcheScan as an example of geospatial lattice data.
Title: Echelon analysis and its software for spatial lattice data. Authors: K. Kurihara, Fumio Ishioka. Journal: Wiley Interdisciplinary Reviews-Computational Statistics. Pub Date: 2022-03-12. DOI: 10.1002/wics.1579