Copulae: An overview and recent developments
Joshua Größer, Ostap Okhrin
Pub Date: 2021-04-03. DOI: 10.1002/wics.1557
In the decades since their introduction, copulae have remained a powerful tool for modeling and estimating multivariate distributions. This work gives an overview of copula theory and summarizes the latest results. It recalls the basic definition and the most important bivariate copulae, then sketches how multivariate copulae are constructed, both from bivariate building blocks and from scratch. In higher dimensions, the focus is on hierarchical Archimedean, vine, and factor copulae, which are the most widely used and most flexible ways to introduce copulae into multivariate distributions. We also provide an overview of how copulae are used across data science, including recent results in fields such as time series analysis and machine learning. Finally, we describe estimation and testing methods for copulae in general, their application to the copula structures presented, and specific testing and estimation procedures for those structures.
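
For readers who want the basic definition in front of them, here is the standard statement of Sklar's theorem together with one classic bivariate Archimedean family; this is textbook material, not reproduced from the article.

```latex
% Sklar's theorem: every joint distribution admits a copula representation.
\[
  F(x_1,\dots,x_d) \;=\; C\bigl(F_1(x_1),\dots,F_d(x_d)\bigr),
\]
% where C : [0,1]^d \to [0,1] is a copula (a distribution with uniform
% margins), unique on the product of the ranges of the margins F_1,\dots,F_d.
% A classic bivariate Archimedean example is the Clayton copula,
\[
  C_\theta(u,v) \;=\; \bigl(u^{-\theta} + v^{-\theta} - 1\bigr)^{-1/\theta},
  \qquad \theta > 0,
\]
% whose lower-tail dependence makes it a common building block in risk modeling.
```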

Differential Network Analysis: A Statistical Perspective
Ali Shojaie
Pub Date: 2021-03-01 (Epub 2020-04-06). DOI: 10.1002/wics.1508
Networks effectively capture interactions among components of complex systems and have thus become a mainstay in many scientific disciplines. Growing evidence, especially from biology, suggests that networks undergo changes over time and in response to external stimuli. In biology and medicine, these changes have been found to be predictive of complex diseases. They have also been used to gain insight into mechanisms of disease initiation and progression. Primarily motivated by biological applications, this article provides a review of recent statistical machine learning methods for inferring networks and identifying changes in their structures.
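
As a rough illustration of the simplest approach in this family, one can fit a sparse Gaussian graphical model to each condition and compare the estimated precision matrices. The sketch below uses scikit-learn's GraphicalLasso on placeholder data with an illustrative penalty and an arbitrary threshold; it is not one of the dedicated differential estimators the review covers.

```python
# Naive differential network: fit a graphical lasso per condition and
# compare the estimated precision (inverse covariance) matrices.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 10))  # condition 1 (placeholder data)
X2 = rng.normal(size=(200, 10))  # condition 2 (placeholder data)

g1 = GraphicalLasso(alpha=0.1).fit(X1)  # alpha: sparsity penalty (illustrative)
g2 = GraphicalLasso(alpha=0.1).fit(X2)

diff = g1.precision_ - g2.precision_    # entrywise change in conditional dependence
changed = np.abs(diff) > 0.05           # crude threshold for a "changed" edge
np.fill_diagonal(changed, False)
print("edges flagged as changed:", np.argwhere(changed))
```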

Why BDeu? Regular Bayesian network structure learning with discrete and continuous variables
J. Suzuki
Pub Date: 2021-03-01. DOI: 10.1002/wics.1554
We consider the problem of Bayesian network structure learning (BNSL) from data. In particular, we focus on the score-based rather than the constraint-based approach and address which score should be used for this purpose. The Bayesian Dirichlet equivalent uniform (BDeu) score has mainly been used within the Bayesian network community (and rarely outside it). In any model selection problem, a score must balance two factors: the better a model fits the data, the more complex the model tends to be, and vice versa. Recently, however, it was proven that BDeu violates regularity, meaning that it does not balance these two factors, although it behaves satisfactorily (consistently) as the sample size grows to infinity. In addition, we argue that a merit of the regular scores over BDeu is that tighter pruning bounds are available for efficient BNSL. Finally, we compare the performance of the procedures experimentally to examine this claim. (This paper is a review and gives a unified viewpoint on recent progress on the topic.)
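
For concreteness, here is a minimal implementation of the standard BDeu local score, that is, the log marginal likelihood of one discrete node given its parents; the formula is textbook material rather than code from the paper, and the equivalent sample size of 1.0 is an illustrative default.

```python
# BDeu local score (log marginal likelihood) for one discrete node.
# counts[j, k]: number of records with parent configuration j and child state k.
import numpy as np
from scipy.special import gammaln

def bdeu_local_score(counts: np.ndarray, ess: float = 1.0) -> float:
    q, r = counts.shape        # q parent configurations, r child states
    a_j = ess / q              # Dirichlet mass per parent configuration
    a_jk = ess / (q * r)       # Dirichlet mass per (configuration, state) cell
    n_j = counts.sum(axis=1)
    score = np.sum(gammaln(a_j) - gammaln(a_j + n_j))
    score += np.sum(gammaln(a_jk + counts) - gammaln(a_jk))
    return float(score)

# toy counts: 2 parent configurations, 3 child states
print(bdeu_local_score(np.array([[10, 2, 1], [3, 8, 4]])))
```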

Challenges and opportunities beyond structured data in analysis of electronic health records
Maryam Tayefi, Phuong D. Ngo, T. Chomutare, H. Dalianis, Elisa Salvi, A. Budrionis, F. Godtliebsen
Pub Date: 2021-02-14. DOI: 10.1002/wics.1549
Electronic health records (EHRs) contain a wealth of valuable information about individual patients and whole populations. Besides structured data, unstructured data in EHRs can provide extra valuable information, but the analytic processes are complex, time-consuming, and often require excessive manual effort. Among unstructured data, clinical text and images are the two most common and important sources of information. Advanced statistical algorithms from natural language processing, machine learning, deep learning, and radiomics are increasingly used to analyze clinical text and images. Although many challenges remain that can hinder the use of unstructured data, there are clear opportunities for well-designed diagnosis and decision support tools that efficiently combine structured and unstructured data to extract useful information and deliver better outcomes. However, access to clinical data is still heavily restricted due to data sensitivity and ethical issues. Data quality is another important challenge, requiring methods to improve data completeness, conformity, and plausibility. Further, generalizing and explaining the results of machine learning models remain open challenges for healthcare. Possible routes to better data quality and accessibility of unstructured data include machine learning methods that generate clinically relevant synthetic data and further research on privacy-preserving techniques such as deidentification and pseudonymization of clinical text.
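
As a toy illustration of the pseudonymization of clinical text mentioned above, the sketch below swaps a few obvious surface patterns for placeholder tags. The patterns and tags are illustrative assumptions; real deidentification systems of the kind discussed here combine learned taggers with far richer rule sets.

```python
# Toy rule-based pseudonymization of clinical free text.
# This only catches a few obvious surface patterns; production systems
# combine machine-learned entity taggers with curated rules.
import re

PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "<DATE>"),       # ISO-style dates
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<ID>"),         # SSN-like identifiers
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "Dr. <NAME>"),  # clinician names
]

def pseudonymize(note: str) -> str:
    for pattern, tag in PATTERNS:
        note = pattern.sub(tag, note)
    return note

print(pseudonymize("Seen by Dr. Smith on 2021-02-14; ID 123-45-6789."))
```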

An introduction to persistent homology for time series
N. Ravishanker, Renjie Chen
Pub Date: 2021-02-04. DOI: 10.1002/wics.1548
Topological data analysis (TDA) uses information from topological structures in complex data for statistical analysis and learning. This paper discusses persistent homology, a part of computational (algorithmic) topology that converts data into simplicial complexes and elicits information about the persistence of homology classes in the data. It computes and outputs the birth and death of such topologies via a persistence diagram. Data inputs for persistent homology are usually represented as point clouds or as functions, while the outputs depend on the nature of the analysis and commonly consist of persistence diagrams or persistence landscapes. This paper gives an introductory-level tutorial on computing these summaries for time series using R, followed by an overview of using these approaches for time series classification and clustering.
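
The tutorial itself works in R; as a rough Python analogue (assuming the third-party ripser package is installed), the sketch below converts a time series into a point cloud via a delay (Takens) embedding and computes its persistence diagrams. The embedding dimension and delay are illustrative choices.

```python
# Delay (Takens) embedding of a time series followed by persistent homology.
import numpy as np
from ripser import ripser  # third-party package 'ripser' (assumed available)

def delay_embed(x: np.ndarray, dim: int = 2, tau: int = 10) -> np.ndarray:
    """Map a 1-d series to a point cloud of dim-dimensional delay vectors."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 8 * np.pi, 400)
x = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)

cloud = delay_embed(x)        # a periodic signal traces a loop in the cloud
dgms = ripser(cloud)["dgms"]  # [H0 diagram, H1 diagram]
h1 = dgms[1]                  # rows are (birth, death) of 1-cycles
print("most persistent loop:", h1[np.argmax(h1[:, 1] - h1[:, 0])])
```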

Detecting clusters in multivariate response regression
Bradley S. Price, Corban Allenbrand, Ben Sherwood
Pub Date: 2021-02-03. DOI: 10.1002/wics.1551
Multivariate regression, which can also be posed as a multitask machine learning problem, is used to better understand multiple outputs based on a given set of inputs. Many methods have been proposed for utilizing shared information about responses, with applications in fields such as economics, genomics, advanced manufacturing, and precision medicine. Interest in these areas, coupled with the rise of large data sets ("big data"), has generated interest in making the computations more efficient and in developing methods that account for the heterogeneity that may exist between responses. One way to exploit this heterogeneity is to use methods that detect groups, also called clusters, of related responses. These methods provide a framework that can increase computational speed and account for the complexity of relationships among a large number of responses. With this flexibility come additional challenges, such as identifying the clusters of responses, selecting models, and developing more complex algorithms that combine concepts from both the supervised and unsupervised learning literatures. We explore current state-of-the-art methods, present a framework for better understanding methods that utilize or detect clusters of responses, and provide insights on the computational challenges associated with this framework. Specifically, we present a simulation study that illustrates the challenges of model selection when detecting clusters of responses. We also comment on extensions and open problems that are of interest to both the research and practitioner communities.
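
One simple instance of this idea is a two-step heuristic: fit a separate sparse regression for each response, then cluster the estimated coefficient vectors to group related responses. The sketch below (scikit-learn Lasso and KMeans on simulated data, with illustrative settings) is only a caricature of the joint estimation methods reviewed here.

```python
# Two-step heuristic for detecting clusters of related responses:
# per-response lasso fits, then k-means on the coefficient vectors.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, p, m = 200, 15, 8                 # samples, predictors, responses
X = rng.normal(size=(n, p))
B = np.zeros((p, m))
B[:3, :4] = 2.0                      # responses 0-3 share one signal pattern
B[3:6, 4:] = -2.0                    # responses 4-7 share another
Y = X @ B + rng.normal(size=(n, m))

coefs = np.array([Lasso(alpha=0.1).fit(X, Y[:, j]).coef_ for j in range(m)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)
print("response cluster labels:", labels)  # expect {0,1,2,3} vs {4,5,6,7}
```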

From object detection to text detection and recognition: A brief evolution history of optical character recognition
Haifeng Wang, Chang Pan, Xiao Guo, Chun Ji, Ke Deng
Pub Date: 2021-01-25. DOI: 10.1002/wics.1547
Text detection and recognition, also known as optical character recognition (OCR), is an active, rapidly developing research area with many exciting applications. Deep-learning-based methods represent the state of the art in this area. However, these methods are largely deterministic: they give a deterministic output for each input. For both statisticians and general users, methods supporting uncertainty inference are of great appeal, leaving rich research opportunities for combining statistical models and methods with the established deep-learning-based approaches. In this paper, we provide a comprehensive review of the evolution of OCR research, with discussions of the statistical insights behind these developments and potential directions for enhancing current methods with statistical approaches. We hope this article can serve as a useful guidebook for statisticians seeking a path toward cutting-edge research in this exciting area.
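
To make the determinism point concrete: recognizers typically emit per-character scores (logits), from which a crude uncertainty proxy such as predictive entropy can be computed. The sketch below uses made-up logits and is a generic illustration, not part of any OCR system discussed in the paper.

```python
# Predictive entropy as a simple per-character uncertainty score.
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(logits: np.ndarray) -> np.ndarray:
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

# made-up logits for 3 recognized characters over a 5-symbol alphabet
logits = np.array([[9.0, 0.1, 0.1, 0.1, 0.1],   # confident prediction
                   [2.0, 1.8, 0.2, 0.1, 0.1],   # ambiguous between two symbols
                   [1.0, 1.0, 1.0, 1.0, 1.0]])  # maximally uncertain
print(predictive_entropy(logits))               # low, medium, high entropy
```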

Ordinal regression: A review and a taxonomy of models
G. Tutz
Pub Date: 2021-01-11. DOI: 10.1002/wics.1545
Ordinal models can be seen as being composed of simpler, in particular binary, models. This view allows one to derive a taxonomy of models that includes basic ordinal regression models, models with more complex parameterizations, the class of hierarchically structured models, and the more recently developed finite mixture models. The structured overview given here covers existing models and shows how they can be extended to account for further effects of explanatory variables. Particular attention is given to the modeling of additional heterogeneity, for example, dispersion effects. The modeling is embedded in the framework of response styles, and the exact meaning of heterogeneity terms in ordinal models is investigated. It is shown that the meaning of these terms is crucially determined by the type of model used. Moreover, it is demonstrated how models with a complex category-specific effect structure can be simplified to obtain models that fit sufficiently well. The fitting of models is illustrated using a real data set, and a short overview of existing software is given.
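
The "composed of binary models" view can be written down directly for the cumulative (proportional odds) model, a standard formulation that is not specific to this article:

```latex
% Cumulative model for an ordinal response Y in {1,...,k} with covariates x:
\[
  P(Y \le r \mid x) \;=\; F\bigl(\theta_r - x^{\top}\beta\bigr),
  \qquad r = 1,\dots,k-1,
\]
% where F is a cdf (logistic for the proportional odds model) and the
% thresholds are ordered, \theta_1 < \dots < \theta_{k-1}.
% Each r defines the binary model "Y <= r versus Y > r"; the ordinal model
% couples these k-1 binary models through the shared slope \beta.
% Category-specific effects relax this by replacing \beta with \beta_r.
```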

Improving the Gibbs sampler
Taeyoung Park, Seunghan Lee
Pub Date: 2021-01-07. DOI: 10.1002/wics.1546
The Gibbs sampler is a simple but very powerful algorithm for simulating from a complex high-dimensional distribution. It is particularly useful in Bayesian analysis when a complex Bayesian model involves many parameters and the conditional posterior distribution of each component given the others can be derived as a standard distribution. In the presence of a strong correlation structure among components, however, the Gibbs sampler can be criticized for its slow convergence. Here we discuss several algorithmic strategies, such as blocking, collapsing, and partial collapsing, that are available for improving the convergence characteristics of the Gibbs sampler.
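
The slow-convergence point is easy to see on a toy target: for a bivariate normal with strong correlation, componentwise Gibbs updates creep along in small conditional steps, whereas updating both components as a single block (which for this toy target degenerates to i.i.d. joint draws) removes the autocorrelation entirely. A minimal sketch, with the correlation and chain length as illustrative choices:

```python
# Componentwise vs. blocked Gibbs sampling for a bivariate normal target
# with zero means, unit variances, and correlation rho.
import numpy as np

rng = np.random.default_rng(0)
rho, n_iter = 0.99, 5000
cond_sd = np.sqrt(1.0 - rho**2)       # sd of each full conditional

# componentwise Gibbs: alternate draws from the full conditionals
x = np.zeros(2)
chain = np.empty((n_iter, 2))
for t in range(n_iter):
    x[0] = rng.normal(rho * x[1], cond_sd)  # x1 | x2 ~ N(rho*x2, 1-rho^2)
    x[1] = rng.normal(rho * x[0], cond_sd)  # x2 | x1 ~ N(rho*x1, 1-rho^2)
    chain[t] = x

# blocked update: draw (x1, x2) jointly; with the whole state in one block,
# successive draws are independent for this target.
cov = np.array([[1.0, rho], [rho, 1.0]])
blocked = rng.multivariate_normal(np.zeros(2), cov, size=n_iter)

def lag1_autocorr(z: np.ndarray) -> float:
    z = z - z.mean()
    return float(z[:-1] @ z[1:] / (z @ z))

print("lag-1 autocorrelation, componentwise:", lag1_autocorr(chain[:, 0]))
print("lag-1 autocorrelation, blocked:      ", lag1_autocorr(blocked[:, 0]))
```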