Coronavirus disease 2019 (COVID-19) has caused violent fluctuations in stock markets and led to heated discussion in stock forums. The rise and fall of any specific stock is influenced by many other stocks and by the emotions expressed in forum discussions. Considering the transmission effect of emotions, we propose a new Textual Multiple Auto Regressive Moving Average (TM-ARMA) model to study the impact of COVID-19 on the Chinese stock market. The TM-ARMA model contains a new cross-textual term and a new cross-autoregressive (AR) term that measure the cross impacts of textual emotions and price fluctuations, respectively, and the adjacency matrix that measures the relationships among stocks is updated dynamically. We compute the textual sentiment scores with an emotion dictionary-based method and estimate the parameter matrices by maximum likelihood. Our dataset includes the textual posts from the Eastmoney Stock Forum and the price data for the constituent stocks of the FTSE China A50 Index. We adopt a sliding-window online forecasting approach to simulate real-world trading. The results show that TM-ARMA performs very well even after the onset of COVID-19.
{"title":"A study of the impact of COVID-19 on the Chinese stock market based on a new textual multiple ARMA model.","authors":"Weijun Xu, Zhineng Fu, Hongyi Li, Jinglong Huang, Weidong Xu, Yiyang Luo","doi":"10.1002/sam.11582","DOIUrl":"10.1002/sam.11582","url":null,"abstract":"<p><p>Coronavirus 2019 (COVID-19) has caused violent fluctuation in stock markets, and led to heated discussion in stock forums. The rise and fall of any specific stock is influenced by many other stocks and emotions expressed in forum discussions. Considering the transmission effect of emotions, we propose a new Textual Multiple Auto Regressive Moving Average (TM-ARMA) model to study the impact of COVID-19 on the Chinese stock market. The TM-ARMA model contains a new cross-textual term and a new cross-auto regressive (AR) term that measure the cross impacts of textual emotions and price fluctuations, respectively, and the adjacent matrix which measures the relationships among stocks is updated dynamically. We compute the textual sentiment scores by an emotion dictionary-based method, and estimate the parameter matrices by a maximum likelihood method. Our dataset includes the textual posts from the Eastmoney Stock Forum and the price data for the constituent stocks of the FTSE China A50 Index. We conduct a sliding-window online forecast approach to simulate the real-trading situations. The results show that TM-ARMA performs very well even after the attack of COVID-19.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"1 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2022-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9111149/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44032709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations of causal assumptions. However, prediction performance does depend on the selection of training and test sets. In particular, biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on the genetic perturbation data set of Kemmeren [5]. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association-based estimators such as the Lasso. Finally, we compare the performance of causal estimators in simulation studies that reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use the Kemmeren data set.
{"title":"Sample Selection Bias in Evaluation of Prediction Performance of Causal Models.","authors":"James P Long, Min Jin Ha","doi":"10.1002/sam.11559","DOIUrl":"https://doi.org/10.1002/sam.11559","url":null,"abstract":"<p><p>Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However prediction performance does depend on the selection of training and test sets. In particular biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren [5]. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association based estimators such as Lasso. Finally we compare the performance of causal estimators in simulation studies which reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"15 1","pages":"5-14"},"PeriodicalIF":1.3,"publicationDate":"2022-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9053600/pdf/nihms-1746637.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10589307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-04-01. Epub Date: 2021-01-08. DOI: 10.1002/sam.11495.
Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne, Han Yu
Graphs can be used to represent the direct and indirect relationships between variables, and to elucidate complex relationships and interdependencies. Detecting structure within a graph is a challenging problem. This problem is studied over a range of fields and is sometimes termed community detection, module detection, or graph partitioning. A popular class of algorithms for module detection relies on optimizing a function of modularity to identify the structure. In practice, graphs are often learned from the data and are thus subject to uncertainty. In these settings, uncertainty in the network structure can propagate to the module structure, yielding unreliable estimates of the modules. In this work, we begin to address this challenge through a nonparametric bootstrap approach to assessing the stability of module detection in a graph. Estimates of stability are presented at the level of the individual node, at the level of the inferred modules, and as an overall measure of performance for module detection in a given graph. Furthermore, bootstrap stability estimates are derived for the complexity parameter whose selection ultimately defines a graph from data, in a way that optimizes stability. The approach is developed for correlation graphs but generalizes to other graphs defined through dissimilarity measures. We demonstrate our approach using a broad range of simulations and a metabolomics dataset from the Beijing Olympics Air Pollution study. These methods are implemented in the bootcluster package for the R programming language.
{"title":"A framework for stability-based module detection in correlation graphs.","authors":"Mingmei Tian, Rachael Hageman Blair, Lina Mu, Matthew Bonner, Richard Browne, Han Yu","doi":"10.1002/sam.11495","DOIUrl":"10.1002/sam.11495","url":null,"abstract":"<p><p>Graphs can be used to represent the direct and indirect relationships between variables, and elucidate complex relationships and interdependencies. Detecting structure within a graph is a challenging problem. This problem is studied over a range of fields and is sometimes termed community detection, module detection, or graph partitioning. A popular class of algorithms for module detection relies on optimizing a function of modularity to identify the structure. In practice, graphs are often learned from the data, and thus prone to uncertainty. In these settings, the uncertainty of the network structure can become exaggerated by giving unreliable estimates of the module structure. In this work, we begin to address this challenge through the use of a nonparametric bootstrap approach to assessing the <i>stability</i> of module detection in a graph. Estimates of stability are presented at the level of the individual node, the inferred modules, and as an overall measure of performance for module detection in a given graph. Furthermore, bootstrap stability estimates are derived for complexity parameter selection that ultimately defines a graph from data in a way that optimizes stability. This approach is utilized in connection with correlation graphs but is generalizable to other graphs that are defined through the use of dissimilarity measures. We demonstrate our approach using a broad range of simulations and on a metabolomics dataset from the Beijing Olympics Air Pollution study. These approaches are implemented using bootcluster package that is available in the R programming language.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"14 2","pages":"129-143"},"PeriodicalIF":1.3,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7986843/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25525965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-04-01. Epub Date: 2021-02-05. DOI: 10.1002/sam.11498.
Alejandro Mantero, Hemant Ishwaran
sidClustering is a new unsupervised machine learning algorithm based on random forests. The first step in sidClustering involves what is called sidification of the features: staggering the features to have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). A multivariate random forest (able to handle both continuous and categorical variables) is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi-institutional study of esophageal cancer and the other involving hospital charges for cardiovascular patients.
{"title":"Unsupervised random forests.","authors":"Alejandro Mantero, Hemant Ishwaran","doi":"10.1002/sam.11498","DOIUrl":"https://doi.org/10.1002/sam.11498","url":null,"abstract":"<p><p>sidClustering is a new random forests unsupervised machine learning algorithm. The first step in sidClustering involves what is called sidification of the features: staggering the features to have mutually exclusive ranges (called the staggered interaction data [SID] main features) and then forming all pairwise interactions (called the SID interaction features). Then a multivariate random forest (able to handle both continuous and categorical variables) is used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in depth case studies, one from a large multi-institutional study of esophageal cancer, and the other involving hospital charges for cardiovascular patients.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"14 2","pages":"144-167"},"PeriodicalIF":1.3,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11498","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25583970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-02-01. Epub Date: 2020-11-24. DOI: 10.1002/sam.11488.
Amy M Crawford, Nicholas S Berry, Alicia L Carriquiry
Handwritten documents can be characterized by their content or by the shape of the written characters. We focus on the problem of comparing a person's handwriting to a document of unknown provenance using the shape of the writing, as is done in forensic applications. To do so, we first propose a method for processing scanned handwritten documents to decompose the writing into small graphical structures, often corresponding to letters. We then introduce a measure of distance between two such structures that is inspired by the graph edit distance, and a measure of center for a collection of the graphs. These measurements are the basis for an outlier-tolerant K-means algorithm to cluster the graphs based on structural attributes, thus creating a template for sorting new documents. Finally, we present a Bayesian hierarchical model to capture the propensity of a writer for producing graphs that are assigned to certain clusters. We illustrate the methods using documents from the Computer Vision Lab dataset. We show results of the identification task under the cluster assignments and compare to the same modeling, but with a less flexible grouping method that is not tolerant of incidental strokes or outliers.
{"title":"A clustering method for graphical handwriting components and statistical writership analysis.","authors":"Amy M Crawford, Nicholas S Berry, Alicia L Carriquiry","doi":"10.1002/sam.11488","DOIUrl":"10.1002/sam.11488","url":null,"abstract":"<p><p>Handwritten documents can be characterized by their content or by the shape of the written characters. We focus on the problem of comparing a person's handwriting to a document of unknown provenance using the shape of the writing, as is done in forensic applications. To do so, we first propose a method for processing scanned handwritten documents to decompose the writing into small graphical structures, often corresponding to letters. We then introduce a measure of distance between two such structures that is inspired by the graph edit distance, and a measure of center for a collection of the graphs. These measurements are the basis for an outlier tolerant <i>K</i>-means algorithm to cluster the graphs based on structural attributes, thus creating a template for sorting new documents. Finally, we present a Bayesian hierarchical model to capture the propensity of a writer for producing graphs that are assigned to certain clusters. We illustrate the methods using documents from the Computer Vision Lab dataset. We show results of the identification task under the cluster assignments and compare to the same modeling, but with a less flexible grouping method that is not tolerant of incidental strokes or outliers.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"14 1","pages":"41-60"},"PeriodicalIF":1.3,"publicationDate":"2021-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7894190/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25432031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2021-02-01. Epub Date: 2020-10-21. DOI: 10.1002/sam.11483.
Junghi Kim, Hongtu Zhu, Xiao Wang, Kim-Anh Do
With the advent of high-throughput sequencing, an efficient computing strategy is required to deal with large genomic data sets. The challenge of estimating a large precision matrix has garnered substantial research attention for its direct application to discriminant analyses and graphical models. Most existing methods either use a lasso-type penalty that may lead to biased estimators or are computationally intensive, which prevents their application to very large graphs. We propose using an L0 penalty to estimate an ultra-large precision matrix (scalnetL0). We apply scalnetL0 to RNA-seq data from breast cancer patients represented in The Cancer Genome Atlas and find improved classification accuracy for survival times. The estimated precision matrix provides information about a large-scale co-expression network in breast cancer. Simulation studies demonstrate that scalnetL0 provides more accurate and efficient estimators, yielding shorter CPU time and smaller Frobenius loss in sparse learning for large-scale precision matrix estimation.
{"title":"Scalable network estimation with <i>L</i> <sub>0</sub> penalty.","authors":"Junghi Kim, Hongtu Zhu, Xiao Wang, Kim-Anh Do","doi":"10.1002/sam.11483","DOIUrl":"https://doi.org/10.1002/sam.11483","url":null,"abstract":"<p><p>With the advent of high-throughput sequencing, an efficient computing strategy is required to deal with large genomic data sets. The challenge of estimating a large precision matrix has garnered substantial research attention for its direct application to discriminant analyses and graphical models. Most existing methods either use a lasso-type penalty that may lead to biased estimators or are computationally intensive, which prevents their applications to very large graphs. We propose using an <i>L</i> <sub>0</sub> penalty to estimate an ultra-large precision matrix (scalnetL0). We apply scalnetL0 to RNA-seq data from breast cancer patients represented in The Cancer Genome Atlas and find improved accuracy of classifications for survival times. The estimated precision matrix provides information about a large-scale co-expression network in breast cancer. Simulation studies demonstrate that scalnetL0 provides more accurate and efficient estimators, yielding shorter CPU time and less Frobenius loss on sparse learning for large-scale precision matrix estimation.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"14 1","pages":"18-30"},"PeriodicalIF":1.3,"publicationDate":"2021-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11483","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39933742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-08-01. Epub Date: 2020-04-20. DOI: 10.1002/sam.11459.
Nathaniel Garton, Jarad Niemi, Alicia Carriquiry
Sparse, knot-based Gaussian processes have enjoyed considerable success as scalable approximations of full Gaussian processes. Certain sparse models can be derived through specific variational approximations to the true posterior, and knots can be selected to minimize the Kullback-Leibler divergence between the approximate and true posterior. While this has been a successful approach, simultaneous optimization of knots can be slow due to the number of parameters being optimized. Furthermore, few methods have been proposed for selecting the number of knots, and no experimental results exist in the literature. We propose a one-at-a-time knot selection algorithm based on Bayesian optimization to select both the number and the locations of knots. We showcase the competitive performance of this method relative to simultaneous optimization of the knots on three benchmark datasets, achieved at a fraction of the computational cost.
{"title":"Knot selection in sparse Gaussian processes with a variational objective function.","authors":"Nathaniel Garton, Jarad Niemi, Alicia Carriquiry","doi":"10.1002/sam.11459","DOIUrl":"10.1002/sam.11459","url":null,"abstract":"<p><p>Sparse, knot-based Gaussian processes have enjoyed considerable success as scalable approximations of full Gaussian processes. Certain sparse models can be derived through specific variational approximations to the true posterior, and knots can be selected to minimize the Kullback-Leibler divergence between the approximate and true posterior. While this has been a successful approach, simultaneous optimization of knots can be slow due to the number of parameters being optimized. Furthermore, there have been few proposed methods for selecting the number of knots, and no experimental results exist in the literature. We propose using a one-at-a-time knot selection algorithm based on Bayesian optimization to select the number and locations of knots. We showcase the competitive performance of this method relative to optimization of knots simultaneously on three benchmark datasets, but at a fraction of the computational cost.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"13 4","pages":"324-336"},"PeriodicalIF":1.3,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7386924/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38227720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-04-01. Epub Date: 2020-02-21. DOI: 10.1002/sam.11449.
Soyoung Park, Alicia Carriquiry
Footwear examiners are tasked with comparing an outsole impression (Q) left at a crime scene with an impression (K) from a database or from the suspect's shoe. We propose a method for comparing two shoe outsole impressions that relies on robust features (speeded-up robust features; SURF) on each impression and aligns them using a maximum clique (MC). After alignment, an algorithm we denote MC-COMP is used to extract additional features that are then combined into a univariate similarity score using a random forest (RF). We use a database of shoe outsole impressions that includes images from two models of athletic shoes that were purchased new and then worn by study participants for about 6 months. The shoes share class characteristics such as outsole pattern and size, and thus the comparison is challenging. We find that the RF implemented on SURF outperforms other methods recently proposed in the literature in terms of classification precision. In more realistic scenarios where crime scene impressions may be degraded and smudged, the algorithm we propose, denoted MC-COMP-SURF, shows the best classification performance by detecting unique features better than other methods. The algorithm is implemented in the R package shoeprintr.
{"title":"An algorithm to compare two-dimensional footwear outsole images using maximum cliques and speeded-up robust feature.","authors":"Soyoung Park, Alicia Carriquiry","doi":"10.1002/sam.11449","DOIUrl":"10.1002/sam.11449","url":null,"abstract":"<p><p>Footwear examiners are tasked with comparing an outsole impression (<i>Q</i>) left at a crime scene with an impression (<i>K</i>) from a database or from the suspect's shoe. We propose a method for comparing two shoe outsole impressions that relies on robust features (speeded-up robust feature; SURF) on each impression and aligns them using a maximum clique (MC). After alignment, an algorithm we denote MC-COMP is used to extract additional features that are then combined into a univariate similarity score using a random forest (RF). We use a database of shoe outsole impressions that includes images from two models of athletic shoes that were purchased new and then worn by study participants for about 6 months. The shoes share class characteristics such as outsole pattern and size, and thus the comparison is challenging. We find that the RF implemented on SURF outperforms other methods recently proposed in the literature in terms of classification precision. In more realistic scenarios where crime scene impressions may be degraded and smudged, the algorithm we propose-denoted MC-COMP-SURF-shows the best classification performance by detecting unique features better than other methods. The algorithm can be implemented with the R-package shoeprintr.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"13 2","pages":"188-199"},"PeriodicalIF":1.3,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7079556/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37774263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable, scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial processes that is practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages. Key insights are offered regarding assumptions and approximations concerning practical efficiency.
{"title":"Practical Bayesian Modeling and Inference for Massive Spatial Datasets On Modest Computing Environments.","authors":"Lu Zhang, Abhirup Datta, Sudipto Banerjee","doi":"10.1002/sam.11413","DOIUrl":"https://doi.org/10.1002/sam.11413","url":null,"abstract":"<p><p>With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial datasets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial datasets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles present in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial process that are practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (e.g., a standard desktop or laptop) using easily available statistical software packages. Key insights are offered regarding assumptions and approximations concerning practical efficiency.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"12 3","pages":"197-209"},"PeriodicalIF":1.3,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11413","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10297504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-10-01. Epub Date: 2018-07-11. DOI: 10.1002/sam.11382.
Donghyeon Yu, Sang Han Lee, Johan Lim, Guanghua Xiao, R Cameron Craddock, Bharat B Biswal
In this paper, we propose a procedure to find differential edges between two graphs estimated from high-dimensional data. We estimate two matrices of partial correlations and their differences by solving a penalized regression problem. We assume sparsity only on the differences between the two graphs, not on the graphs themselves. Thus, we impose an ℓ2 penalty on the partial correlations and an ℓ1 penalty on their differences in the penalized regression problem. We apply the proposed procedure to finding differential functional connectivity between healthy individuals and Alzheimer's disease patients.
{"title":"Fused Lasso Regression for Identifying Differential Correlations in Brain Connectome Graphs.","authors":"Donghyeon Yu, Sang Han Lee, Johan Lim, Guanghua Xiao, R Cameron Craddock, Bharat B Biswal","doi":"10.1002/sam.11382","DOIUrl":"https://doi.org/10.1002/sam.11382","url":null,"abstract":"<p><p>In this paper, we propose a procedure to find differential edges between two graphs from high-dimensional data. We estimate two matrices of partial correlations and their differences by solving a penalized regression problem. We assume sparsity only on differences between two graphs, not graphs themselves. Thus, we impose an <i>ℓ</i> <sub>2</sub> penalty on partial correlations and an <i>ℓ</i> <sub>1</sub> penalty on their differences in the penalized regression problem. We apply the proposed procedure to finding differential functional connectivity between healthy individuals and Alzheimer's disease patients.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"11 5","pages":"203-226"},"PeriodicalIF":1.3,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11382","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39306532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}