Wind has the potential to make a significant contribution to future energy resources. Locating the sources of this renewable energy on a global scale is, however, extremely challenging, given the difficulty of storing the very large data sets generated by modern computer models. We propose a statistical model that aims to reproduce the data-generating mechanism of an ensemble of runs via a Stochastic Generator (SG) of global annual wind data. We introduce an evolutionary spectrum approach with spatially varying parameters based on large-scale geographical descriptors such as altitude to better account for different regimes across the Earth's orography. We consider a multi-step conditional likelihood approach to estimating the parameters that explicitly accounts for nonstationary features while also balancing memory storage and distributed computation. We apply the proposed model to more than 18 million points of yearly global wind speed. The proposed SG requires orders of magnitude less storage for generating surrogate ensemble members of wind than does creating additional wind fields from the climate model, even if an effective lossy data compression algorithm is applied to the simulation output.
{"title":"Reducing Storage of Global Wind Ensembles with Stochastic Generators","authors":"J. Jeong, S. Castruccio, P. Crippa, M. Genton","doi":"10.1214/17-AOAS1105","DOIUrl":"https://doi.org/10.1214/17-AOAS1105","url":null,"abstract":"Wind has the potential to make a significant contribution to future energy resources. Locating the sources of this renewable energy on a global scale is however extremely challenging, given the difficulty to store very large data sets generated by modern computer models. We propose a statistical model that aims at reproducing the data-generating mechanism of an ensemble of runs via a Stochastic Generator (SG) of global annual wind data. We introduce an evolutionary spectrum approach with spatially varying parameters based on large-scale geographical descriptors such as altitude to better account for different regimes across the Earth's orography. We consider a multi-step conditional likelihood approach to estimate the parameters that explicitly accounts for nonstationary features while also balancing memory storage and distributed computation. We apply the proposed model to more than 18 million points of yearly global wind speed. 
The proposed SG requires orders of magnitude less storage for generating surrogate ensemble members from wind than does creating additional wind fields from the climate model, even if an effective lossy data compression algorithm is applied to the simulation output.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114849284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
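The storage argument can be caricatured in a few lines: fit a small set of per-location parameters to a toy ensemble, then draw surrogate members from those parameters instead of storing extra runs. Everything below (the per-site Gaussian model, independence across grid points, synthetic gamma "wind" values) is a gross simplification for illustration, not the authors' evolutionary-spectrum SG:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "ensemble": R runs of annual wind speed at N grid points.
# The values are synthetic stand-ins for climate-model output.
R, N = 5, 1000
ensemble = rng.gamma(shape=2.0, scale=3.0, size=(R, N))

# Training step: keep only per-location summary parameters instead of
# the full R x N array. A drastic simplification of the paper's SG:
# each location is treated as independent Gaussian, ignoring the
# spatial dependence the real model captures.
mu = ensemble.mean(axis=0)
sigma = ensemble.std(axis=0, ddof=1)

def generate_surrogate(mu, sigma, rng):
    """Draw one surrogate ensemble member from the fitted parameters."""
    return rng.normal(mu, sigma)

surrogate = generate_surrogate(mu, sigma, rng)

# Storage comparison: 2N fitted numbers vs R*N raw values; the gap
# widens with every additional surrogate member requested.
stored_values = mu.size + sigma.size
raw_values = ensemble.size
```

Each extra surrogate costs only a fresh draw, whereas each extra climate-model member costs another N stored values.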
L. Johnson, R. Gramacy, Jeremy M. Cohen, E. Mordecai, C. Murdock, Jason Rohr, S. Ryan, Anna M. Stewart-Ibarra, Daniel P. Weikel
In 2015 the US federal government sponsored a dengue forecasting competition using historical case data from Iquitos, Peru and San Juan, Puerto Rico. Competitors were evaluated on several aspects of out-of-sample forecasts, including the targets of peak week, peak incidence during that week, and total season incidence across each of several seasons. Our team was one of the top performers of that competition, outperforming all other teams in multiple targets/locales. In this paper we report on our methodology, a large component of which, surprisingly, ignores the known biology of epidemics at large---in particular relationships between dengue transmission and environmental factors---and instead relies on flexible nonparametric nonlinear Gaussian process (GP) regression fits that "memorize" the trajectories of past seasons and then "match" the dynamics of the unfolding season to past ones in real time. Our phenomenological approach has advantages in situations where disease dynamics are less well understood, e.g., at sites with shorter histories of disease (such as Iquitos), or where measurements and forecasts of ancillary covariates like precipitation are unavailable and/or where the strength of their association with cases is as yet unknown. In particular, we show that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) that we developed to utilize abundant covariate information. We illustrate variations of our method(s) on the two benchmark locales alongside a full summary of results submitted by other contest competitors.
{"title":"Phenomenological forecasting of disease incidence using heteroskedastic Gaussian processes: a dengue case study","authors":"L. Johnson, R. Gramacy, Jeremy M. Cohen, E. Mordecai, C. Murdock, Jason Rohr, S. Ryan, Anna M. Stewart-Ibarra, Daniel P. Weikel","doi":"10.1214/17-AOAS1090","DOIUrl":"https://doi.org/10.1214/17-AOAS1090","url":null,"abstract":"In 2015 the US federal government sponsored a dengue forecasting competition using historical case data from Iquitos, Peru and San Juan, Puerto Rico. Competitors were evaluated on several aspects of out-of-sample forecasts including the targets of peak week, peak incidence during that week and total season incidence across each of several seasons. Our team was one of the top performers of that competition, outperforming all other teams in multiple targets/locals. In this paper we report on our methodology, a large component of which, surprisingly, ignores the known biology of epidemics at large---in particular relationships between dengue transmission and environmental factors---and instead relies on flexible nonparametric nonlinear Gaussian process (GP) regression fits that \"memorize\" the trajectories of past seasons, and then \"match\" the dynamics of the unfolding season to past ones in real-time. Our phenomenological approach has advantages in situations where disease dynamics are less well understood, e.g., at sites with shorter histories of disease (such as Iquitos), or where measurements and forecasts of ancillary covariates like precipitation are unavailable and/or where the strength of association with cases are as yet unknown. In particular, we show that the GP approach generally outperforms a more classical generalized linear (autoregressive) model (GLM) that we developed to utilize abundant covariate information. 
We illustrate variations of our method(s) on the two benchmark locales alongside a full summary of results submitted by other contest competitors.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128942044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
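A stripped-down version of the GP idea, on synthetic data and with a homoskedastic noise term standing in for the paper's heteroskedastic fit; the three contest targets are then read off the smoothed trajectory:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=5.0, variance=2500.0):
    """Squared-exponential covariance between two 1-D input sets."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Synthetic weekly dengue incidence for one season (peak near week 30).
# Illustrative only; not the Iquitos or San Juan contest data.
rng = np.random.default_rng(1)
weeks = np.arange(1.0, 53.0)
true_curve = 50.0 * np.exp(-0.5 * ((weeks - 30.0) / 5.0) ** 2)
cases = true_curve + rng.normal(0.0, 2.0, size=weeks.size)

# Plain GP regression posterior mean with fixed observation noise
# (the paper's heteroskedastic GP lets the noise level vary over time).
noise_var = 4.0
K = rbf_kernel(weeks, weeks) + noise_var * np.eye(weeks.size)
posterior_mean = rbf_kernel(weeks, weeks) @ np.linalg.solve(K, cases)

# The contest targets, read off the smoothed trajectory.
peak_week = weeks[np.argmax(posterior_mean)]
peak_incidence = posterior_mean.max()
season_total = posterior_mean.sum()
```

In-season forecasting would condition only on the weeks observed so far and "match" against past seasons' curves, but the target extraction works the same way.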
Pub Date: 2017-01-29. DOI: 10.4310/SII.2018.V11.N3.A11
Azam Saadatjouy, A. R. Taheriyoun, M. Q. Vahidi-Asl
We consider the problem of testing the stationarity and isotropy of a spatial point pattern based on the concept of local spectra. Using a logarithmic transformation, the mechanism of the proposed test is approximately identical to a simple two-factor analysis of variance procedure when the variance of the residuals is known. This procedure is also used for testing stationarity in a neighborhood of a particular point of the observation window. The same idea is used in post-hoc tests to cluster the point pattern into stationary and nonstationary sub-windows. The performance of the proposed method is examined via a simulation study and an application to real data.
{"title":"Testing the Zonal Stationarity of Spatial Point Processes: Applied to prostate tissues and trees locations","authors":"Azam Saadatjouy, A. R. Taheriyoun, M. Q. Vahidi-Asl","doi":"10.4310/SII.2018.V11.N3.A11","DOIUrl":"https://doi.org/10.4310/SII.2018.V11.N3.A11","url":null,"abstract":"We consider the problem of testing the stationarity and isotropy of a spatial point pattern based on the concept of local spectra. Using a logarithmic transformation, the mechanism of the proposed test is approximately identical to a simple two factor analysis of variance procedure when the variance of residuals is known. This procedure is also used for testing the stationarity in neighborhood of a particular point of the window of observation. The same idea is used in post-hoc tests to cluster the point pattern into stationary and nonstationary sub-windows. The performance of the proposed method is examined via a simulation study and applied in a practical data.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133800899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We look at common problems found in data used for predictive modeling tasks, and describe how to address them with the vtreat R package. vtreat prepares real-world data for predictive modeling in a reproducible and statistically sound manner. We describe the theory of preparing variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems dealt with include: infinite values, invalid values, missing values (NA), too many categorical levels, rare categorical levels, and novel categorical levels (levels seen during application, but not during training). Of special interest are techniques needed to avoid needlessly introducing undesirable nested modeling bias (a risk when using a data preprocessor).
{"title":"vtreat: a data.frame Processor for Predictive Modeling","authors":"N. Zumel, J. Mount","doi":"10.5281/ZENODO.1173314","DOIUrl":"https://doi.org/10.5281/ZENODO.1173314","url":null,"abstract":"We look at common problems found in data that is used for predictive modeling tasks, and describe how to address them with the vtreat R package. vtreat prepares real-world data for predictive modeling in a reproducible and statistically sound manner. We describe the theory of preparing variables so that data has fewer exceptional cases, making it easier to safely use models in production. Common problems dealt with include: infinite values, invalid values, NA, too many categorical levels, rare categorical levels, and new categorical levels (levels seen during application, but not during training). Of special interest are techniques needed to avoid needlessly introducing undesirable nested modeling bias (which is a risk when using a data-preprocessor).","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120927179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We view the locations and times of a collection of crime events as a space-time point pattern. With either a nonhomogeneous Poisson process or a more general Cox process, we need to specify a space-time intensity. For the latter, we need a random intensity, which we model as a realization of a spatio-temporal log Gaussian process. Importantly, we view time as circular, not linear, necessitating valid separable and nonseparable covariance functions over a bounded spatial region crossed with circular time. In addition, crimes are classified by crime type. Furthermore, each crime event is recorded by day of the year, which we convert to day-of-the-week marks. The contribution here is to develop models that accommodate such data. Our specifications take the form of hierarchical models which we fit within a Bayesian framework. In this regard, we consider model comparison between the nonhomogeneous Poisson process and the log Gaussian Cox process. We also compare separable vs. nonseparable covariance specifications. Our motivating dataset is a collection of crime events for the city of San Francisco during the year 2012. We have location, hour, day of the year, and crime type for each event. We investigate models to enhance our understanding of this set of incidents.
{"title":"Space and circular time log Gaussian Cox processes with application to crime event data","authors":"Shinichiro Shirota, A. Gelfand","doi":"10.1214/16-AOAS960","DOIUrl":"https://doi.org/10.1214/16-AOAS960","url":null,"abstract":"We view the locations and times of a collection of crime events as a space-time point pattern. So, with either a nonhomogeneous Poisson process or with a more general Cox process, we need to specify a space-time intensity. For the latter, we need a emph{random} intensity which we model as a realization of a spatio-temporal log Gaussian process. Importantly, we view time as circular not linear, necessitating valid separable and nonseparable covariance functions over a bounded spatial region crossed with circular time. In addition, crimes are classified by crime type. Furthermore, each crime event is recorded by day of the year which we convert to day of the week marks. \u0000The contribution here is to develop models to accommodate such data. Our specifications take the form of hierarchical models which we fit within a Bayesian framework. In this regard, we consider model comparison between the nonhomogeneous Poisson process and the log Gaussian Cox process. We also compare separable vs. nonseparable covariance specifications. \u0000Our motivating dataset is a collection of crime events for the city of San Francisco during the year 2012. We have location, hour, day of the year, and crime type for each event. 
We investigate models to enhance our understanding of the set of incidences.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116141104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
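The circular-time requirement amounts to using covariance functions that are valid on the circle. A minimal sketch with the standard periodic (exponentiated-sine-squared) kernel, which wraps around at the period; the paper's separable and nonseparable space-time constructions are richer than this single temporal factor:

```python
import numpy as np

def circular_cov(t1, t2, period=24.0, lengthscale=3.0, variance=1.0):
    """Periodic (exponentiated-sine-squared) covariance in time.

    Treats time-of-day as circular: hour 23 and hour 0 are close.
    A standard example of a valid covariance on circular time.
    """
    d = np.abs(t1[:, None] - t2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2)

hours = np.arange(24.0)
K = circular_cov(hours, hours)

# Wrap-around: hour 0 is exactly as correlated with hour 23 as with
# hour 1, something a linear-time kernel cannot deliver. Validity can
# be checked via the eigenvalues of K (all nonnegative up to rounding).
```

A separable space-time covariance would multiply this temporal factor by a spatial one; nonseparable forms couple the two arguments instead.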
Pub Date: 2016-11-22. DOI: 10.1007/978-981-10-5194-4_12
Yimeng Xie, Zhongnan Jin, Yili Hong, J. V. Mullekom
{"title":"Statistical Methods for Thermal Index Estimation Based on Accelerated Destructive Degradation Test Data","authors":"Yimeng Xie, Zhongnan Jin, Yili Hong, J. V. Mullekom","doi":"10.1007/978-981-10-5194-4_12","DOIUrl":"https://doi.org/10.1007/978-981-10-5194-4_12","url":null,"abstract":"","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125869210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the distributions of earthquake numbers in two global catalogs: the Global Centroid-Moment Tensor catalog and the Preliminary Determinations of Epicenters. These distributions are required to develop the number test for forecasts of future seismic activity rate. A common assumption is that the numbers are described by the Poisson distribution. In contrast to the one-parameter Poisson distribution, the negative-binomial distribution (NBD) has two parameters; the second parameter characterizes the clustering or over-dispersion of a process. We investigate the dependence of the parameters of both distributions on the catalog magnitude threshold and on the temporal subdivision of the catalog duration. We find that for most cases of interest the Poisson distribution can be rejected statistically at a high significance level in favor of the NBD. We then investigate whether these distributions fit the observed distributions of seismicity. For this purpose we study upper statistical moments of earthquake numbers (skewness and kurtosis) and compare them to the theoretical values for both distributions. Empirical values of the skewness and kurtosis increase for smaller magnitude thresholds, and increase even more sharply for finer temporal subdivisions of the catalogs. A calculation of the NBD skewness and kurtosis shows a rapid increase of these upper moments. However, the observed catalog values of skewness and kurtosis rise even faster. This means that for small time intervals the earthquake number distribution is even more heavy-tailed than the NBD predicts. Therefore, for small time intervals we propose using empirical number distributions, appropriately smoothed, for testing forecasted earthquake numbers.
{"title":"Earthquake Number Forecasts Testing","authors":"Y. Kagan","doi":"10.1093/gji/ggx300","DOIUrl":"https://doi.org/10.1093/gji/ggx300","url":null,"abstract":"We study the distributions of earthquake numbers in two global catalogs: Global Centroid-Moment Tensor and Preliminary Determinations of Epicenters. These distributions are required to develop the number test for forecasts of future seismic activity rate. A common assumption is that the numbers are described by the Poisson distribution. In contrast to the one-parameter Poisson distribution, the negative-binomial distribution (NBD) has two parameters. The second parameter characterizes the clustering or over-dispersion of a process. We investigate the dependence of parameters for both distributions on the catalog magnitude threshold and on temporal subdivision of catalog duration. We find that for most cases of interest the Poisson distribution can be rejected statistically at a high significance level in favor of the NBD. Therefore we investigate whether these distributions fit the observed distributions of seismicity. For this purpose we study upper statistical moments of earthquake numbers (skewness and kurtosis) and compare them to the theoretical values for both distributions. Empirical values for the skewness and the kurtosis increase for the smaller magnitude threshold and increase with even greater intensity for small temporal subdivision of catalogs. A calculation of the NBD skewness and kurtosis levels shows rapid increase of these upper moments levels. However, the observed catalog values of skewness and kurtosis are rising even faster. This means that for small time intervals the earthquake number distribution is even more heavy-tailed than the NBD predicts. 
Therefore for small time intervals we propose using empirical number distributions appropriately smoothed for testing forecasted earthquake numbers.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"608 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121982643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
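The Poisson-vs-NBD over-dispersion contrast is easy to see by simulation (synthetic rates, not the catalog data): matched to the same mean count, NBD draws have a variance-to-mean ratio well above one, while Poisson draws sit at one:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated annual earthquake counts under the two candidate models,
# matched to the same mean rate. The numbers are illustrative only.
mean_rate = 20.0
n_years = 100000

poisson_counts = rng.poisson(mean_rate, n_years)

# NumPy's negative binomial with parameters (n, p) has mean n(1-p)/p
# and variance n(1-p)/p^2, i.e. variance/mean = 1/p = 1 + mean/n.
n_param = 5.0
p_param = n_param / (n_param + mean_rate)   # mean = 20, var/mean = 5
nbd_counts = rng.negative_binomial(n_param, p_param, n_years)

poisson_ratio = poisson_counts.var() / poisson_counts.mean()   # near 1
nbd_ratio = nbd_counts.var() / nbd_counts.mean()               # near 5
```

The second NBD parameter (here `n_param`) directly controls this over-dispersion, which is what the catalog counts exhibit at small magnitude thresholds and short time windows.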
Pub Date: 2016-09-25. DOI: 10.5923/j.statistics.20150504.04
A. O. Bello, F. Oguntolu, O. Adetutu, Nyor Ngutor, J. P. Ojedokun
This research reports on the relationship and significance of social-economic factors (age, sex, employment status) and modes of HIV/AIDS transmission with respect to the spread of HIV/AIDS. A logistic regression model, a probabilistic model for binary responses, was used to relate the social-economic factors to HIV/AIDS spread. The fitted predictive model was then used to project the likely response of HIV/AIDS spread in a larger population using 10,000 bootstrap re-sampling observations.
{"title":"Application of Bootstrap Re-sampling Method to a Categorical Data of HIV/AIDS Spread across different Social-Economic Classes","authors":"A. O. Bello, F. Oguntolu, O. Adetutu, Nyor Ngutor, J. P. Ojedokun","doi":"10.5923/j.statistics.20150504.04","DOIUrl":"https://doi.org/10.5923/j.statistics.20150504.04","url":null,"abstract":"This research reports on the relationship and significance of social-economic factors (age, sex, employment status) and modes of HIV/AIDS transmission to the HIV/AIDS spread. Logistic regression model, a form of probabilistic function for binary response was used to relate social-economic factors (age, sex, employment status) to HIV/AIDS spread. The statistical predictive model was used to project the likelihood response of HIV/AIDS spread with a larger population using 10,000 Bootstrap re-sampling observations.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132217763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2016-09-23. DOI: 10.4310/SII.2018.V11.N4.A1
Y. V. Tan, C. Flannagan, M. Elliott
The development of driverless vehicles has spurred the need to predict human driving behavior in order to facilitate interaction between driverless and human-driven vehicles. Predicting human driving movements can be challenging, and poor prediction models can lead to accidents between driverless and human-driven vehicles. We used vehicle speed obtained from a naturalistic driving dataset to predict whether a human-driven vehicle would stop before executing a left turn. In a preliminary analysis, we found that Bayesian Additive Regression Trees (BART) produced less variable and higher AUC values compared to a variety of other state-of-the-art binary prediction methods. However, BART assumes independent observations, whereas our dataset consists of multiple observations clustered by driver. Although methods extending BART to clustered or longitudinal data are available, they lack readily available software and can only be applied to clustered continuous outcomes. We extend BART to handle correlated binary observations by adding a random intercept, and we used a simulation study to determine bias, root mean squared error, 95% coverage, and average length of the 95% credible interval in a correlated data setting. We then successfully implemented our random intercept BART model on our clustered dataset and found substantial improvements in prediction performance compared to BART and random intercept linear logistic regression.
{"title":"Predicting human-driving behavior to help driverless vehicles drive: random intercept Bayesian Additive Regression Trees","authors":"Y. V. Tan, C. Flannagan, M. Elliott","doi":"10.4310/SII.2018.V11.N4.A1","DOIUrl":"https://doi.org/10.4310/SII.2018.V11.N4.A1","url":null,"abstract":"The development of driverless vehicles has spurred the need to predict human driving behavior to facilitate interaction between driverless and human-driven vehicles. Predicting human driving movements can be challenging, and poor prediction models can lead to accidents between the driverless and human-driven vehicles. We used the vehicle speed obtained from a naturalistic driving dataset to predict whether a human-driven vehicle would stop before executing a left turn. In a preliminary analysis, we found that BART produced less variable and higher AUC values compared to a variety of other state-of-the-art binary predictor methods. However, BART assumes independent observations, but our dataset consists of multiple observations clustered by driver. Although methods extending BART to clustered or longitudinal data are available, they lack readily available software and can only be applied to clustered continuous outcomes. We extend BART to handle correlated binary observations by adding a random intercept and used a simulation study to determine bias, root mean squared error, 95% coverage, and average length of 95% credible interval in a correlated data setting. 
We then successfully implemented our random intercept BART model to our clustered dataset and found substantial improvements in prediction performance compared to BART and random intercept linear logistic regression.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115463618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
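The clustering problem the random intercept addresses can be simulated directly: give each driver a latent intercept on the logit scale and the per-driver stop rates become far more variable than independence would allow. This sketches the data structure (with invented sizes and effect scales), not the BART fit itself:

```python
import numpy as np

rng = np.random.default_rng(5)

# Clustered binary outcomes: each driver has a latent random intercept,
# so stop/no-stop decisions by the same driver are correlated.
n_drivers, trips = 200, 30
intercepts = rng.normal(0.0, 1.5, n_drivers)               # driver effects
logits = intercepts[:, None] + rng.normal(0.0, 0.5, (n_drivers, trips))
probs = 1.0 / (1.0 + np.exp(-logits))
stops = (rng.uniform(size=logits.shape) < probs).astype(float)

# Under independence, driver-level stop rates would vary only by
# binomial noise p(1-p)/trips; the random intercepts inflate this
# variance well beyond that, which is the violation plain BART ignores.
driver_rates = stops.mean(axis=1)
observed_var = driver_rates.var()
p = stops.mean()
binomial_var = p * (1 - p) / trips
```

A random intercept model absorbs `intercepts` into driver-level parameters; fitting BART to these rows as if independent would understate uncertainty.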
Pub Date: 2016-09-06. DOI: 10.6084/M9.FIGSHARE.3807537.V1
E. Thomas
In this paper we propose methodology for inferring binary-valued adjacency matrices from various measures of the strength of association between pairs of network nodes, or more generally pairs of variables. This strength of association can be quantified by sample covariance and correlation matrices, and more generally by test statistics and hypothesis-test p-values from arbitrary distributions. Community detection methods such as block modelling typically require binary-valued adjacency matrices as a starting point. Hence, a main motivation for the proposed methodology is to obtain binary-valued adjacency matrices from such pairwise measures of strength of association between variables. The proposed methodology is applicable to large high-dimensional data sets and is based on computationally efficient algorithms. We illustrate its utility in a range of contexts and data sets.
{"title":"Network Inference and Community Detection, Based on Covariance Matrices, Correlations and Test Statistics from Arbitrary Distributions","authors":"E. Thomas","doi":"10.6084/M9.FIGSHARE.3807537.V1","DOIUrl":"https://doi.org/10.6084/M9.FIGSHARE.3807537.V1","url":null,"abstract":"In this paper we propose methodology for inference of binary-valued adjacency matrices from various measures of the strength of association between pairs of network nodes, or more generally pairs of variables. This strength of association can be quantified by sample covariance and correlation matrices, and more generally by test-statistics and hypothesis test p-values from arbitrary distributions. Community detection methods such as block modelling typically require binary-valued adjacency matrices as a starting point. Hence, a main motivation for the methodology we propose is to obtain binary-valued adjacency matrices from such pairwise measures of strength of association between variables. The proposed methodology is applicable to large high-dimensional data-sets and is based on computationally efficient algorithms. We illustrate its utility in a range of contexts and data-sets.","PeriodicalId":409996,"journal":{"name":"arXiv: Applications","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123308840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}