ROBustness In Network (robin): an R Package for Comparison and Validation of Communities
V. Policastro, D. Righelli, A. Carissimo, L. Cutillo, I. Feis
The R Journal, 2021, p. 292. DOI: 10.32614/rj-2021-040
In network analysis, many community detection algorithms have been developed; however, their implementations leave unaddressed the question of statistically validating the results. Here we present robin (ROBustness In Network), an R package that assesses the robustness of the community structure found in a network by one or more methods, giving an indication of their reliability. The procedure first tests whether the community structure found by a set of algorithms is statistically significant, and then compares two selected detection algorithms on the same graph to choose the one that better fits the network of interest. We demonstrate the use of our package on the American College Football benchmark dataset.
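A minimal sketch of this two-step workflow, following the package vignette as we recall it (function and argument names such as prepGraph, random, robinRobust, and robinCompare are taken from the package documentation and may differ across versions):

```r
library(robin)

# American College Football benchmark network shipped with the package
fball <- system.file("example/football.gml", package = "robin")
graph <- prepGraph(file = fball, file.format = "gml")

# Step 1: significance -- compare the structure found by Louvain on the
# real graph against the same procedure on a rewired random graph
graphRandom <- random(graph = graph)
proc <- robinRobust(graph = graph, graphRandom = graphRandom,
                    method = "louvain", measure = "vi")
plotRobin(graph = graph, model1 = proc$Mean, model2 = proc$MeanRandom,
          measure = "vi")

# Step 2: model choice -- compare two detection algorithms on the same graph
comp <- robinCompare(graph = graph, method1 = "fastGreedy",
                     method2 = "louvain", measure = "vi")
```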
{"title":"ROBustness In Network (robin): an R Package for Comparison and Validation of Communities","authors":"V. Policastro, D. Righelli, A. Carissimo, L. Cutillo, I. Feis","doi":"10.32614/rj-2021-040","DOIUrl":"https://doi.org/10.32614/rj-2021-040","url":null,"abstract":"In network analysis, many community detection algorithms have been developed, however, their implementation leaves unaddressed the question of the statistical validation of the results. Here we present robin(ROBustness In Network), an R package to assess the robustness of the community structure of a network found by one or more methods to give indications about their reliability. The procedure initially detects if the community structure found by a set of algorithms is statistically significant and then compares two selected detection algorithms on the same graph to choose the one that better fits the network of interest. We demonstrate the use of our package on the American College Football benchmark dataset.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"17 1","pages":"292"},"PeriodicalIF":0.0,"publicationDate":"2021-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80573135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

bssm: Bayesian Inference of Non-linear and Non-Gaussian State Space Models in R
Jouni Helske, M. Vihola
The R Journal, 2021, p. 471. DOI: 10.32614/RJ-2021-103

We present the R package bssm for Bayesian non-linear/non-Gaussian state space modelling. Unlike existing packages, bssm allows easy-to-use approximate inference based on Gaussian approximations such as the Laplace approximation and the extended Kalman filter. The package also accommodates discretely observed latent diffusion processes. Inference is based on fully automatic, adaptive Markov chain Monte Carlo (MCMC) on the hyperparameters, with optional importance sampling post-correction to eliminate any approximation bias. The package also implements direct pseudo-marginal MCMC and delayed acceptance pseudo-marginal MCMC using intermediate approximations. It offers an easy-to-use interface for defining models with linear-Gaussian state dynamics and non-Gaussian observation models, and has an Rcpp interface for specifying custom non-linear and diffusion models. State space models are a flexible tool for analysing a variety of time series data, and bssm supports fully Bayesian inference for a large class of such models with several alternative MCMC sampling strategies; all computationally intensive parts of the package are implemented in C++.
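A minimal sketch on simulated data, using the bsm_ng constructor and run_mcmc as documented as we recall them (the halfnormal prior helper is part of the package interface; defaults and argument names may differ by version):

```r
library(bssm)
set.seed(1)

# Toy data: Poisson counts driven by a latent random-walk level
n <- 100
level <- cumsum(rnorm(n, sd = 0.1))
y <- rpois(n, exp(level))

# Linear-Gaussian state dynamics with a non-Gaussian (Poisson) observation model
model <- bsm_ng(y, sd_level = halfnormal(0.1, 1), distribution = "poisson")

# Adaptive MCMC with importance-sampling post-correction ("is2")
fit <- run_mcmc(model, iter = 10000, particles = 10, mcmc_type = "is2")
summary(fit)
```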
{"title":"bssm: Bayesian Inference of Non-linear and Non-Gaussian State Space Models in R","authors":"Jouni Helske, M. Vihola","doi":"10.32614/RJ-2021-103","DOIUrl":"https://doi.org/10.32614/RJ-2021-103","url":null,"abstract":"We present an R package bssm for Bayesian non-linear/non-Gaussian state space modelling. Unlike the existing packages, bssm allows for easy-to-use approximate inference based on Gaussian approximations such as the Laplace approximation and the extended Kalman filter. The package accommodates also discretely observed latent diffusion processes. The inference is based on fully automatic, adaptive Markov chain Monte Carlo (MCMC) on the hyperparameters, with optional importance sampling post-correction to eliminate any approximation bias. The package implements also a direct pseudo-marginal MCMC and a delayed acceptance pseudo-marginal MCMC using intermediate approximations. The package offers an easy-to-use interface to define models with linear-Gaussian state dynamics with non-Gaussian observation models, and has an Rcpp interface for specifying custom non-linear and diffusion models. models are a flexible tool for analysing a variety of time series data. Here we introduced the R package bssm for fully Bayesian state space modelling for a large class of models with several alternative MCMC sampling strategies. All computationally intensive parts of the package are","PeriodicalId":20974,"journal":{"name":"R J.","volume":"55 1","pages":"471"},"PeriodicalIF":0.0,"publicationDate":"2021-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90068033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Finding Optimal Normalizing Transformations via bestNormalize
Ryan A. Peterson
The R Journal, 2021, p. 310. DOI: 10.32614/rj-2021-041

The bestNormalize R package was designed to help users find a transformation that can effectively normalize a vector regardless of its actual distribution. Each of the many normalization techniques that have been developed has its own strengths and weaknesses, and deciding which to use before the data are fully observed is difficult or impossible. This package facilitates choosing between a range of possible transformations and automatically returns the best one, i.e., the one that makes the data look the most normal. To evaluate and compare normalization efficacy across a suite of possible transformations, we developed a statistic based on a goodness-of-fit test divided by its degrees of freedom. Transformations can be seamlessly trained and applied to newly observed data, and can be used in conjunction with caret and recipes for data preprocessing in machine learning workflows. Custom transformations and normalization statistics are supported.
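A short sketch of the core workflow (training on one vector, then applying and inverting the trained transformation on new data); the sample data here are illustrative:

```r
library(bestNormalize)
set.seed(100)

x <- rgamma(250, shape = 1, rate = 1)   # skewed training vector

BN <- bestNormalize(x)   # fit candidate transformations, keep the best
BN$chosen_transform      # inspect which transformation was selected

x_new  <- rgamma(50, shape = 1, rate = 1)
xt_new <- predict(BN, newdata = x_new)                  # transform new data
x_back <- predict(BN, newdata = xt_new, inverse = TRUE) # back-transform
```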
{"title":"Finding Optimal Normalizing Transformations via bestNormalize","authors":"Ryan A. Peterson","doi":"10.32614/rj-2021-041","DOIUrl":"https://doi.org/10.32614/rj-2021-041","url":null,"abstract":"The bestNormalize R package was designed to help users find a transformation that can effectively normalize a vector regardless of its actual distribution. Each of the many normalization techniques that have been developed has its own strengths and weaknesses, and deciding which to use until data are fully observed is difficult or impossible. This package facilitates choosing between a range of possible transformations and will automatically return the best one, i.e., the one that makes data look the most normal. To evaluate and compare the normalization efficacy across a suite of possible transformations, we developed a statistic based on a goodness of fit test divided by its degrees of freedom. Transformations can be seamlessly trained and applied to newly observed data, and can be implemented in conjunction with caret and recipes for data preprocessing in machine learning workflows. Custom transformations and normalization statistics are supported.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"19 1","pages":"310"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74589020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A New Versatile Discrete Distribution
Rolf Turner
The R Journal, 2021, p. 427. DOI: 10.32614/rj-2021-067

This paper introduces a new flexible distribution for discrete data. Approximate moment estimators of the parameters of the distribution, to be used as starting values for numerical optimization procedures, are discussed. “Exact” moment estimation, effected via a numerical procedure, and maximum likelihood estimation are considered. The quality of the results produced by these estimators is assessed via simulation experiments. Several examples are given of fitting instances of the new distribution to real and simulated data. It is noted that the new distribution is a member of the exponential family. Expressions for the gradient and Hessian of the log-likelihood of the new distribution are derived. The former facilitates the numerical maximization of the likelihood with optim(); the latter provides a means of calculating or estimating the covariance matrix of the parameter estimates. A discrepancy between estimates of the covariance matrix obtained by inverting the Hessian and those obtained by Monte Carlo methods is discussed.
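The paper's distribution is not reproduced here, but the estimation mechanics it describes (numerical likelihood maximization with optim() using an analytic gradient, plus a covariance estimate from the inverted Hessian) can be illustrated with a simple stand-in model; everything below is a generic example, not the paper's distribution:

```r
set.seed(42)
y <- rpois(200, lambda = 3)   # stand-in discrete data

# Negative log-likelihood and its gradient in theta = log(lambda)
nll  <- function(theta) -sum(dpois(y, exp(theta), log = TRUE))
grad <- function(theta) -(sum(y) - length(y) * exp(theta))

fit <- optim(par = 0, fn = nll, gr = grad, method = "BFGS", hessian = TRUE)
exp(fit$par)        # maximum likelihood estimate of lambda
solve(fit$hessian)  # variance of theta-hat via the inverted Hessian
```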
{"title":"A New Versatile Discrete Distribution","authors":"Rolf Turner","doi":"10.32614/rj-2021-067","DOIUrl":"https://doi.org/10.32614/rj-2021-067","url":null,"abstract":"This paper introduces a new flexible distribution for discrete data. Approximate moment estimators of the parameters of the distribution, to be used as starting values for numerical optimization procedures, are discussed. “Exact” moment estimation, effected via a numerical procedure, and maximum likelihood estimation, are considered. The quality of the results produced by these estimators is assessed via simulation experiments. Several examples are given of fitting instances of the new distribution to real and simulated data. It is noted that the new distribution is a member of the exponential family. Expressions for the gradient and Hessian of the log-likelihood of the new distribution are derived. The former facilitates the numerical maximization of the likelihood with optim(); the latter provides means of calculating or estimating the covariance matrix of of the parameter estimates. A discrepancy between estimates of the covariance matrix obtained by inverting the Hessian and those obtained by Monte Carlo methods is discussed.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"154 1","pages":"427"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72696852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"We Need Trustworthy R Packages","authors":"Will Landau","doi":"10.32614/rj-2021-109","DOIUrl":"https://doi.org/10.32614/rj-2021-109","url":null,"abstract":"","PeriodicalId":20974,"journal":{"name":"R J.","volume":"85 1","pages":"661"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84316073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

dad: an R Package for Visualisation, Classification and Discrimination of Multivariate Groups Modelled by their Densities
R. Boumaza, Pierre Santagostini, Smail Yousfi, S. Demotes-Mainard
The R Journal, 2021, p. 90. DOI: 10.32614/rj-2021-071
Multidimensional scaling (MDS), hierarchical cluster analysis (HCA) and discriminant analysis (DA) are classical techniques for data consisting of n individuals and p variables. When the individuals are divided into T groups, the R package dad associates with each group a multivariate probability density function and then carries out these techniques on the densities, which are estimated from the data under consideration. The techniques are based on distance measures between densities: chi-square, Hellinger, Jeffreys, Jensen-Shannon and Lp for discrete densities; Hellinger, Jeffreys, L2 and 2-Wasserstein for Gaussian densities; and L2 for numeric non-Gaussian densities estimated by the Gaussian kernel method. Practical methods help the user give meaning to the outputs in the context of MDS and HCA, and search for an optimal prediction in the context of DA based on the leave-one-out misclassification ratio. Functions for data management and basic statistical calculations on groups are also provided.
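A hedged sketch of the density-based MDS and HCA steps on a familiar grouped dataset; as.folder, fmdsd and fhclustd are taken from the package documentation as we recall it, and argument details may differ by version:

```r
library(dad)

# Treat each iris species as a group of multivariate observations
irisf <- as.folder(iris, groups = "Species")

# MDS on distances between the per-group densities (Gaussian by default)
mds <- fmdsd(irisf)
plot(mds)

# Hierarchical clustering of the groups from the same kind of distances
hc <- fhclustd(irisf)
plot(hc)
```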
{"title":"dad: an R Package for Visualisation, Classification and Discrimination of Multivariate Groups Modelled by their Densities","authors":"R. Boumaza, Pierre Santagostini, Smail Yousfi, S. Demotes-Mainard","doi":"10.32614/rj-2021-071","DOIUrl":"https://doi.org/10.32614/rj-2021-071","url":null,"abstract":"Multidimensional scaling (MDS), hierarchical cluster analysis (HCA) and discriminant analysis (DA) are classical techniques which deal with data made of n individuals and p variables. When the individuals are divided into T groups, the R package dad associates with each group a multivariate probability density function and then carries out these techniques on the densities which are estimated by the data under consideration. These techniques are based on distance measures between densities: chi-square, Hellinger, Jeffreys, Jensen-Shannon and L p for discrete densities, Hellinger , Jeffreys, L 2 and 2-Wasserstein for Gaussian densities, and L 2 for numeric non Gaussian densities estimated by the Gaussian kernel method. Practical methods help the user to give meaning to the outputs in the context of MDS and HCA, and to look for an optimal prediction in the context of DA based on the one-leave-out misclassification ratio. Some functions for data management or basic statistics calculations on groups are annexed.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"12 1","pages":"90"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85463630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks
Daniel Osório, Kelly Botero, A. Velasco, Nicolás Mendoza-Mejía, Felipe Rojas-Rodríguez, G. Barreto, Janneth González
The R Journal, 2021, p. 6. DOI: 10.32614/rj-2021-064
When building a genome-scale metabolic model, there are typically several dead-end metabolites and substrates that cannot be imported, produced, or used by any reaction incorporated in the network. The presence of these dead-end metabolites can block the net flux of the objective function when it is evaluated through Flux Balance Analysis (FBA); even when flux is not blocked, they increase bias in the biological conclusions. The refinement needed to restore the connectivity of the network can be carried out manually or with computational algorithms. The g2f package was designed as a tool to find the gaps caused by dead-end metabolites and fill them with stoichiometric reactions from a reference, filtering candidate reactions using a weighting function. Additionally, the algorithm can download the full set of gene-associated stoichiometric reactions for a specific organism from the KEGG database. Our package is compatible with both R 3.6.0 and R 4.0.0.
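A hedged sketch of the gap-filling workflow; getReference and gapFill follow the package interface as we recall it, and modelReactions stands in for the user's own set of model reactions:

```r
library(g2f)

# Download KEGG's gene-associated stoichiometric reactions for E. coli
reference <- getReference(organism = "eco")

# modelReactions: a character vector of the model's stoichiometric
# reactions (hypothetical placeholder for the user's own model)
filled <- gapFill(reactionList = modelReactions,
                  reference = reference,
                  limit = 0.25)   # keep candidates below this addition cost
```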
{"title":"g2f as a Novel Tool to Find and Fill Gaps in Metabolic Networks","authors":"Daniel Osório, Kelly Botero, A. Velasco, Nicolás Mendoza-Mejía, Felipe Rojas-Rodríguez, G. Barreto, Janneth González","doi":"10.32614/rj-2021-064","DOIUrl":"https://doi.org/10.32614/rj-2021-064","url":null,"abstract":"During the building of a genome-scale metabolic model, there are several dead-end metabolites and substrates which cannot be imported, produced nor used by any reaction incorporated in the network. The presence of these dead-end metabolites can block out the net flux of the objective function when it is evaluated through Flux Balance Analysis (FBA), and when it is not blocked, bias in the biological conclusions increase. In this aspect, the refinement to restore the connectivity of the network can be carried out manually or using computational algorithms. The g2f package was designed as a tool to find the gaps from dead-end metabolites and fill them from the stoichiometric reactions of a reference, filtering candidate reactions using a weighting function. Additionally, this algorithm allows to download all the set of gene-associated stoichiometric reactions for a specific organism from the KEGG database. Our package is compatible with both 4.0.0 and 3.6.0 R versions.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"26 1","pages":"6"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78277396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The bdpar Package: Big Data Pipelining Architecture for R
Miguel Ferreiro-Díaz, T. Cotos-Yáñez, J. R. Méndez, David Ruano-Ordás
The R Journal, 2021, p. 130. DOI: 10.32614/rj-2021-065
In recent years, big data has become a useful paradigm for exploiting multiple sources to find relevant knowledge in real domains (such as designing personalized marketing campaigns or helping to palliate the effects of several deadly diseases). Big data programming tools and methods have evolved over time from a MapReduce archetype to a pipeline-based one; in particular, pipelining schemes have become the most reliable way of processing and analysing large amounts of data. To this end, this work introduces bdpar, a new, highly customizable pipeline-based framework (built on the OOP paradigm provided by the R6 package) able to execute multiple pre-processing tasks over heterogeneous data sources. To increase flexibility and performance, bdpar provides helpful features such as (i) a novel object-based pipe operator (%>|%), (ii) the ability to easily design and deploy new (and customized) input data parsers, tasks, and pipelines, (iii) only-once execution, which skips previously processed information (instances) and guarantees that only new input data and new pipelines are executed, (iv) the capability to perform serial or parallel operations according to the user's needs, and (v) a debugging mechanism that allows users to check the status of each instance (and find possible errors) throughout the process.
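A minimal sketch of invoking the framework's built-in pipeline; runPipeline and DefaultPipeline follow the package's documented interface as we recall it, and "sample_data/" is a hypothetical input folder:

```r
library(bdpar)

# Process every file under a folder with the built-in default pipeline;
# caching gives the only-once execution behaviour described above
output <- runPipeline(path = "sample_data/",
                      pipeline = DefaultPipeline$new(),
                      cache = TRUE,
                      verbose = FALSE)
```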
{"title":"The bdpar Package: Big Data Pipelining Architecture for R","authors":"Miguel Ferreiro-Díaz, T. Cotos-Yáñez, J. R. Méndez, David Ruano-Ordás","doi":"10.32614/rj-2021-065","DOIUrl":"https://doi.org/10.32614/rj-2021-065","url":null,"abstract":"In the last years, big data has become a useful paradigm for taking advantage of multiple sources to find relevant knowledge in real domains (such as the design of personalized marketing campaigns or helping to palliate the effects of several mortal diseases). Big data programming tools and methods have evolved over time from a MapReduce to a pipeline-based archetype. Concretely the use of pipelining schemes has become the most reliable way of processing and analysing large amounts of data. To this end, this work introduces bdpar, a new highly customizable pipeline-based framework (using the OOP paradigm provided by R6 package) able to execute multiple pre-processing tasks over heterogeneous data sources. Moreover, to increase the flexibility and performance, bdpar provides helpful features such as (i) the definition of a novel object-based pipe operator (%>|%), (ii) the ability to easily design and deploy new (and customized) input data parsers, tasks and pipelines, (iii) only-once execution which avoids the execution of previously processed information (instances) guaranteeing that only new both input data and pipelines are executed, (iv) the capability to perform serial or parallel operations according to the user needs, (v) the inclusion of a debugging mechanism which allows users to check the status of each instance (and find possible errors) throughout the process.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"43 1","pages":"130"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77507422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A GUIded tour of Bayesian regression
A. Ramírez‐Hassan, Mateo Graciano-Londoño
The R Journal, 2021, p. 57. DOI: 10.32614/rj-2021-081

This paper presents a graphical user interface (GUI) for carrying out Bayesian regression analysis in a user-friendly, drag-and-drop environment that requires no programming skills. It is designed for teaching and applied purposes at an introductory level. Our GUI is an interactive web application built with shiny and other R libraries. We work through several applications to highlight the potential of our GUI for applied researchers and practitioners. In addition, the Help option in the main tab panel contains an extended version of this paper, in which we present the basic theory underlying all the regression models implemented in our GUI, together with further applications of each model.
{"title":"A GUIded tour of Bayesian regression","authors":"A. Ramírez‐Hassan, Mateo Graciano-Londoño","doi":"10.32614/rj-2021-081","DOIUrl":"https://doi.org/10.32614/rj-2021-081","url":null,"abstract":"This paper presents a graphical user interface (GUI) to carry out a Bayesian regression analysis in a very friendly environment without any programming skills (drag and drop). This paper is designed for teaching and applied purposes at an introductory level. Our GUI is based on an interactive web application using shiny, and libraries from R. We carry out some applications to highlight the potential of our GUI for applied researchers and practitioners. In addition, the Help option in the main tap panel has an extended version of this paper, where we present the basic theory underlying all regression models that we developed in our GUI, and more applications associated with each model.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"13 1","pages":"57"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82087117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

BayesSPsurv: An R Package to Estimate Bayesian (Spatial) Split-Population Survival Models
Brandon Bolte, N. Schmidt, Sergio Béjar, N. Huynh, Bumba Mukherjee
The R Journal, 2021, p. 595. DOI: 10.32614/rj-2021-068
Survival data often include a fraction of units that are susceptible to an event of interest as well as a fraction of “immune” units. In many applications, spatial clustering in unobserved risk factors across nearby units can also affect their survival rates and their odds of becoming immune. To address these methodological challenges, this article introduces our BayesSPsurv R package, which fits parametric Bayesian spatial split-population survival (cure) models that account for spatial autocorrelation in both subpopulations of the user's time-to-event data. Spatial autocorrelation is modeled with spatially weighted frailties, which are estimated using a conditionally autoregressive prior. The user can also fit parametric cure models with or without non-spatial i.i.d. frailties, and each model can incorporate time-varying covariates. BayesSPsurv also includes functions to conduct pre-estimation spatial autocorrelation tests, visualize results, and assess model performance, all of which are illustrated using data on post-civil-war peace survival.
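A hedged sketch of fitting the spatial model; spatialSPsurv follows the package's documented interface as we recall it, and every name below (mydata, the covariates, the adjacency matrix adj) is a hypothetical placeholder for the user's own data:

```r
library(BayesSPsurv)

# Spatial split-population Weibull model: one equation for survival
# duration, one for the probability of being "cured" (immune)
fit <- spatialSPsurv(
  duration = duration ~ x1 + x2,            # survival-stage equation
  immune   = cured ~ x1,                    # immunity-stage equation
  Y0 = "t0", LY = "lastyear", S = "sp_id",  # spell start, last year, spatial id
  data = mydata, N = 1000, burn = 100, thin = 10,
  w = c(1, 1, 1), m = 10, form = "Weibull",
  A = adj                                   # spatial adjacency matrix
)
```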
{"title":"BayesSPsurv: An R Package to Estimate Bayesian (Spatial) Split-Population Survival Models","authors":"Brandon Bolte, N. Schmidt, Sergio Béjar, N. Huynh, Bumba Mukherjee","doi":"10.32614/rj-2021-068","DOIUrl":"https://doi.org/10.32614/rj-2021-068","url":null,"abstract":"Survival data often include a fraction of units that are susceptible to an event of interest as well as a fraction of “immune” units. In many applications, spatial clustering in unobserved risk factors across nearby units can also affect their survival rates and odds of becoming immune. To address these methodological challenges, this article introduces our BayesSPsurv R-package, which fits parametric Bayesian Spatial split-population survival (cure) models that can account for spatial autocorrelation in both subpopulations of the user’s time-to-event data. Spatial autocorrelation is modeled with spatially weighted frailties, which are estimated using a conditionally autoregressive prior. The user can also fit parametric cure models with or without non-spatial i.i.d. frailties, and each model can incorporate time-varying covariates. BayesSPsurv also includes various functions to conduct pre-estimation spatial autocorrelation tests, visualize results, and assess model performance, all of which are illustrated using data on post-civil war peace survival.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"3 1","pages":"595"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89243776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}