Pub Date : 2024-07-25 DOI: 10.1016/j.csda.2024.108016
Mehrdad Naderi , Mostafa Tamandi , Elham Mirfarah , Wan-Lun Wang , Tsung-I Lin
With the steady growth of computer technologies, the application of statistical techniques to analyze extensive datasets has garnered substantial attention. The analysis of three-way (matrix-variate) data has emerged as a burgeoning field that has inspired statisticians in recent years to develop novel analytical methods. This paper introduces a unified finite mixture model that relies on the mean-mixture of matrix-variate normal distributions. The strength of the proposed model lies in its capability to capture and cluster a wide range of three-way data that exhibit heterogeneous, asymmetric and leptokurtic features. A computationally feasible ECME algorithm is developed to compute the maximum likelihood (ML) estimates. Numerous simulation studies are conducted to investigate the asymptotic properties of the ML estimators, validate the effectiveness of the Bayesian information criterion in selecting the appropriate model, and assess the classification ability in the presence of contaminated noise. The utility of the proposed methodology is demonstrated by analyzing a real-life data example.
Title : Three-way data clustering based on the mean-mixture of matrix-variate normal distributions (Computational Statistics & Data Analysis)
Pub Date : 2024-07-17 DOI: 10.1016/j.csda.2024.108026
Weichao Yang , Xu Guo , Lixing Zhu
This study investigates the testing of regression coefficients within high-dimensional generalized linear models featuring general covariance structures. The derived asymptotic properties reveal that distinct covariance structures can lead to varying limiting null distributions, including the normal distribution, for a widely employed quadratic-norm based test statistic. This circumstance renders it infeasible to determine critical values through a limiting null distribution. In response to this challenge, we propose a multiplier bootstrap test procedure for practical implementation. Additionally, we introduce a modified version of this procedure, incorporating projection when dealing with nuisance parameters. We then proceed to examine the asymptotic level and power of the proposed tests and assess their finite-sample performance through simulations. Finally, we present a real data analysis to illustrate the practical application of the proposed tests.
Title : Tests for high-dimensional generalized linear models under general covariance structure (Computational Statistics & Data Analysis)
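The multiplier bootstrap idea mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes a simplified global null (intercept only, no covariate effects), a plain squared-norm score statistic, and standard normal multipliers. The function name `multiplier_bootstrap_pvalue` and all simulation settings are hypothetical.

```python
import numpy as np

def multiplier_bootstrap_pvalue(X, y, n_boot=500, seed=0):
    """Quadratic-norm score test of 'no covariate effect' calibrated by a
    Gaussian multiplier bootstrap (illustrative assumption, not the
    paper's exact statistic)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    resid = y - y.mean()                             # residuals under the intercept-only null
    scores = (X - X.mean(axis=0)) * resid[:, None]   # n x p per-observation score terms
    stat = np.sum(scores.sum(axis=0) ** 2) / n       # squared norm of the summed scores
    mult = rng.standard_normal((n_boot, n))          # one row of multipliers per bootstrap draw
    boot = np.sum((mult @ scores) ** 2, axis=1) / n  # bootstrap analogues of the statistic
    return stat, float(np.mean(boot >= stat))

# Hypothetical simulated logistic data with signal in the first five covariates.
rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.standard_normal((n, p))
eta = 1.5 * X[:, :5].sum(axis=1)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
stat, pval = multiplier_bootstrap_pvalue(X, y)
print(stat, pval)
```

The point of the sketch is that the critical value comes from the resampled statistics rather than from a fixed limiting distribution, which is what makes the approach usable when that limit depends on the unknown covariance structure.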
Pub Date : 2024-07-14 DOI: 10.1016/j.csda.2024.108025
C.J.R. Murphy-Barltrop , J.L. Wadsworth
In many practical applications, evaluating the joint impact of combinations of environmental variables is important for risk management and structural design analysis. When such variables are considered simultaneously, non-stationarity can exist within both the marginal distributions and dependence structure, resulting in complex data structures. In the context of extremes, few methods have been proposed for modelling trends in extremal dependence, even though capturing this feature is important for quantifying joint impact. Moreover, most proposed techniques are only applicable to data structures exhibiting asymptotic dependence. Motivated by observed dependence trends of data from the UK Climate Projections, a novel semi-parametric modelling framework for bivariate extremal dependence structures is proposed. This framework can capture a wide variety of dependence trends for data exhibiting asymptotic independence. When applied to the climate projection dataset, the model detects significant dependence trends in observations and, in combination with models for marginal non-stationarity, can be used to produce estimates of bivariate risk measures at future time points.
Title : Modelling non-stationarity in asymptotically independent extremes (Computational Statistics & Data Analysis)
Pub Date : 2024-07-02 DOI: 10.1016/j.csda.2024.108013
Laura Vana-Gür
A multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects is proposed. This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, which accounts for the correlations at a single point in time and the persistence over time. The error distribution is assumed to be normal or Student-t distributed. The estimation is performed using composite likelihood methods. Through several simulation exercises, the quality of the estimates in different settings as well as in comparison with a Bayesian approach is investigated. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. Finally, the framework is illustrated using a data set containing bankruptcy and credit rating information for US exchange-listed companies.
Title : Multivariate ordinal regression for multiple repeated measurements (Computational Statistics & Data Analysis)
Pub Date : 2024-07-02 DOI: 10.1016/j.csda.2024.108015
Joakim Nyberg , Andrew C. Hooker , Georg Zimmermann , Johan Verbeeck , Martin Geroldinger , Konstantin Emil Thiel , Geert Molenberghs , Martin Laimer , Verena Wally
Epidermolysis bullosa simplex (EBS) is a rare skin disease, which makes optimal design techniques especially important for maximizing the potential information in a future study, that is, for making efficient use of the limited number of available subjects and observations. A generalized linear mixed effects model (GLMM) built on an EBS trial was used to optimize the design. The model assumed a full treatment effect in the follow-up period. In addition to this model, two models with either no assumed treatment effect or a linearly declining treatment effect in the follow-up were assumed. The information gain and loss from changing the number of EBS blister counts, altering the duration of the treatment, and changing the study period were assessed. In addition, optimization of the EBS blister assessment times was performed. The optimization used the derived Fisher information matrix for the GLMM with EBS blister counts, and the information gain and loss were quantified by D-optimal efficiency. The optimization results indicated that using optimal assessment times increases the information to about 110-120%, varying slightly between the assumed treatment models. In addition, the results showed that the information was sensitive to moving the assessment times by ± one week, whereas shifts within ± two days did not decrease the information as long as three assessments (out of four in the trial period) fell within the treatment period and not in the follow-up period. Increasing the number of assessments to six or five per trial period increased the information to 130% and 115%, respectively, while decreasing the number of assessments to two or three decreased the information to 50% and 80%, respectively. Increasing the length of the trial period had a minor impact on the information, while increasing the treatment period by two and four weeks had a larger impact (120% and 130%, respectively). To conclude, general applications of optimal design methodology, a derivation of the Fisher information matrix for GLMMs with count data, and examples of how optimal design could be used when designing trials for the treatment of EBS are presented. The methodology is also of interest for study designs where maximizing the information is essential. Therefore, general applied research guidance for using optimal design is also provided.
Title : Optimizing designs in clinical trials with an application in treatment of Epidermolysis bullosa simplex, a rare genetic skin disease (Computational Statistics & Data Analysis)
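For readers unfamiliar with the efficiency percentages quoted in this abstract, the commonly used definition of D-efficiency is sketched below. The abstract does not state the exact normalization the authors use, so the exponent $1/p$ and the symbols $\xi$, $\xi_{\mathrm{ref}}$, $\mathbf{F}$ and $p$ are assumptions introduced here for illustration.

```latex
\mathrm{Eff}_D(\xi, \xi_{\mathrm{ref}})
  = \left( \frac{\det \mathbf{F}(\xi)}{\det \mathbf{F}(\xi_{\mathrm{ref}})} \right)^{1/p}
```

Here $\mathbf{F}(\cdot)$ is the Fisher information matrix of a design, $\xi_{\mathrm{ref}}$ is the reference (original trial) design, and $p$ is the number of estimated parameters. Under this convention, and when the information scales linearly with the number of subjects, an efficiency of 120% would mean that the reference design needs roughly 1.2 times as many subjects to be as informative as $\xi$.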
Pub Date : 2024-07-01 DOI: 10.1016/j.csda.2024.108014
Katarzyna Reluga , María-José Lombardía , Stefan Sperlich
Linear mixed effects are considered excellent predictors of cluster-level parameters in various domains. However, previous research has demonstrated that their performance is affected by departures from model assumptions. Given the common occurrence of these departures in empirical studies, there is a need for inferential methods that are robust to misspecifications while remaining accessible and appealing to practitioners. Statistical tools have been developed for cluster-wise and simultaneous inference for mixed effects under distributional misspecifications, employing a user-friendly semiparametric random effect bootstrap. The merits and limitations of this approach are discussed in the general context of model misspecification. Theoretical analysis demonstrates the asymptotic consistency of the methods under general regularity conditions. Simulations show that the proposed intervals are robust to departures from modelling assumptions, including asymmetry and long tails in the distributions of errors and random effects, outperforming competitors in terms of empirical coverage probability. Finally, the methodology is applied to construct confidence intervals for household income across counties in the Spanish region of Galicia.
Title : Bootstrap-based statistical inference for linear mixed effects under misspecifications (Computational Statistics & Data Analysis)
Pub Date : 2024-06-27 DOI: 10.1016/j.csda.2024.108012
Qingyang Liu, Xianzheng Huang, Ray Bai
Compared to mean regression and quantile regression, the literature on modal regression is very sparse. A unifying framework for Bayesian modal regression is proposed, based on a family of unimodal distributions indexed by the mode, along with other parameters that allow for flexible shapes and tail behaviors. Sufficient conditions for posterior propriety under an improper prior on the mode parameter are derived. Following prior elicitation, regression analyses of simulated data and of datasets from several real-life applications are conducted. Besides drawing inference for covariate effects that are easy to interpret, prediction and model selection under the proposed Bayesian modal regression framework are also considered. Evidence from these analyses suggests that the proposed inference procedures are very robust to outliers, enabling one to discover interesting covariate effects missed by mean or median regression, and to construct much tighter prediction intervals than those from mean or median regression. Computer programs for implementing the proposed Bayesian modal regression are available at https://github.com/rh8liuqy/Bayesian_modal_regression.
Title : Bayesian modal regression based on mixture distributions (Computational Statistics & Data Analysis)
Pub Date : 2024-06-25 DOI: 10.1016/j.csda.2024.108010
Yixuan Liu , Claudia Kirch , Jeong Eun Lee , Renate Meyer
A novel approach to Bayesian nonparametric spectral analysis of stationary multivariate time series is presented. Starting with a parametric vector-autoregressive model, the parametric likelihood is nonparametrically adjusted in the frequency domain to account for potential deviations from parametric assumptions. A proof of mutual contiguity of the nonparametrically corrected likelihood, the multivariate Whittle likelihood approximation and the exact likelihood for Gaussian time series is given. A multivariate extension of the nonparametric Bernstein-Dirichlet process prior for univariate spectral densities to the space of Hermitian positive definite spectral density matrices is specified directly on the correction matrices. An infinite series representation of this prior is then used to develop a Markov chain Monte Carlo algorithm to sample from the posterior distribution. The code is made publicly available for ease of use and reproducibility. With this novel approach, a generalisation of the multivariate Whittle-likelihood-based method of Meier et al. (2020) as well as an extension of the nonparametrically corrected likelihood for univariate stationary time series of Kirch et al. (2019) to the multivariate case is presented. It is demonstrated that the nonparametrically corrected likelihood combines the efficiency of a parametric model with the robustness of a nonparametric model. Its numerical accuracy is illustrated in a comprehensive simulation study. Its practical advantages are illustrated by a spectral analysis of two environmental time series data sets: a bivariate time series of the Southern Oscillation Index and fish recruitment and a multivariate time series of windspeed data at six locations in California.
Title : A nonparametrically corrected likelihood for Bayesian spectral analysis of multivariate time series (Computational Statistics & Data Analysis)
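As background for the multivariate Whittle likelihood mentioned in this abstract, the following Python sketch evaluates a plain Whittle log-likelihood from periodogram matrices. It is only the uncorrected approximation, not the nonparametrically corrected likelihood of the paper. The $1/(2\pi n)$ periodogram scaling, the restriction to positive Fourier frequencies, the dropped additive constants, and the function names are all assumptions made here for illustration.

```python
import numpy as np

def periodogram_matrices(x):
    """Periodogram matrices I(lambda_j) = d_j d_j^* at the positive Fourier
    frequencies of an (n, d) series, using a 1/(2*pi*n) scaling (assumed)."""
    n, d = x.shape
    dft = np.fft.fft(x - x.mean(axis=0), axis=0) / np.sqrt(2.0 * np.pi * n)
    m = (n - 1) // 2                      # number of positive Fourier frequencies
    dft = dft[1:m + 1]                    # drop frequency zero
    return np.einsum("ja,jb->jab", dft, dft.conj())

def whittle_loglik(pgram, f):
    """Whittle log-likelihood, up to additive constants:
    -sum_j [ log det f(lambda_j) + tr( f(lambda_j)^{-1} I(lambda_j) ) ]."""
    _, logdet = np.linalg.slogdet(f)
    quad = np.einsum("jab,jba->j", np.linalg.inv(f), pgram).real
    return float(-np.sum(logdet + quad))

# Hypothetical check on bivariate white noise, whose spectral density
# matrix is the identity divided by 2*pi at every frequency.
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 2))
pgram = periodogram_matrices(x)
m = pgram.shape[0]
f_true = np.tile(np.eye(2) / (2.0 * np.pi), (m, 1, 1))
ll_true = whittle_loglik(pgram, f_true)
ll_wrong = whittle_loglik(pgram, 10.0 * f_true)
print(ll_true > ll_wrong)
```

The comparison at the end only checks that the true spectral density scores higher than a badly scaled one on this simulated series.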
Pub Date : 2024-06-21 DOI: 10.1016/j.csda.2024.108011
Schyan Zafar, Geoff K. Nicholls
Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as “kosmos” (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed.
Title : An embedded diachronic sense change model with a case study from ancient Greek (Computational Statistics & Data Analysis)
Pub Date : 2024-06-20 DOI: 10.1016/j.csda.2024.108009
Xuan Ma, Jenný Brynjarsdóttir, Thomas LaFramboise
A double Pólya-Gamma data augmentation scheme is developed for posterior sampling from a Bayesian hierarchical model of total and categorical count data. The scheme applies to a Negative Binomial - Binomial (NBB) hierarchical regression model with logit links and normal priors on regression coefficients. The approach is shown to be very efficient and in most cases out-performs the Stan program. The hierarchical modeling framework and the Pólya-Gamma data augmentation scheme are applied to human mitochondrial DNA data.
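The abstract does not spell out the augmentation itself. As a hedged sketch, the standard Pólya-Gamma identity that schemes of this kind build on is given below; the symbols ψ, κ, ω, r, n, X, b_0 and B are notation assumed here for illustration, and the paper's own parameterisation may differ. Both the Negative Binomial and the Binomial likelihoods under a logit link have the form on the left-hand side, which is presumably why two sets of Pólya-Gamma variables (a "double" scheme) are introduced.

```latex
% Polya-Gamma identity: for b > 0, kappa = a - b/2, and p(omega) the PG(b, 0) density,
\[
  \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega .
\]
% Binomial(n, p) with logit(p) = psi and y successes:  a = y, b = n,     kappa = y - n/2.
% Negative Binomial with dispersion r and count y:     a = y, b = y + r, kappa = (y - r)/2.
%
% With linear predictor psi_i = x_i^T beta and a normal prior beta ~ N(b_0, B),
% the resulting Gibbs updates are conditionally conjugate:
\[
  \omega_i \mid \beta \sim \mathrm{PG}\!\left(b_i,\; x_i^{\top}\beta\right),
  \qquad
  \beta \mid \omega \sim \mathrm{N}\!\left(m_{\omega},\, V_{\omega}\right),
\]
\[
  V_{\omega} = \left(X^{\top}\Omega X + B^{-1}\right)^{-1},
  \qquad
  m_{\omega} = V_{\omega}\left(X^{\top}\kappa + B^{-1} b_0\right),
  \qquad
  \Omega = \operatorname{diag}(\omega_1, \dots, \omega_N).
\]
```

Conditional on ω, each regression block reduces to a Gaussian update, so no Metropolis step or gradient-based tuning is needed for the coefficients.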
{"title":"A double Pólya-Gamma data augmentation scheme for a hierarchical Negative Binomial - Binomial data model","authors":"Xuan Ma, Jenný Brynjarsdóttir, Thomas LaFramboise","doi":"10.1016/j.csda.2024.108009","DOIUrl":"https://doi.org/10.1016/j.csda.2024.108009","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis"},"PeriodicalIF":1.5,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167947324000938/pdfft?md5=5e06b3420d4ee7efb587c1f231e8d551&pid=1-s2.0-S0167947324000938-main.pdf","platform":"Semanticscholar","paperid":"141485449","RegionNum":3,"RegionCategory":"Mathematics","PMCID":"OA","Total":0}