Julia Sharp, Emily H. Griffith, Bruce A. Craig, Alexandra Hanlon, Sarah Peskoe, Jennifer Van Mullekom
The delivery of academic statistical collaboration resources can vary across types of institutions and over time. In particular, units can differ in how they manage infrastructure and their business model, in their staffing model, and in the opportunities they offer for staff development. In this manuscript, we present examples of these three themes in modern academic statistical collaboration units and describe key advantages and challenges.
{"title":"The current landscape of academic statistical and data science collaboration units with examples","authors":"Julia Sharp, Emily H. Griffith, Bruce A. Craig, Alexandra Hanlon, Sarah Peskoe, Jennifer Van Mullekom","doi":"10.1002/sta4.718","DOIUrl":"https://doi.org/10.1002/sta4.718","journal":{"name":"Stat"},"publicationDate":"2024-07-26","publicationTypes":"Journal Article"}
In this paper, we propose a novel two‐sample test for multivariate sample spaces. The test statistic is the mean of the absolute differences of average interpoint distances. We use a permutation procedure to establish the critical value for the test. Through comprehensive simulation studies, we compare the performance of our proposed test with that of the K‐nearest neighbour test and the energy test. The results demonstrate that our proposed test has advantages over the other two tests, particularly in high‐dimensional sample spaces. This advantage is further supported by an application to UCR time series datasets.
{"title":"New two‐sample test utilizing interpoint distance discrepancy","authors":"Dong Xu","doi":"10.1002/sta4.712","DOIUrl":"https://doi.org/10.1002/sta4.712","journal":{"name":"Stat"},"publicationDate":"2024-07-22","publicationTypes":"Journal Article"}
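The permutation procedure mentioned in the abstract above can be illustrated with a small sketch. The abstract does not define the statistic precisely, so `ipd_statistic` below is only one possible reading (for every pooled point, compare its average distance to each sample, then average the absolute differences); the function names, the Euclidean metric and the default `n_perm` are assumptions made for illustration, not the author's implementation.

```python
import math
import random


def avg_dist(point, sample):
    """Average Euclidean distance from `point` to every point in `sample`.

    If `point` itself belongs to `sample`, its zero self-distance is included.
    """
    return sum(math.dist(point, s) for s in sample) / len(sample)


def ipd_statistic(xs, ys):
    """Assumed statistic: mean over pooled points of
    |average distance to xs - average distance to ys|."""
    pooled = xs + ys
    return sum(abs(avg_dist(z, xs) - avg_dist(z, ys)) for z in pooled) / len(pooled)


def permutation_pvalue(xs, ys, n_perm=200, seed=0):
    """Permutation p-value: reshuffle the group labels and recompute the statistic."""
    rng = random.Random(seed)
    observed = ipd_statistic(xs, ys)
    pooled = xs + ys          # new list, so the inputs are not modified
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ipd_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The +1 correction counts the observed labelling as one permutation.
    return (exceed + 1) / (n_perm + 1)
```

Each sample is a list of equal-length tuples. Because the critical value comes from relabelling the pooled data, no distributional assumption is needed beyond exchangeability under the null hypothesis.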
Michele Cavazzutti, Eleonora Arnone, Federico Ferraccioli, Cristina Galimberti, Livio Finos, Laura M. Sangalli
Summary: We address the problem of performing inference on the linear and nonlinear terms of a semiparametric spatial regression model with differential regularisation. For the linear term, we propose a new resampling procedure, based on (partial) sign‐flipping of an appropriate transformation of the residuals of the model. The proposed resampling scheme can mitigate the bias induced by the differential regularisation. We prove that the proposed test is asymptotically exact. Moreover, we show by simulation studies that, unlike parametric alternatives, it controls the Type‐I error very well even in small‐sample scenarios. Additionally, we show that the proposed test has higher power than recently proposed nonparametric tests on the linear term of semiparametric regression models with differential regularisation. Concerning the nonlinear term, we develop three different inference approaches: a parametric one and two nonparametric alternatives. The nonparametric tests are based on a sign‐flip approach. One of these is proved to be asymptotically exact, while the other is proved to be exact also for finite samples. Simulation studies highlight the good control of Type‐I error of the nonparametric approaches relative to the parametric test, while retaining high power.
{"title":"Sign‐flip inference for spatial regression with differential regularisation","authors":"Michele Cavazzutti, Eleonora Arnone, Federico Ferraccioli, Cristina Galimberti, Livio Finos, Laura M. Sangalli","doi":"10.1002/sta4.711","DOIUrl":"https://doi.org/10.1002/sta4.711","journal":{"name":"Stat"},"publicationDate":"2024-07-17","publicationTypes":"Journal Article"}
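To make the sign-flip idea in the abstract above concrete, here is a minimal sketch of a generic sign-flip test. It is not the partial sign-flipping of transformed residuals that the paper proposes: it assumes per-observation score contributions (`scores`, a hypothetical input) that are symmetric around zero under the null hypothesis, and the function name and the default `n_flips` are illustrative choices.

```python
import random


def sign_flip_pvalue(scores, n_flips=999, seed=0):
    """Two-sided sign-flip p-value for H0: the scores are symmetric around zero.

    Each resample multiplies every score by an independent random sign and
    recomputes the absolute sum, which is compared with the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores))
    exceed = 0
    for _ in range(n_flips):
        flipped = sum(s if rng.random() < 0.5 else -s for s in scores)
        if abs(flipped) >= observed:
            exceed += 1
    # The +1 correction counts the observed (unflipped) configuration.
    return (exceed + 1) / (n_flips + 1)
```

For a simple regression through the origin, the scores could for instance be the products of covariate and response values, which have mean zero when the coefficient is zero.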
Tzung Hsuen Khoo, Dharini Pathmanathan, Philipp Otto, Sophie Dabo‐Niang
Stock market indices are volatile by nature, and sudden shocks are known to affect volatility patterns. The autoregressive conditional heteroskedasticity (ARCH) and generalized ARCH (GARCH) models neglect structural breaks triggered by sudden shocks, which may lead to an overestimation of persistence and hence an upward bias in the estimates. Regime‐switching models in which abrupt switches are governed by a Markov chain were developed to model volatility in financial time series data. Volatility modelling was also extended to spatially interconnected time series, resulting in spatial variants of ARCH models. This inspired us to propose a Markov‐switching framework for the spatio‐temporal log‐ARCH model. In this article, we discuss the Markov‐switching extension of the model, the estimation procedure and the smoothed inference of the regimes. Monte Carlo simulation studies show that the maximum likelihood estimation method for our proposed model has good finite sample properties. The proposed model was applied to data on 28 stock indices that were presumably affected by the 2015–2016 Chinese stock market crash. The results showed that our model fits better than its one‐regime counterpart. Furthermore, the smoothed inference of the data indicated the approximate periods where structural breaks occurred. This model can capture structural breaks that simultaneously occur in nearby locations.
{"title":"A Markov‐switching spatio‐temporal ARCH model","authors":"Tzung Hsuen Khoo, Dharini Pathmanathan, Philipp Otto, Sophie Dabo‐Niang","doi":"10.1002/sta4.713","DOIUrl":"https://doi.org/10.1002/sta4.713","journal":{"name":"Stat"},"publicationDate":"2024-07-15","publicationTypes":"Journal Article"}
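The regime-switching mechanism described in the abstract above can be illustrated with a deliberately simplified simulation. The sketch below is a univariate two-regime ARCH(1) process, not the spatio-temporal log-ARCH model of the paper; the function name and all parameter values (`omega`, `alpha`, `stay`) are hypothetical and chosen only so that the two regimes are clearly distinguishable.

```python
import math
import random


def simulate_ms_arch(n, omega=(0.1, 1.0), alpha=(0.2, 0.2), stay=(0.98, 0.95), seed=0):
    """Simulate a two-regime Markov-switching ARCH(1) series.

    omega[s], alpha[s]: ARCH intercept and coefficient in regime s.
    stay[s]: probability of remaining in regime s from one step to the next.
    Returns the regime path and the simulated returns.
    """
    rng = random.Random(seed)
    state, prev = 0, 0.0
    states, returns = [], []
    for _ in range(n):
        # Markov chain step: leave the current regime with probability 1 - stay.
        if rng.random() > stay[state]:
            state = 1 - state
        # Conditional variance depends on the regime and the last squared return.
        variance = omega[state] + alpha[state] * prev ** 2
        prev = math.sqrt(variance) * rng.gauss(0.0, 1.0)
        states.append(state)
        returns.append(prev)
    return states, returns
```

Fitting a single-regime ARCH model to such a series would have to absorb the level shifts in variance into the persistence parameter, which is the overestimation the abstract refers to.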
Following recent developments in dimension reduction algorithms for multivariate time series, we adapt the sliced inverse mean difference algorithm, previously proposed in a standard multiple regression setting, to perform dimension reduction for multivariate time series. The resulting algorithm, called time series sliced inverse mean difference (TSIMD), is shown to identify important directions and important lags using fewer significant pairs than previously proposed algorithms for dimension reduction in multivariate time series. We demonstrate the competitive performance of our algorithm through a number of experiments.
{"title":"Using sliced inverse mean difference for dimension reduction in multivariate time series","authors":"Hector Haffenden, Andreas Artemiou","doi":"10.1002/sta4.709","DOIUrl":"https://doi.org/10.1002/sta4.709","journal":{"name":"Stat"},"publicationDate":"2024-07-13","publicationTypes":"Journal Article"}
Chasz Griego, Nicky Agate, Ana‐Maria Iosif, Amy M. Crisp
Clinical and academic research continues to become more complex as our knowledge and technology advance. A substantial and growing number of specialists in biostatistics, data science and library sciences are needed to support these research systems and promote high‐calibre research. However, that support is often marginalized as optional rather than a fundamental component of research infrastructure. By building research infrastructure, an institution harnesses access to tools and support/service centres that host skilled experts who approach research with best practices in mind and domain‐specific knowledge at hand. We outline the potential roles of data scientists and statisticians in research infrastructure and recommend guidelines for advocating for the institutional resources needed to support these roles in a sustainable and efficient manner for the long‐term success of the institution. We provide these guidelines in terms of resource efficiency, monetary efficiency and long‐term sustainability. We hope this work contributes to—and provides shared language for—a conversation on a broader framework beyond metrics that can be used to advocate for needed resources.
{"title":"What is it that you say you do here? Advocating for the critical role of data scientists in research infrastructure","authors":"Chasz Griego, Nicky Agate, Ana‐Maria Iosif, Amy M. Crisp","doi":"10.1002/sta4.714","DOIUrl":"https://doi.org/10.1002/sta4.714","journal":{"name":"Stat"},"publicationDate":"2024-07-11","publicationTypes":"Journal Article"}
The original Hotelling–Solomons inequality states that an upper bound of the absolute difference between the mean and median, standardised by the standard deviation, is 1. However, in this paper, we introduce a new bound that depends on the sample size, which is strictly smaller than 1.
{"title":"A sharper bound of the Hotelling–Solomons inequality","authors":"Yuzo Maruyama","doi":"10.1002/sta4.710","DOIUrl":"https://doi.org/10.1002/sta4.710","journal":{"name":"Stat"},"publicationDate":"2024-07-09","publicationTypes":"Journal Article"}
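The classical inequality stated in the abstract above, |mean − median| ≤ standard deviation, can be checked numerically. The sketch below only verifies the original bound of 1 on simulated samples; it does not reproduce the sample-size-dependent bound from the paper, which the abstract does not state. The helper name `standardized_gap`, the exponential test data and the use of the divide-by-n standard deviation are assumptions made for illustration.

```python
import random
import statistics


def standardized_gap(data):
    """|mean - median| divided by the population (divide-by-n) standard deviation."""
    sd = statistics.pstdev(data)
    return abs(statistics.fmean(data) - statistics.median(data)) / sd


# Skewed samples of several sizes; every standardized gap should be at most 1.
rng = random.Random(1)
gaps = [
    standardized_gap([rng.expovariate(1.0) for _ in range(n)])
    for n in (3, 5, 10, 50, 200)
    for _ in range(200)
]
print(max(gaps))
```

A small hand-checkable case is the sample 0, 0, 1: the mean is 1/3, the median is 0 and the divide-by-n standard deviation is √2/3, so the standardized gap is 1/√2, below the classical bound of 1.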
This paper studies a tensor factor model that augments samples from multiple classes. The nuisance common patterns shared across classes are characterised by pervasive noises, and the patterns that distinguish different classes are represented by class‐specific components. The pervasive component is modelled as the product of a low‐rank tensor latent factor and several factor loading matrices. This augmented tensor factor model can be expanded to a series of matrix‐variate tensor factor models and estimated using principal component analysis. The ranks of the latent factors are estimated using a modified eigen‐ratio method. The proposed estimators have fast convergence rates and enjoy the blessing of dimensionality. The proposed factor model is applied, through a factor adjustment procedure, to address overlapping issues in image classification. The procedure is shown to be powerful through synthetic experiments and an application to COVID‐19 pneumonia diagnosis from frontal chest X‐ray images.
{"title":"Tensor factor adjustment for image classification with pervasive noises","authors":"Xiaochuan Li, Bingnan Li, Wenzhan Song, Yuan Ke","doi":"10.1002/sta4.705","DOIUrl":"https://doi.org/10.1002/sta4.705","journal":{"name":"Stat"},"publicationDate":"2024-06-27","publicationTypes":"Journal Article"}
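The abstract above estimates the ranks of the latent factors with a modified eigen-ratio method. The modification is not described there, so the sketch below shows only the plain eigen-ratio rule it presumably builds on: pick the position where the ratio of consecutive eigenvalues is largest. The function name and the `k_max` cap are assumptions, and the eigenvalues are taken as given rather than computed from data.

```python
def eigen_ratio_rank(eigenvalues, k_max):
    """Plain eigen-ratio rank estimate.

    Sorts the (positive) eigenvalues in decreasing order and returns the k in
    1..k_max that maximises eigenvalue_k / eigenvalue_{k+1}.
    Requires k_max < len(eigenvalues).
    """
    vals = sorted(eigenvalues, reverse=True)
    ratios = [vals[k] / vals[k + 1] for k in range(k_max)]
    best = max(range(k_max), key=ratios.__getitem__)
    return best + 1  # convert the zero-based index to a rank


# Three large eigenvalues followed by small ones suggest rank 3.
print(eigen_ratio_rank([10.0, 9.0, 8.0, 0.5, 0.4, 0.3], k_max=5))
```

The rule looks for the sharpest drop in the spectrum, which is where the factor-driven eigenvalues end and the noise eigenvalues begin.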
We propose an estimator for the population mean under the semi‐supervised learning setting with the Missing at Random (MAR) assumption. This setting assumes that the probability of observing , denoted by , depends on the total sample size and satisfies . To efficiently estimate the population mean, we introduce an adaptive estimator based on inverse probability weighting and cross‐fitting. Theoretical analysis reveals that our proposed estimator is consistent and efficient, with a convergence rate of , slower than the typical rate, due to the diminishing proportion of labelled data as the sample size increases in the semi‐supervised setting. We also prove the consistency of inverse probability weighting (IPW)–Nadaraya–Watson density function estimators. Extensive simulations and an application to the Los Angeles homeless data validate the effectiveness of our approach.
{"title":"Solving the missing at random problem in semi‐supervised learning: An inverse probability weighting method","authors":"Jin Su, Shuyi Zhang, Yong Zhou","doi":"10.1002/sta4.707","DOIUrl":"https://doi.org/10.1002/sta4.707","journal":{"name":"Stat"},"publicationDate":"2024-06-23","publicationTypes":"Journal Article"}
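The inverse probability weighting idea behind the estimator in the abstract above can be sketched in a few lines. This is a simplified illustration, not the paper's adaptive estimator: it assumes the labelling probability is known rather than estimated, uses no cross-fitting, and keeps the labelled proportion fixed instead of letting it shrink with the sample size. The function names and the simulated data-generating process are hypothetical.

```python
import math
import random


def ipw_mean(y, observed, propensity):
    """Normalised (Hajek-type) IPW estimate of the mean of y.

    Only labelled units (observed[i] is True) contribute, each weighted by
    the inverse of its probability of being labelled.
    """
    num = den = 0.0
    for yi, ri, pi in zip(y, observed, propensity):
        if ri:
            num += yi / pi
            den += 1.0 / pi
    return num / den


def simulate(n=20000, seed=0):
    """Toy MAR data: labelling depends on the covariate x only; the true mean of y is 2."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [2.0 + xi + rng.gauss(0.0, 1.0) for xi in x]
    p = [1.0 / (1.0 + math.exp(-(xi - 1.0))) for xi in x]  # P(labelled | x)
    r = [rng.random() < pi for pi in p]
    return y, r, p
```

In this toy setup units with large x are labelled more often, so the plain average of the labelled responses overshoots the true mean, while the weighted average corrects for the selective labelling.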
We develop a lagged Metropolis–Hastings walk for sampling from simple undirected graphs according to given stationary sampling probabilities. We explain how the technique can be applied together with designed graphs for sampling of units‐in‐space. Compared with existing spatial sampling methods, which chiefly focus on the spatial balance of the sample regardless of the associated outcomes of interest, the proposed graph spatial sampling method can considerably improve efficiency because the graph can be designed to take into account the anticipated spatial distribution of the outcome of interest.
{"title":"Graph spatial sampling","authors":"Li‐Chun Zhang","doi":"10.1002/sta4.708","DOIUrl":"https://doi.org/10.1002/sta4.708","journal":{"name":"Stat"},"publicationDate":"2024-06-23","publicationTypes":"Journal Article"}
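As background for the abstract above, the following sketch shows a standard (non-lagged) Metropolis–Hastings random walk on an undirected graph whose long-run visit frequencies match given target probabilities. The lagged variant developed in the paper is not described in the abstract and is not reproduced here; the function name, the adjacency-dictionary representation and the example graph are assumptions made for illustration.

```python
import random


def mh_walk(adj, target, start, n_steps, seed=0):
    """Metropolis-Hastings random walk on an undirected graph.

    adj: dict mapping each node to the list of its neighbours.
    target: dict mapping each node to its desired stationary probability.
    Returns the fraction of steps spent at each node.
    """
    rng = random.Random(seed)
    visits = {node: 0 for node in adj}
    current = start
    for _ in range(n_steps):
        # Propose a uniformly chosen neighbour of the current node.
        proposal = rng.choice(adj[current])
        # The acceptance ratio corrects for both the target and the node degrees.
        ratio = (target[proposal] * len(adj[current])) / (
            target[current] * len(adj[proposal])
        )
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return {node: count / n_steps for node, count in visits.items()}


if __name__ == "__main__":
    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    target = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
    print(mh_walk(adj, target, start=0, n_steps=100000))
```

The degree terms in the ratio make the chain satisfy detailed balance with respect to `target`, so the graph structure can be designed freely without changing the sampling probabilities.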