A sequential feature selection approach to change point detection in mean-shift change point models
Pub Date: 2024-04-03 | DOI: 10.1007/s00362-024-01548-y
Abstract
Change point detection is an important area of scientific research with applications in a wide range of fields. In this paper, we propose a sequential change point detection (SCPD) procedure for mean-shift change point models. Unlike classical feature selection based approaches, the SCPD method detects change points in decreasing order of their conditional change sizes and makes full use of the information from previously identified change points. The extended Bayesian information criterion (EBIC) is employed as the stopping rule in the SCPD procedure. We investigate the theoretical properties of the procedure and compare its performance with existing methods in the literature. It is established that the SCPD procedure has the property of detection consistency. Simulation studies and real data analyses demonstrate that the SCPD procedure has the edge over the other methods in terms of detection accuracy and robustness.
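To make the idea concrete, here is a minimal greedy sketch of such a procedure, assuming a piecewise-constant mean with additive noise. This is an illustrative reading, not the authors' algorithm: the reduction in residual sum of squares stands in for the "conditional change size", the EBIC penalty uses log C(n, k) over the n candidate locations, and the function names and tuning constants (gamma, min_seg) are ours.

```python
import numpy as np
from scipy.special import gammaln

def rss(y, cps):
    # residual sum of squares of the piecewise-constant fit defined by cps
    bounds = [0] + sorted(cps) + [len(y)]
    return sum(np.sum((y[a:b] - y[a:b].mean()) ** 2)
               for a, b in zip(bounds[:-1], bounds[1:]))

def ebic(y, cps, gamma=1.0):
    # EBIC = n log(RSS/n) + k log n + 2*gamma*log C(n, k), with C(n, k) via gammaln
    n, k = len(y), len(cps)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return n * np.log(rss(y, cps) / n) + k * np.log(n) + 2 * gamma * log_binom

def scpd(y, gamma=1.0, min_seg=5):
    cps, best_crit = [], ebic(y, [], gamma)
    while True:
        # next candidate: the point whose addition most reduces the RSS,
        # a crude proxy for the largest conditional change size
        cands = [t for t in range(min_seg, len(y) - min_seg + 1)
                 if all(abs(t - c) >= min_seg for c in cps)]
        if not cands:
            return sorted(cps)
        t_star = min(cands, key=lambda t: rss(y, cps + [t]))
        crit = ebic(y, cps + [t_star], gamma)
        if crit >= best_crit:        # EBIC stopping rule
            return sorted(cps)
        cps, best_crit = cps + [t_star], crit

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100), rng.normal(-1, 1, 100)])
print(scpd(y))                       # expect change points near 100 and 200
```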
{"title":"A sequential feature selection approach to change point detection in mean-shift change point models","authors":"","doi":"10.1007/s00362-024-01548-y","DOIUrl":"https://doi.org/10.1007/s00362-024-01548-y","url":null,"abstract":"<h3>Abstract</h3> <p>Change point detection is an important area of scientific research and has applications in a wide range of fields. In this paper, we propose a sequential change point detection (SCPD) procedure for mean-shift change point models. Unlike classical feature selection based approaches, the SCPD method detects change points in the order of the conditional change sizes and makes full use of the identified change points information. The extended Bayesian information criterion (EBIC) is employed as the stopping rule in the SCPD procedure. We investigate the theoretical property of the procedure and compare its performance with other methods existing in the literature. It is established that the SCPD procedure has the property of detection consistency. Simulation studies and real data analyses demonstrate that the SCPD procedure has the edge over the other methods in terms of detection accuracy and robustness.</p>","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"33 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hypothesis testing for varying coefficient models in tail index regression
Pub Date: 2024-04-02 | DOI: 10.1007/s00362-024-01538-0
Koki Momoki, Takuma Yoshida
This study examines the varying coefficient model in tail index regression. The varying coefficient model is an efficient semiparametric model that avoids the curse of dimensionality when a large number of covariates is included in the model. Indeed, the varying coefficient model is useful in mean regression, quantile regression, and other regressions, and tail index regression is no exception. Although the varying coefficient model is flexible, leaner and simpler models are preferred in applications. Therefore, it is important to evaluate whether the estimated coefficient function varies significantly with the covariates. If the effect of the non-linearity of the model is weak, the varying coefficient structure can be reduced to a simpler model, such as a constant or zero. Accordingly, hypothesis tests for model assessment in the varying coefficient model have been discussed in mean and quantile regression, but there are no corresponding results in tail index regression. In this study, we investigate the asymptotic properties of an estimator and provide a hypothesis testing method for varying coefficient models in tail index regression.
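The paper's test is asymptotic; as a loose, self-contained illustration of what "the tail index varies with a covariate" means in practice, the toy sketch below contrasts a kernel-localized Hill estimate with the global one and calibrates the difference by permutation. The uniform window, the Hill estimator, and the permutation null are all our simplifications, not the method developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def hill(y, k):
    # classical Hill estimator of the tail index from the k largest order statistics
    ys = np.sort(y)[::-1]
    return np.mean(np.log(ys[:k] / ys[k]))

def local_hill(y, z, z0, h=0.1, k_frac=0.2):
    # crude uniform-kernel localization: Hill estimate using only points with |z - z0| < h
    ys = y[np.abs(z - z0) < h]
    return hill(ys, max(2, int(k_frac * len(ys))))

# simulate Pareto data whose tail index truly varies with the covariate z
n = 2000
z = rng.uniform(0, 1, n)
gam = 0.5 + 0.5 * z                       # true tail index gamma(z)
y = rng.pareto(1.0 / gam) + 1.0           # Pareto with shape 1/gamma(z), support [1, inf)

grid = np.linspace(0.1, 0.9, 9)
k_glob = int(0.2 * n)
stat = max(abs(local_hill(y, z, z0) - hill(y, k_glob)) for z0 in grid)

# permutation null: shuffling z destroys any covariate effect on the tail
null = []
for _ in range(200):
    zp = rng.permutation(z)
    null.append(max(abs(local_hill(y, zp, z0) - hill(y, k_glob)) for z0 in grid))
print("permutation p-value:", np.mean(np.asarray(null) >= stat))
```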
{"title":"Hypothesis testing for varying coefficient models in tail index regression","authors":"Koki Momoki, Takuma Yoshida","doi":"10.1007/s00362-024-01538-0","DOIUrl":"https://doi.org/10.1007/s00362-024-01538-0","url":null,"abstract":"<p>This study examines the varying coefficient model in tail index regression. The varying coefficient model is an efficient semiparametric model that avoids the curse of dimensionality when including large covariates in the model. In fact, the varying coefficient model is useful in mean, quantile, and other regressions. The tail index regression is not an exception. However, the varying coefficient model is flexible, but leaner and simpler models are preferred for applications. Therefore, it is important to evaluate whether the estimated coefficient function varies significantly with covariates. If the effect of the non-linearity of the model is weak, the varying coefficient structure is reduced to a simpler model, such as a constant or zero. Accordingly, the hypothesis test for model assessment in the varying coefficient model has been discussed in mean and quantile regression. However, there are no results in tail index regression. In this study, we investigate the asymptotic properties of an estimator and provide a hypothesis testing method for varying coefficient models for tail index regression.</p>","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"41 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140573805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minimum contrast for the first-order intensity estimation of spatial and spatio-temporal point processes
Pub Date: 2024-03-26 | DOI: 10.1007/s00362-024-01541-5
Nicoletta D’Angelo, Giada Adelfio
In this paper, we harness a result in point process theory, namely the expectation of the weighted K-function, where the weighting is done by the true first-order intensity function. This theoretical result can be employed as an estimation method to derive parameter estimates for a particular model assumed for the data. The underlying motivation is to avoid the difficulties associated with complex likelihoods in point process models and their maximization. The exploited result makes our method theoretically applicable to any model specification. In this paper, we restrict our study to Poisson models, whose likelihood forms the basis of many more complex point process models. In this context, our proposed method can estimate the vector of local parameters corresponding to the points of the analyzed point pattern without introducing any additional complexity compared to global estimation. We illustrate the method through simulation studies for both purely spatial and spatio-temporal point processes, and we show more complex scenarios based on the Poisson model through the analysis of two real datasets concerning environmental problems.
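A hedged sketch of the minimum-contrast idea for a purely spatial Poisson process follows: under the model, the K-function weighted by the true intensity has expectation πr², so parameters can be estimated by making the weighted empirical K-function match πr² as closely as possible. The parametric intensity exp(a + bx), the thinning simulation, and the omission of edge corrections are our choices for illustration, not the paper's exact setup.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)

# simulate an inhomogeneous Poisson pattern on [0,1]^2 with intensity exp(a + b*x) by thinning
a_true, b_true = 4.0, 2.0
lam_max = np.exp(a_true + b_true)
cand = rng.uniform(0, 1, (rng.poisson(lam_max), 2))
pts = cand[rng.uniform(0, lam_max, len(cand)) < np.exp(a_true + b_true * cand[:, 0])]

def contrast(b, rs=np.linspace(0.01, 0.1, 10)):
    # profile out a by matching the expected count:
    # integral of exp(a + b x) over the unit square = exp(a) (exp(b) - 1) / b = n
    a = np.log(len(pts) * b / (np.exp(b) - 1.0))
    lam = np.exp(a + b * pts[:, 0])
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    w = 1.0 / np.outer(lam, lam)
    total = 0.0
    for r in rs:
        k_hat = np.sum(w * (d < r)) - np.trace(w)   # weighted K over i != j pairs, no edge correction
        total += (k_hat - np.pi * r ** 2) ** 2      # E K_w(r) = pi r^2 under the assumed model
    return total

res = minimize_scalar(contrast, bounds=(0.1, 5.0), method="bounded")
print("estimated b:", res.x, "true b:", b_true)
```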
The resampling method via representative points
Pub Date: 2024-03-18 | DOI: 10.1007/s00362-024-01536-2
Long-Hao Xu, Yinan Li, Kai-Tai Fang
The bootstrap method relies on resampling from the empirical distribution to provide inferences about a population with distribution F, the empirical distribution serving as an approximation to the population. It is possible, however, to resample from another approximating distribution of F to conduct simulation-based inferences. In this paper, we utilize representative points to form an alternative approximating distribution of F for resampling. Representative points, in the sense of minimum mean squared error from F, have been widely applied to numerical integration, simulation, and problems of grouping, quantization, and classification. The method of resampling via representative points can be used to estimate the sampling distribution of a statistic of interest. A basic theory for the proposed method is established. We prove the convergence of higher-order moments of the new approximating distribution of F, and we establish the consistency of the sampling distribution approximation for the sample mean and sample variance under the Kolmogorov metric and the Mallows–Wasserstein metric. Numerical studies show that the proposed resampling method improves on the nonparametric bootstrap in terms of confidence intervals for the mean and variance.
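A minimal sketch of the scheme, under our own tuning choices: the MSE representative points of a univariate sample are its k-means centroids, each weighted by the share of the sample it attracts, and the bootstrap then resamples from this discrete distribution instead of the empirical one.

```python
import numpy as np
from scipy.cluster.vq import kmeans   # Lloyd-type quantizer for MSE representative points

rng = np.random.default_rng(3)
x = rng.normal(10.0, 2.0, 200)        # observed sample

# representative points: 1-D k-means centroids, weighted by the share of the
# sample each point attracts
reps, _ = kmeans(x.reshape(-1, 1).astype(float), 20)
reps = np.sort(reps.ravel())
assign = np.argmin(np.abs(x[:, None] - reps[None, :]), axis=1)
weights = np.bincount(assign, minlength=reps.size) / len(x)

# resample from the representative-point distribution instead of the
# empirical distribution, as in the usual bootstrap
B, n = 2000, len(x)
boot_means = np.array([rng.choice(reps, size=n, p=weights).mean() for _ in range(B)])
print("95% CI for the mean:", np.percentile(boot_means, [2.5, 97.5]))
```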
An heuristic scree plot criterion for the number of factors
Pub Date: 2024-03-18 | DOI: 10.1007/s00362-023-01517-x
Abstract
Cattell’s (Multivar Behav Res 1:245–276, 1966) heuristic determines the number of factors as the elbow point between ‘steep’ and ‘not steep’ in the scree plot. In contrast, an elbow is by definition absent from points lying on a hyperbola, whose corresponding surfaces are equisized. We formalize this heuristic and propose a criterion that determines the number of factors by comparing surfaces under the scree plot. Monte Carlo simulations show that the finite-sample properties of our proposed criterion outperform benchmarks in the dynamic factor model literature.
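The precise surface-comparison criterion is defined in the paper; the sketch below is only one plausible formalization, supplied for intuition. Since points on the hyperbola λ_j = c/j have equal rectangle areas j·λ_j, a drop in the surface sequence s_j = j·λ_j signals an elbow, and we pick the largest relative drop.

```python
import numpy as np

def scree_surface_factors(eigvals):
    # points on the hyperbola lambda_j = c / j have equal rectangle areas j * lambda_j,
    # so a genuine elbow appears as a drop in the surface sequence s_j = j * lambda_j;
    # choose the number of factors at the largest relative drop
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    s = np.arange(1, len(lam) + 1) * lam
    return int(np.argmax(s[:-1] / s[1:]) + 1)

# toy check: three dominant eigenvalues over a slowly decaying noise floor
print(scree_surface_factors([5.0, 3.0, 2.0, 0.4, 0.35, 0.3, 0.25, 0.2]))  # -> 3
```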
A semi-orthogonal nonnegative matrix tri-factorization algorithm for overlapping community detection
Pub Date: 2024-03-14 | DOI: 10.1007/s00362-024-01537-1
Zhaoyang Li, Yuehan Yang
In this paper, we focus on overlapping community detection and propose an efficient semi-orthogonal nonnegative matrix tri-factorization (semi-ONMTF) algorithm. This method factorizes a matrix X into the product of an orthogonal matrix U, a nonnegative matrix B, and the transpose U^T. We use the Cayley transformation to maintain strict orthogonality of U, so that each iterate stays on the Stiefel manifold. The algorithm is computationally efficient because the solutions for U and B reduce to matrix-wise update rules. Applying this method, we detect overlapping communities via the belonging coefficient vector and analyse associations between communities via the unweighted network of communities. We conduct simulations and applications to show that the proposed method has wide applicability. In a real data example, we apply the semi-ONMTF to a stock data set and construct a directed association network of companies. Based on the modularity for directed and overlapping communities, we obtain five overlapping communities, 17 overlapping nodes, and five outlier nodes in the network. We also discuss the associations between communities, providing insights into overlapping community detection on the stock market network.
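The Cayley update that keeps an iterate exactly orthogonal is standard and easy to demonstrate; the sketch below shows one such step under our notation (G plays the role of a descent direction), without reproducing the paper's full semi-ONMTF iteration.

```python
import numpy as np

def cayley_step(U, G, tau=0.1):
    # with A = G U^T - U G^T (skew-symmetric), the Cayley transform
    # U_new = (I - tau/2 A)^{-1} (I + tau/2 A) U keeps U^T U = I exactly
    n = U.shape[0]
    A = G @ U.T - U @ G.T
    I = np.eye(n)
    return np.linalg.solve(I - 0.5 * tau * A, (I + 0.5 * tau * A) @ U)

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # a point on the Stiefel manifold
G = rng.normal(size=(6, 3))                    # e.g. a gradient direction
U1 = cayley_step(U, G)
print(np.allclose(U1.T @ U1, np.eye(3)))       # True up to rounding
```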
Statistical simulations with LR random fuzzy numbers
Pub Date: 2024-03-08 | DOI: 10.1007/s00362-024-01533-5
Abbas Parchami, Przemyslaw Grzegorzewski, Maciej Romaniuk
Computer simulations are a powerful tool in many fields of research. This also applies to the broadly understood analysis of experimental data, which are frequently burdened with multiple imperfections. Often the underlying imprecision or vagueness can be suitably described in terms of fuzzy numbers, which also enable the capture of subjectivity. On the other hand, due to the random nature of experimental data, the tools for their description must take into account their statistical nature. In this way, we arrive at random fuzzy numbers, which model fuzzy data and are also solidly formalized within the probabilistic setting. In this contribution, we introduce so-called LR random fuzzy numbers that can be used in various Monte Carlo simulations on fuzzy data. The proposed method of generating fuzzy numbers, whose membership functions are given by probability densities, is both simple and rich, well-grounded mathematically, and has high application potential.
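As a small illustration of the generator's shape, here is a sketch that draws LR fuzzy numbers with random centers and spreads. The paper derives membership functions from probability densities; the triangular shape functions and the Gaussian/gamma parameter distributions below are our stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)

def random_lr_fuzzy(size):
    # draw (center m, left spread alpha, right spread beta) triples at random
    m = rng.normal(0.0, 1.0, size)        # random centers
    alpha = rng.gamma(2.0, 0.5, size)     # random left spreads (> 0)
    beta = rng.gamma(2.0, 0.5, size)      # random right spreads (> 0)
    return list(zip(m, alpha, beta))

def membership(x, lr_number):
    # triangular shape functions L(u) = R(u) = max(0, 1 - u)
    m, alpha, beta = lr_number
    u = (m - x) / alpha if x <= m else (x - m) / beta
    return max(0.0, 1.0 - u)

sample = random_lr_fuzzy(3)
print(sample[0], membership(0.0, sample[0]))
```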
Minimax weight learning for absorbing MDPs
Pub Date: 2024-03-06 | DOI: 10.1007/s00362-023-01491-4
Fengying Li, Yuqiang Li, Xianyi Wu
Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon Markov decision processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) that directly estimates the expected return via the importance ratio of the state-action occupancy measure. A mean square error (MSE) bound for the MWLA method is provided, and the dependence of statistical errors on the data size and the truncation level is analyzed. The performance of the algorithm is illustrated by means of computational experiments in an episodic taxi environment.
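MWLA itself learns the occupancy-measure importance ratio through a minimax program that we do not reproduce here; the snippet below only shows, under our own notation, how a given ratio w(s, a) turns truncated episodes into an estimate of the undiscounted expected return.

```python
import numpy as np

def mis_return_estimate(episodes, w):
    # reweight each observed reward by the state-action occupancy ratio w(s, a)
    # and average the reweighted episode totals (undiscounted return)
    totals = [sum(w(s, a) * r for (s, a, r) in ep) for ep in episodes]
    return float(np.mean(totals))

# toy usage: two truncated episodes of (state, action, reward) triples,
# with a placeholder ratio that is identically 1
episodes = [[(0, 1, 1.0), (1, 0, 0.5)], [(0, 0, 2.0)]]
print(mis_return_estimate(episodes, lambda s, a: 1.0))
```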
{"title":"Minimax weight learning for absorbing MDPs","authors":"Fengying Li, Yuqiang Li, Xianyi Wu","doi":"10.1007/s00362-023-01491-4","DOIUrl":"https://doi.org/10.1007/s00362-023-01491-4","url":null,"abstract":"<p>Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon Markov Decision Processes (MDPs). In this paper, we study undiscounted off-policy evaluation for absorbing MDPs. Given the dataset consisting of i.i.d episodes under a given truncation level, we propose an algorithm (referred to as MWLA in the text) to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound of the MWLA method is provided and the dependence of statistical errors on the data size and the truncation level are analyzed. The performance of the algorithm is illustrated by means of computational experiments under an episodic taxi environment</p>","PeriodicalId":51166,"journal":{"name":"Statistical Papers","volume":"43 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140045510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Homogeneity tests and interval estimations of risk differences for stratified bilateral and unilateral correlated data
Pub Date: 2024-03-04 | DOI: 10.1007/s00362-024-01532-6
Shuyi Liang, Kai-Tai Fang, Xin-Wei Huang, Yijing Xin, Chang-Xing Ma
In clinical trials studying paired parts of a subject with binary outcomes, measurements are expected to be collected bilaterally. However, there are cases where subjects contribute measurements for only one part. By utilizing the combined data, it is possible to gain additional information compared to using bilateral or unilateral data alone. With the combined data, this article investigates homogeneity tests of risk differences in the presence of stratification effects and proposes interval estimations of a common risk difference when stratification does not introduce underlying dissimilarities. Under Dallal’s model (Biometrics 44:253–257, 1988), we propose three test statistics and evaluate their performance in terms of type I error control and power. Confidence intervals of a common risk difference with satisfactory coverage probabilities and interval lengths are constructed. Our simulation results show that the score test is the most robust and that the profile likelihood confidence interval outperforms the other proposed methods. Data from a study of acute otitis media are used to illustrate our proposed procedures.
Welch’s t test is more sensitive to real world violations of distributional assumptions than Student’s t test but logistic regression is more robust than either
Pub Date: 2024-03-04 | DOI: 10.1007/s00362-024-01531-7
David Curtis
It has previously been pointed out that Student’s t test, which assumes that samples are drawn from populations with equal standard deviations, can have an inflated Type I error rate if this assumption is violated. Hence it has been recommended that Welch’s t test should be preferred. In the context of carrying out gene-wise weighted burden tests for detecting association of rare variants with psoriasis we observe that Welch’s test performs unsatisfactorily. We show that if the assumption of normality is violated and observations follow a Poisson distribution, then with unequal sample sizes Welch’s t test has an inflated Type I error rate, is systematically biased and is prone to produce extremely low p values. We argue that such data can arise in a variety of real world situations and believe that researchers should be aware of this issue. Student’s t test performs much better in this scenario but a likelihood ratio test based on logistic regression models performs better still and we suggest that this might generally be a preferable method to test for a difference in distributions between two samples.
This research has been conducted using the UK Biobank Resource.
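The claimed behavior is easy to probe by simulation. The sketch below, with sample sizes and replication counts of our own choosing, draws both groups from the same Poisson(1) distribution with very unequal sample sizes and compares the empirical type I error of Student's t test, Welch's t test, and a likelihood ratio test from a logistic regression of group membership on the outcome.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(6)

def lrt_logistic(x, y):
    # likelihood ratio test from a logistic regression of group membership on the value
    vals = np.concatenate([x, y])
    grp = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    full = sm.Logit(grp, sm.add_constant(vals)).fit(disp=0)
    null = sm.Logit(grp, np.ones((len(grp), 1))).fit(disp=0)
    return stats.chi2.sf(2 * (full.llf - null.llf), df=1)

# H0 true: both groups are Poisson(1); sample sizes are very unequal
n1, n2, B = 1000, 30, 500
rej = {"student": 0, "welch": 0, "logistic": 0}
for _ in range(B):
    x = rng.poisson(1.0, n1).astype(float)
    y = rng.poisson(1.0, n2).astype(float)
    rej["student"] += stats.ttest_ind(x, y, equal_var=True).pvalue < 0.05
    rej["welch"] += stats.ttest_ind(x, y, equal_var=False).pvalue < 0.05
    rej["logistic"] += lrt_logistic(x, y) < 0.05
print({k: v / B for k, v in rej.items()})   # empirical type I error rates
```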