
Statistical Papers: Latest Articles

Handling skewness and directional tails in model-based clustering.
IF 1.2 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-01-01 Epub Date: 2025-07-04 DOI: 10.1007/s00362-025-01723-9
Cristina Tortora, Antonio Punzo, Brian C Franczak

Model-based clustering is a powerful approach used in data analysis to unveil underlying patterns or groups within a data set. However, when applied to clusters that exhibit skewness, heavy tails, or both, the classification of data points becomes more challenging. In this study, we introduce two models that apply two component-wise transformations of the observed data within a mixture of multiple scaled contaminated normal (MSCN) distributions. MSCN distributions are designed to allow a different tail behavior in each dimension and directional outlier detection in the direction of the principal components. Using the transformed MSCN distributions as components of a mixture, we obtain model-based clustering techniques that allow for 1) flexible cluster shapes in terms of skewness and kurtosis and 2) component-wise and directional outlier detection. We assess the efficacy of the proposed techniques by comparing them, on simulated and real data sets, with model-based clustering methods that perform global or component-wise outlier detection. This comparative analysis aims to identify the practical clustering scenarios in which the proposed MSCN-based approaches are advantageous.

Statistical Papers 66(5):114. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226708/pdf/
Citations: 0
Maximum likelihood estimation under the Emax model: existence, geometry and efficiency.
IF 1.2 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-01-01 Epub Date: 2025-06-10 DOI: 10.1007/s00362-025-01673-2
Giacomo Aletti, Nancy Flournoy, Caterina May, Chiara Tommasi

This study focuses on the estimation of the Emax dose-response model, a widely utilized framework in clinical trials, experiments in pharmacology, agriculture, environmental science, and more. Existing challenges in obtaining maximum likelihood estimates (MLE) for model parameters are often ascribed to computational issues but, in reality, stem from the absence of an MLE. Our contribution provides new understanding and control of all the experimental situations that practitioners might face, guiding them in the estimation process. We derive the exact MLE for a three-point experimental design and identify the two scenarios where the MLE fails to exist. To address these challenges, we propose utilizing Firth's modified score, which we express analytically as a function of the experimental design. Through a simulation study, we demonstrate that the Firth modification yields a finite estimate in one of the problematic scenarios. For the remaining case, we introduce a design-augmentation strategy akin to a hypothesis test.

Statistical Papers 66(5):106. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12152072/pdf/
Citations: 0
Local linear smoothing for regression surfaces on the simplex using Dirichlet kernels.
IF 1.2 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2025-01-01 Epub Date: 2025-05-14 DOI: 10.1007/s00362-025-01708-8
Christian Genest, Frédéric Ouimet

This paper introduces a local linear smoother for regression surfaces on the simplex. The estimator solves a least-squares regression problem weighted by a locally adaptive Dirichlet kernel, ensuring good boundary properties. Asymptotic results for the bias, variance, mean squared error, and mean integrated squared error are derived, generalizing the univariate results of Chen (Ann Inst Stat Math, 54(2):312-323, 2002). A simulation study shows that the proposed local linear estimator with Dirichlet kernel outperforms its only direct competitor in the literature, the Nadaraya-Watson estimator with Dirichlet kernel due to Bouzebda et al. (AIMS Math 9(9):26195-26282, 2024).
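The estimator described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the commonly used Dirichlet kernel whose parameters at a target point s are s/b + 1 (for a bandwidth b), so that the weight of an observation x is proportional to the product of x_j^(s_j/b) over all coordinates; the normalising constant is the same for every observation and cancels in the weighted least-squares fit. The function names and the noise-free test surface are hypothetical.

```python
import math
import random


def dirichlet_kernel_weights(s, points, b):
    """Unnormalised Dirichlet(s/b + 1) density evaluated at each observation."""
    log_w = [sum((sj / b) * math.log(xj) for sj, xj in zip(s, x)) for x in points]
    top = max(log_w)  # subtract the maximum for numerical stability
    return [math.exp(lw - top) for lw in log_w]


def solve(a, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def local_linear(s, points, ys, b):
    """Weighted least squares of y on (1, x - s); the intercept estimates m(s)."""
    w = dirichlet_kernel_weights(s, points, b)
    d = len(s) - 1  # the last coordinate is redundant on the simplex
    design = [[1.0] + [x[j] - s[j] for j in range(d)] for x in points]
    p = d + 1
    gram = [[sum(wi * z[r] * z[c] for wi, z in zip(w, design)) for c in range(p)]
            for r in range(p)]
    moment = [sum(wi * z[r] * y for wi, z, y in zip(w, design, ys)) for r in range(p)]
    return solve(gram, moment)[0]


# Hypothetical data: uniform points on the 2-simplex and a noise-free linear surface.
rng = random.Random(0)
points = []
for _ in range(200):
    e = [rng.expovariate(1.0) for _ in range(3)]
    points.append([v / sum(e) for v in e])
ys = [1.0 + 2.0 * x[0] - x[1] for x in points]
estimate = local_linear([0.2, 0.3, 0.5], points, ys, b=0.1)
print(estimate)
```

Because a local linear fit reproduces linear surfaces exactly, the estimate at s = (0.2, 0.3, 0.5) should agree with 1 + 2(0.2) - 0.3 up to rounding error, which makes this a convenient sanity check before adding noise.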

Statistical Papers 66(4):97. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12078451/pdf/
Citations: 0
The distribution of power-related random variables (and their use in clinical trials)
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-19 DOI: 10.1007/s00362-024-01599-1
Francesco Mariani, Fulvio De Santis, Stefania Gubbiotti

In the hybrid Bayesian-frequentist approach to hypothesis tests, the power function, i.e. the probability of rejecting the null hypothesis, is a random variable and a pre-experimental evaluation of the study is commonly carried out through the so-called probability of success (PoS). PoS is usually defined as the expected value of the random power, which is not necessarily a representative summary of the entire distribution. Here, we consider the main definitions of PoS and investigate the power-related random variables that induce them. We provide general expressions for their cumulative distribution and probability density functions, as well as closed-form expressions when the test statistic is, at least asymptotically, normal. The analysis of such distributions highlights discrepancies in the main definitions of PoS, leading us to prefer the one based on the utility function of the test. We illustrate our idea through an example and an application to clinical trials, which is a framework where PoS is commonly employed.
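To make the idea of a random power concrete, the sketch below simulates it in one simple setting. Everything specific here is an assumption for illustration and is not taken from the paper: a one-sided z-test of H0: theta <= 0 with known standard deviation, a normal design prior on theta, and the hypothetical function names. PoS is computed as the mean of the simulated power values and compared with the closed form that follows from the identity E[Phi(a + bZ)] = Phi(a / sqrt(1 + b^2)) for standard normal Z.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: cdf and quantile function


def power(theta, n, sigma=1.0, alpha=0.025):
    """Power of the one-sided z-test of H0: theta <= 0 when the true effect is theta."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(theta * math.sqrt(n) / sigma - z_crit)


def random_power(n, prior_mean, prior_sd, draws=50_000, seed=1):
    """Draw theta from the assumed normal design prior; return the induced power values."""
    rng = random.Random(seed)
    return [power(rng.gauss(prior_mean, prior_sd), n) for _ in range(draws)]


def pos_closed_form(n, prior_mean, prior_sd, sigma=1.0, alpha=0.025):
    """Expected power under the normal prior via E[Phi(a + bZ)] = Phi(a / sqrt(1 + b^2))."""
    a = prior_mean * math.sqrt(n) / sigma - STD_NORMAL.inv_cdf(1.0 - alpha)
    b = prior_sd * math.sqrt(n) / sigma
    return STD_NORMAL.cdf(a / math.sqrt(1.0 + b * b))


powers = random_power(n=50, prior_mean=0.3, prior_sd=0.2)
pos_mc = fmean(powers)  # PoS as the mean of the random power
pos_exact = pos_closed_form(50, 0.3, 0.2)
share_below_half = sum(p < 0.5 for p in powers) / len(powers)
print(f"PoS (Monte Carlo) = {pos_mc:.3f}, PoS (closed form) = {pos_exact:.3f}")
print(f"share of prior draws with power below 0.5 = {share_below_half:.3f}")
```

Inspecting the whole list `powers` (for example with a histogram or its quantiles) rather than only its mean is the point of the exercise: with a diffuse prior, the power values can pile up near 0 and 1, so the mean alone may describe few of the plausible scenarios.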

Citations: 0
The cost of sequential adaptation and the lower bound for mean squared error
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-17 DOI: 10.1007/s00362-024-01565-x
Sergey Tarima, Nancy Flournoy

Informative interim adaptations lead to random sample sizes. The random sample size becomes a component of the sufficient statistic, and estimation based solely on observed samples or on the likelihood function does not use all available statistical evidence. The total Fisher Information (FI) is decomposed into the design FI and a conditional-on-design FI. The FI unspent by a design’s informative interim adaptation decomposes further into a weighted linear combination of FIs conditional-on-stopping decisions. Then, these components are used to determine a new lower bound on the mean squared error (MSE) in post-adaptation estimation because the Cramer–Rao lower bound (1945, 1946) and its sequential version suggested by Wolfowitz (Ann Math Stat 18(2):215–230, 1947) for non-informative stopping are not applicable to post-informative-adaptation estimation. We also show that the newly proposed lower bound on the MSE is reached by the maximum likelihood estimators in designs with informative adaptations when the data come from a one-parameter exponential family. Theoretical results are illustrated with simple normal samples collected according to a two-stage design with a possibility of early stopping.
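The first sentence of the abstract can be illustrated by simulation. The sketch below is only a toy version of a two-stage design with early stopping: the stopping rule (stop when the stage-1 mean exceeds a cutoff), the stage sizes, the cutoff, and the function name are all assumptions chosen for illustration, and the code does not compute the Fisher information decomposition or the lower bound from the paper. It shows that the sample size is random and that the naive sample mean is biased once stopping depends on the data.

```python
import random
from statistics import fmean


def two_stage_trial(theta, n1, n2, cutoff, rng):
    """One trial with N(theta, 1) data: stop after stage 1 if its mean exceeds cutoff."""
    data = [rng.gauss(theta, 1.0) for _ in range(n1)]
    if fmean(data) <= cutoff:
        # The interim result was not large enough, so stage 2 is collected.
        data += [rng.gauss(theta, 1.0) for _ in range(n2)]
    return len(data), fmean(data)


rng = random.Random(0)
true_theta = 0.0
trials = [two_stage_trial(true_theta, n1=10, n2=10, cutoff=0.3, rng=rng)
          for _ in range(20_000)]
sizes = [n for n, _ in trials]
bias = fmean(m for _, m in trials) - true_theta
print(f"distinct sample sizes: {sorted(set(sizes))}")
print(f"average sample size: {fmean(sizes):.2f}")
print(f"bias of the naive sample mean: {bias:.4f}")
```

Under this rule the bias is positive: trials that stop early keep a large stage-1 mean, while trials that continue dilute a small one with fresh data.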

Citations: 0
Nested strong orthogonal arrays
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-16 DOI: 10.1007/s00362-024-01609-2
Chunwei Zheng, Wenlong Li, Jian-Feng Yang

Nested space-filling designs are popular for conducting multiple computer experiments with different levels of accuracy. Strong orthogonal arrays (SOAs) are a special type of space-filling design which possesses attractive low-dimensional stratifications. Combining these two kinds of designs, we propose a new type of design called a nested strong orthogonal array. Such a design is a special nested space-filling design that consists of two layers, i.e., the large SOA and the small SOA, which have different strengths, with the small one nested in the large one. The proposed construction method is easy to use and capable of accommodating a larger number of columns, and the resulting designs possess better stratifications in two dimensions than the existing nested space-filling designs. The construction method is based on regular second-order saturated designs and nonregular designs. Some comparisons with the existing nested space-filling designs are given to show the usefulness of the proposed designs.

Citations: 0
Tests for time-varying coefficient spatial autoregressive panel data model with fixed effects
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-14 DOI: 10.1007/s00362-024-01607-4
Lingling Tian, Yunan Su, Chuanhua Wei

As an extension of the spatial autoregressive panel data model and the time-varying coefficient panel data model, the time-varying coefficient spatial autoregressive panel data model is useful in the analysis of spatial panel data. While research has addressed the estimation problem of this model, less attention has been given to hypothesis tests. This paper studies two tests for this semiparametric spatial panel data model. One considers the existence of the spatial lag term, and the other determines whether some time-varying coefficients are constants. We employ the profile generalized likelihood ratio test procedure to construct the corresponding test statistic, and the residual-based bootstrap procedure is used to derive the p-value of the tests. Some simulations are conducted to evaluate the performance of the proposed test methods; the results show that the proposed methods have good finite-sample properties. Finally, we apply the proposed test methods to the provincial carbon emission data of China. Our findings suggest that the partially linear time-varying coefficients spatial autoregressive panel data model provides a better fit for the carbon emission data.

Citations: 0
On the consistency of supervised learning with missing values
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-12 DOI: 10.1007/s00362-024-01550-4
Julie Josse, Jacob M. Chen, Nicolas Prost, Gaël Varoquaux, Erwan Scornet

In many application settings, data have missing entries, which makes subsequent analyses challenging. An abundant literature addresses missing values in an inferential framework, aiming at estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and test data. We first rewrite classic missing-values results for this setting. We then show the consistency of two approaches, test-time multiple imputation and single imputation in prediction. A striking result is that the widely used method of imputing with a constant prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is frowned upon as it distorts the distribution of the data. The consistency of such a popular simple approach is important in practice. Finally, to contrast procedures based on imputation prior to learning with procedures that optimize the missing-value handling for prediction, we consider decision trees. Indeed, decision trees are among the few methods that can tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After empirically comparing different missing-value strategies in trees, we recommend using the “missing incorporated in attribute” method as it can handle both non-informative and informative missing values.
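The “missing incorporated in attribute” (MIA) idea can be sketched for a single numeric feature. The code below is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: missing values are encoded as `None`, the split criterion is the sum of squared errors, and the function names are hypothetical. For every threshold it tries sending the missing values to the left child and to the right child, and it also tries the split that separates missing from observed values.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_mia_split(xs, ys):
    """Return (loss, threshold, missing_side) for the best MIA split on one feature.

    threshold None denotes the split 'observed values left, missing values right'.
    """
    thresholds = sorted({x for x in xs if x is not None})[:-1]
    candidates = [(t, side) for t in thresholds for side in ("left", "right")]
    candidates.append((None, "right"))  # missing versus observed
    best = None
    for threshold, missing_side in candidates:
        left, right = [], []
        for x, y in zip(xs, ys):
            if x is None:
                goes_left = missing_side == "left"
            else:
                goes_left = threshold is None or x <= threshold
            (left if goes_left else right).append(y)
        loss = sse(left) + sse(right)
        if best is None or loss < best[0]:
            best = (loss, threshold, missing_side)
    return best


# Informative missingness: the target is 1 exactly when the feature is missing.
informative = best_mia_split([1.0, 2.0, 3.0, 4.0, None, None], [0, 0, 0, 0, 1, 1])
# Missingness carrying no extra signal: the missing row behaves like the low values.
plain = best_mia_split([1.0, 2.0, None, 8.0, 9.0], [0, 0, 0, 1, 1])
print(informative, plain)
```

In the first call the best split isolates the missing values, and in the second the missing row is routed with the observations it resembles, which is how one mechanism can serve both informative and non-informative missingness.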

Citations: 0
Maximum likelihood estimation for left-truncated log-logistic distributions with a given truncation point
IF 1.3 Zone 3 Mathematics Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-10 DOI: 10.1007/s00362-024-01603-8
Markus Kreer, Ayşe Kızılersü, Jake Guscott, Lukas Christopher Schmitz, Anthony W. Thomas

For a sample \(X_1, X_2, \ldots, X_N\) of independent identically distributed copies of a log-logistically distributed random variable \(X\) the maximum likelihood estimation is analysed in detail if a left-truncation point \(x_L>0\) is introduced. Due to scaling properties it is sufficient to investigate the case \(x_L=1\). Here the corresponding maximum likelihood equations for a normalised sample (i.e. a sample divided by \(x_L\)) do not always possess a solution. A simple criterion guarantees the existence of a solution: Let \(\mathbb{E}(\cdot)\) denote the expectation induced by the normalised sample and denote by \(\beta_0=\mathbb{E}(\ln X)^{-1}\) the inverse value of expectation of the logarithm of the sampled random variable \(X\) (which is greater than \(x_L=1\)). If this value \(\beta_0\) is bigger than a certain positive number \(\beta_C\) then a solution of the maximum likelihood equation exists. Here the number \(\beta_C\) is the unique solution of a moment equation, \(\mathbb{E}(X^{-\beta_C})=\frac{1}{2}\). In the case of existence a profile likelihood function can be constructed and the optimisation problem is reduced to one dimension leading to a robust numerical algorithm. When the maximum likelihood equations do not admit a solution for certain data samples, it is shown that the Pareto distribution is the \(L^1\)-limit of the degenerated left-truncated log-logistic distribution, where \(L^1(\mathbb{R}^+)\) is the usual Banach space of functions whose absolute value is Lebesgue-integrable. A large sample analysis showing consistency and asymptotic normality complements our analysis. Finally, two applications to real world data are presented.
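The existence criterion stated in this abstract is easy to check numerically. The sketch below reads the expectation as the sample average over the normalised sample, computes beta_0 as the reciprocal of the mean log, and finds beta_C by bisection, using the fact that the sample mean of x^(-beta) decreases from 1 towards 0 as beta grows when every x exceeds 1. The function names and the example data are hypothetical, and the code checks only the sufficient condition quoted above; it does not fit the distribution.

```python
import math


def beta_0(sample):
    """Reciprocal of the sample mean of ln(x) for a normalised sample (all x > 1)."""
    return 1.0 / (sum(math.log(x) for x in sample) / len(sample))


def beta_c(sample, iterations=200):
    """Solve mean(x ** -beta) = 1/2 for beta by bisection."""
    def gap(beta):
        return sum(x ** -beta for x in sample) / len(sample) - 0.5

    lo, hi = 0.0, 1.0
    while gap(hi) > 0.0:  # enlarge the bracket until the mean drops below 1/2
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def existence_criterion_holds(sample, x_l=1.0):
    """Check beta_0 > beta_C after dividing the sample by the truncation point x_l."""
    normalised = [x / x_l for x in sample]
    if min(normalised) <= 1.0:
        raise ValueError("every observation must exceed the truncation point")
    return beta_0(normalised) > beta_c(normalised)


moderate = [1.5, 2.0, 3.0, 4.0]        # hypothetical sample, truncation point 1
spread = [1.01, 1.01, 1.01, 1000.0]    # hypothetical sample with one extreme value
print(existence_criterion_holds(moderate), existence_criterion_holds(spread))
```

When the check fails, the abstract's result about the Pareto limit is the relevant case; the sketch itself only reports that the stated condition is not met.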

Citations: 0
Confidence bounds for compound Poisson process
IF 1.3 Region 3 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2024-09-05 DOI: 10.1007/s00362-024-01604-7
Marek Skarupski, Qinhao Wu

The compound Poisson process (CPP) is a common mathematical model for describing many phenomena in medicine, reliability theory and risk theory. However, in the case of low-frequency phenomena, we are often unable to collect a sufficiently large database to conduct analysis. In this article, we focused on methods for determining confidence intervals for the rate of the CPP when the sample size is small. Based on the properties of process parameter estimators, we proposed a new method for constructing such intervals and compared it with other known approaches. In numerical simulations, we used synthetic data from several continuous and discrete distributions. The case of CPP, in which rewards came from exponential distribution, was discussed separately. The recommendation of how to use each method to have a more precise confidence interval is given. All simulations were performed in R version 4.2.1.
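To make the small-sample setting concrete, the sketch below simulates a CPP with exponentially distributed rewards and computes the classical exact (Garwood-type) confidence interval for the intensity of the underlying Poisson process. This is a standard textbook interval, not the new method proposed in the paper, and it assumes that "rate" means the event intensity, so the reward values play no role in the interval. The paper's simulations were run in R; Python's standard library is used here, and all function names are hypothetical.

```python
import math
import random


def simulate_cpp(rate, horizon, reward_mean, rng):
    """Return (event count, total reward) of a CPP observed on [0, horizon]."""
    t, n, total = 0.0, 0, 0.0
    while True:
        t += rng.expovariate(rate)                   # exponential waiting time
        if t > horizon:
            return n, total
        n += 1
        total += rng.expovariate(1.0 / reward_mean)  # exponential reward


def poisson_cdf(k, mu):
    """P(N <= k) for N ~ Poisson(mu); intended for the small counts used here."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    acc = term
    for i in range(1, k + 1):
        term *= mu / i
        acc += term
    return min(acc, 1.0)


def _bisect_decreasing(f, lo, hi, iters=200):
    """Root of a decreasing function f with f(lo) > 0 >= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_rate_interval(n, horizon, alpha=0.05):
    """Exact two-sided interval for the intensity, given n events in `horizon`."""
    hi = max(1.0, 2.0 * n)
    while poisson_cdf(n, hi) > alpha / 2:            # bracket the upper bound
        hi *= 2.0
    # Upper bound: P(N <= n; mu) = alpha / 2.
    mu_upper = _bisect_decreasing(
        lambda mu: poisson_cdf(n, mu) - alpha / 2, 0.0, hi)
    if n == 0:
        mu_lower = 0.0
    else:
        # Lower bound: P(N >= n; mu) = alpha / 2.
        mu_lower = _bisect_decreasing(
            lambda mu: poisson_cdf(n - 1, mu) - (1 - alpha / 2), 0.0, hi)
    return mu_lower / horizon, mu_upper / horizon


if __name__ == "__main__":
    rng = random.Random(1)
    n, total = simulate_cpp(rate=0.5, horizon=10.0, reward_mean=2.0, rng=rng)
    print(n, total, exact_rate_interval(n, horizon=10.0))
```

With few observed events the interval is wide and asymmetric around the point estimate `n / horizon`, which is the difficulty the abstract describes for low-frequency phenomena.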

Citations: 0