
Asta-Advances in Statistical Analysis: Latest Articles

Mixture of experts distributional regression: implementation using robust estimation with adaptive first-order methods
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-15 DOI: 10.1007/s10182-023-00486-8
David Rügamer, Florian Pfisterer, Bernd Bischl, Bettina Grün

In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.

Vol. 108, no. 2, pp. 351-373. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00486-8.pdf
Citations: 0
A Bayesian approach to modeling topic-metadata relationships
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-03 DOI: 10.1007/s10182-023-00485-9
Patrick Schulze, Simon Wiegrebe, Paul W. Thurner, Christian Heumann, Matthias Aßenmacher

The objective of advanced topic modeling is not only to explore latent topical structures, but also to estimate relationships between the discovered topics and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is instead itself estimated in an unsupervised fashion, usually by common topic models. A frequently used procedure to achieve this is the method of composition, a Monte Carlo sampling technique performing multiple repeated linear regressions of sampled topic proportions on metadata covariates. In this paper, we propose two modifications of this approach: First, we substantially refine the existing implementation of the method of composition from the R package stm by replacing linear regression with the more appropriate Beta regression. Second, we provide a fundamental enhancement of the entire estimation framework by replacing the current blend of frequentist and Bayesian methods with a fully Bayesian approach. This allows for a more appropriate quantification of uncertainty. We illustrate our improved methodology by investigating relationships between Twitter posts by German parliamentarians and different metadata covariates related to their electoral districts, using the structural topic model to estimate topic proportions.
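
The method of composition described in this abstract can be illustrated with a minimal sketch of the linear-regression variant (the procedure the authors set out to improve, not their Beta-regression or fully Bayesian version). Everything concrete here is an assumption made for illustration: the helper names `ols_slope`, `method_of_composition` and `draw_proportions` are hypothetical, a single covariate is used, and the "posterior" of the topic proportions is a synthetic stand-in rather than output from a topic model.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope and its standard error for the least-squares fit of y on x."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    return slope, (rss / (n - 2) / sxx) ** 0.5


def method_of_composition(x, draw_proportions, n_draws, rng):
    """Pool coefficient draws over repeated regressions on sampled proportions."""
    draws = []
    for _ in range(n_draws):
        theta = draw_proportions(rng)        # 1. sample topic proportions
        slope, se = ols_slope(x, theta)      # 2. regress them on the covariate
        draws.append(rng.gauss(slope, se))   # 3. sample a coefficient value
    return draws


# Synthetic stand-in for a topic model's posterior (hypothetical numbers):
# the proportion of one topic rises linearly with the covariate, slope 0.3.
rng = random.Random(1)
x = [i / 199 for i in range(200)]
point_estimate = [0.2 + 0.3 * xi for xi in x]


def draw_proportions(rng):
    """One noisy draw of the topic proportions, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0, 0.03))) for p in point_estimate]


draws = method_of_composition(x, draw_proportions, 200, rng)
pooled_mean = statistics.fmean(draws)
ordered = sorted(draws)
interval = (ordered[5], ordered[194])  # roughly the central 95% of the draws
```

The pooled draws carry both sources of uncertainty: variation across sampled proportions and the sampling variance of each regression. Because proportions live in (0, 1), a linear model is a questionable fit for them, which is the motivation the abstract gives for switching to Beta regression.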

Vol. 108, no. 2, pp. 333-349. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00485-9.pdf
Citations: 0
GPS data on tourists: a spatial analysis on road networks
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-11-03 DOI: 10.1007/s10182-023-00484-w
Nicoletta D’Angelo, Antonino Abbruzzo, Mauro Ferrante, Giada Adelfio, Marcello Chiodi

This paper proposes a spatial point process model on a linear network to analyse cruise passengers’ stop activities. It identifies and models tourists’ stop intensity at the destination as a function of their main determinants. For this purpose, we consider data collected on cruise passengers through the integration of traditional questionnaire-based survey methods and GPS tracking data in two cities, namely Palermo (Italy) and Dubrovnik (Croatia). Firstly, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is applied to identify stop locations from GPS tracking data. The influence of individual-related variables and itinerary-related characteristics is considered within a framework of a Gibbs point process model. The proposed model describes spatial stop intensity at the destination, accounting for the geometry of the underlying road network, individual-related variables, contextual-level information, and the spatial interaction amongst stop points. The analysis succeeds in quantifying the influence of both individual-related variables and trip-related characteristics on stop intensity. An interaction parameter allows for measuring the degree of dependence amongst cruise passengers in stop location decisions.
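
The stop-detection step mentioned in this abstract relies on density-based spatial clustering of applications with noise (DBSCAN). A minimal pure-Python sketch of the textbook algorithm follows; it is not the authors' pipeline. The coordinates are hypothetical planar positions in metres, the `eps` and `min_pts` values are arbitrary, and real GPS data would usually also need a time criterion and a proper geographic distance.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabelled as a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep growing
    return labels


# Hypothetical track: two places where fixes pile up, one fix in transit.
track = [(0, 0), (1, 0), (0, 1), (1, 1),
         (25, 25),
         (50, 50), (51, 50), (50, 51), (51, 51)]
stops = dbscan(track, eps=2.0, min_pts=3)
```

In this toy track the two dense groups of fixes become clusters 0 and 1, and the isolated fix at (25, 25) is labelled as noise, which is the behaviour that makes the algorithm suitable for separating stops from movement.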

Vol. 108, no. 3, pp. 477-499. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00484-w.pdf
Citations: 0
Conditional sum of squares estimation of k-factor GARMA models
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-31 DOI: 10.1007/s10182-023-00482-y
Paul M. Beaumont, Aaron D. Smallwood

We analyze issues related to estimation and inference for the constrained sum of squares estimator (CSS) of the k-factor Gegenbauer autoregressive moving average (GARMA) model. We present theoretical results for the estimator and show that the parameters that determine the cycle lengths are asymptotically independent, converging at rate T, the sample size, for finite cycles. The remaining parameters lack independence and converge at the standard rate. Analogous with existing literature, some challenges exist for testing the hypothesis of non-cyclical long memory, since the associated parameter lies on the boundary of the parameter space. We present simulation results to explore small sample properties of the estimator, which support most distributional results, while also highlighting areas that merit additional exploration. We demonstrate the applicability of the theory and estimator with an application to IBM trading volume.

Vol. 108, no. 3, pp. 501-543.
Citations: 0
Measures of interrater agreement for quantitative data
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-10 DOI: 10.1007/s10182-023-00483-x
Daniela Marella, Giuseppe Bove

In this paper, measures of interrater absolute agreement for quantitative measurements based on the standard deviation are proposed. Such indices make it possible (i) to overcome the limits affecting the intraclass correlation index and (ii) to measure the interrater agreement on single targets. Estimators of the proposed measures are introduced and their sampling properties are investigated for normal and non-normal data. Simulated data are employed to demonstrate the accuracy and practical utility of the new indices for assessing agreement. Finally, an application to assess the consistency of measurements performed by radiologists evaluating tumor size of lung cancer is presented.

Vol. 108, no. 4, pp. 801-821. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00483-x.pdf
Citations: 0
Calibrated imputation for multivariate categorical data
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-05 DOI: 10.1007/s10182-023-00481-z
Ton de Waal, Jacco Daalmans

Non-response is a major problem for anyone collecting and processing data. A commonly used technique to deal with missing data is imputation, where missing values are estimated and filled into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit.

Vol. 108, no. 3, pp. 545-576. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00481-z.pdf
Citations: 0
Debiasing SHAP scores in random forests
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-08-22 DOI: 10.1007/s10182-023-00479-7
Markus Loecher

Black box machine learning models are currently being used for high-stakes decision making in various parts of society such as healthcare and criminal justice. While tree-based ensemble methods such as random forests typically outperform deep learning models on tabular data sets, their built-in variable importance algorithms are known to be strongly biased toward high-entropy features. It was recently shown that the increasingly popular SHAP (SHapley Additive exPlanations) values suffer from a similar bias. We propose debiased or "shrunk" SHAP scores based on sample splitting which additionally enable the detection of overfitting issues at the feature level.

Vol. 108, no. 2, pp. 427-440. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00479-7.pdf
Citations: 0
A family of consistent normally distributed tests for Poissonity
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-06-15 DOI: 10.1007/s10182-023-00478-8
Antonio Di Noia, Marzia Marcheselli, Caterina Pisani, Luca Pratelli

A family of consistent tests, derived from a characterization of the probability generating function, is proposed for assessing Poissonity against a wide class of count distributions, which includes some of the most frequently adopted alternatives to the Poisson distribution. Actually, the family of test statistics is based on the difference between the plug-in estimator of the Poisson cumulative distribution function and the empirical cumulative distribution function. The test statistics have an intuitive and simple form and are asymptotically normally distributed, allowing a straightforward implementation of the test. The finite sample properties of the test are investigated by means of an extensive simulation study. The test shows satisfactory behaviour compared to other tests with known limit distribution.
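
The building block named in this abstract, the difference between the plug-in estimate of the Poisson cumulative distribution function and the empirical one, can be computed in a few lines. This sketch shows only that raw difference at a single point `k`; it is not the authors' test statistic, whose exact form, weighting and normalisation are not given in the abstract. The function names and the sample are hypothetical.

```python
from math import exp


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def ecdf(sample, k):
    """Share of observations that are less than or equal to k."""
    return sum(1 for x in sample if x <= k) / len(sample)


def cdf_gap(sample, k):
    """Plug-in Poisson CDF (rate = sample mean) minus the empirical CDF at k."""
    lam_hat = sum(sample) / len(sample)
    return poisson_cdf(k, lam_hat) - ecdf(sample, k)


counts = [0, 1, 1, 2, 3]   # hypothetical count data, sample mean 1.4
gap_at_1 = cdf_gap(counts, 1)
```

For data that really are Poisson, this gap should be close to zero at every `k` once the sample is large, which is the intuition behind using it to detect departures from Poissonity.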

Vol. 108, no. 1, pp. 209-223. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00478-8.pdf
Citations: 0
Correlation-type goodness-of-fit tests based on independence characterizations
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-05-04 DOI: 10.1007/s10182-023-00475-x
Katarina Halaj, Bojana Milošević, Marko Obradović, M. Dolores Jiménez-Gamero

This paper uses independence-type characterizations to propose a class of test statistics which can be used for testing goodness-of-fit with several classes of null distributions. The resulting tests are consistent against fixed alternatives. Some limiting and small sample properties of the test statistics are explored. In comparison with common universal goodness-of-fit tests, the new tests exhibit better power for most of the alternatives considered, while in comparison with another characterization-based procedure, the new tests provide competitive or comparable power in various simulation settings. The handiness of the proposed tests is demonstrated through several real-data examples.

Vol. 108, no. 1, pp. 185-207.
Citations: 0
Conditional feature importance for mixed data
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-04-29 DOI: 10.1007/s10182-023-00477-9
Kristin Blesch, David S. Watson, Marvin N. Wright

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates, i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are oftentimes neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs (that is, generating synthetic data with similar statistical properties) for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
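
The distinction between marginal and conditional importance drawn in this abstract can be made concrete with a small simulation. The sketch below is not the CPI and uses no knockoffs: it substitutes plain and partial correlation as simple linear stand-ins for a marginal and a conditional measure. The data-generating process and all names are assumptions chosen for illustration: `x2` is a noisy copy of `x1`, and the outcome `y` depends on `x1` only.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(y, x):
    """Residuals of the least-squares fit of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return [yi - my - slope * (xi - mx) for xi, yi in zip(x, y)]


rng = random.Random(0)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.5 * rng.gauss(0, 1) for v in x1]   # correlated with x1, no own effect
y = [v + 0.5 * rng.gauss(0, 1) for v in x1]    # depends on x1 only

# Marginal view: x2 looks important because it is a proxy for x1.
marginal = pearson(x2, y)

# Conditional view: remove what x1 explains from both, then correlate.
conditional = pearson(residuals(x2, x1), residuals(y, x1))
```

Under this setup the marginal correlation is large (its population value is 0.8) while the partial correlation is near zero, so a marginal measure would credit `x2` with importance it does not have once `x1` is accounted for.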

尽管特征重要性(FI)测量方法在可解释机器学习中很受欢迎,但很少有人讨论这些方法的统计充分性。从统计学的角度来看,一个主要的区别在于分析变量在调整协变量之前和之后的重要性,即边际测量和条件测量之间的区别。我们的研究提请人们注意这一鲜为人知但却至关重要的区别,并展示其影响。我们发现,目前可用来测试条件 FI 的方法很少,而且由于数据要求不匹配,从业人员在方法应用方面一直受到严重限制。现实世界中的大多数数据都表现出复杂的特征依赖性,同时包含连续和分类特征(即混合数据)。条件 FI 方法往往忽略了这两种特性。为了填补这一空白,我们建议将条件预测影响(CPI)框架与连续山寨抽样相结合。条件预测影响(CPI)通过对有效的山寨产品进行采样,从而生成与待分析数据具有相似统计属性的合成数据,从而实现条件预测影响测量,并控制任何特征依赖性。我们特意设计了连续山寨数据来处理混合数据,因此可以将 CPI 方法扩展到此类数据集。我们通过大量模拟和一个真实世界的例子证明,我们提出的工作流程可以控制 I 型误差,实现高功率,并且与其他条件 FI 指标给出的结果一致,而边际 FI 指标可能会导致误导性解释。我们的研究结果凸显了为混合数据开发统计充分的专门方法的必要性。
Kristin Blesch, David S. Watson, Marvin N. Wright. Conditional feature importance for mixed data. Asta-Advances in Statistical Analysis 108(2), 259-278. IF 1.4. Pub Date: 2023-04-29. DOI: 10.1007/s10182-023-00477-9. Open access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00477-9.pdf
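The marginal versus conditional distinction described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation and does not use sequential knockoffs: as a loudly stated assumption, it replaces the knockoff sampler with a simple linear-Gaussian conditional resampler, uses continuous features only, and approximates the one-sided t-test p-value with a normal distribution. All function names (`fit_ols`, `conditional_draw`, `cpi_t_stat`, `run_demo`) are hypothetical. The simulated feature `x2` is correlated with `x1` but has no effect on `y` once `x1` is known.

```python
import math
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Linear prediction from coefficients produced by fit_ols."""
    return coef[0] + X @ coef[1:]


def conditional_draw(X_train, X_test, j, rng):
    """Resample feature j given the other features (linear-Gaussian model).

    Assumption: a simplified stand-in for a knockoff sampler,
    NOT the sequential knockoff procedure named in the abstract.
    """
    others = [k for k in range(X_train.shape[1]) if k != j]
    coef = fit_ols(X_train[:, others], X_train[:, j])
    resid_sd = np.std(X_train[:, j] - predict(coef, X_train[:, others]))
    mean = predict(coef, X_test[:, others])
    return mean + resid_sd * rng.standard_normal(len(X_test))


def cpi_t_stat(coef, X_test, y_test, X_train, j, rng):
    """CPI-style test: per-sample loss increase when feature j is replaced."""
    X_tilde = X_test.copy()
    X_tilde[:, j] = conditional_draw(X_train, X_test, j, rng)
    loss_orig = (y_test - predict(coef, X_test)) ** 2
    loss_tilde = (y_test - predict(coef, X_tilde)) ** 2
    delta = loss_tilde - loss_orig
    t = delta.mean() / (delta.std(ddof=1) / math.sqrt(len(delta)))
    p = 0.5 * math.erfc(t / math.sqrt(2))  # one-sided, normal approximation
    return t, p


def run_demo(seed=0, n=2000, rho=0.8):
    """Simulate y = 2*x1 + noise with x2 correlated to x1 but irrelevant."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + math.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = 2.0 * x1 + rng.standard_normal(n)
    X = np.column_stack([x1, x2])
    half = n // 2
    X_tr, X_te, y_tr, y_te = X[:half], X[half:], y[:half], y[half:]
    coef = fit_ols(X_tr, y_tr)
    # Marginal measure: absolute correlation of each feature with y.
    marginal = [abs(np.corrcoef(X_te[:, j], y_te)[0, 1]) for j in range(2)]
    # Conditional measure: CPI-style t statistic and p-value per feature.
    conditional = [cpi_t_stat(coef, X_te, y_te, X_tr, j, rng) for j in range(2)]
    return {"marginal_abs_corr": marginal, "cpi_t_and_p": conditional}


if __name__ == "__main__":
    result = run_demo()
    print("marginal |corr| with y:", result["marginal_abs_corr"])
    print("CPI-style (t, p) per feature:", result["cpi_t_and_p"])
```

In this setup the marginal measure should flag both features as associated with `y`, while the conditional test should give a large statistic only for `x1`. Handling categorical features would require a sampler designed for mixed data, which this sketch deliberately omits.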
Citations: 0