
Asta-Advances in Statistical Analysis: Latest Articles

Calibrated imputation for multivariate categorical data
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-10-05 DOI: 10.1007/s10182-023-00481-z
Ton de Waal, Jacco Daalmans

Non-response is a major problem for anyone collecting and processing data. A commonly used technique for dealing with missing data is imputation, where missing values are estimated and filled into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputing a certain category for a variable in a certain unit yields the correct value for that variable and unit.

Asta-Advances in Statistical Analysis, vol. 108, no. 3, pp. 545-576. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00481-z.pdf
Citations: 0
Debiasing SHAP scores in random forests
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-08-22 DOI: 10.1007/s10182-023-00479-7
Markus Loecher

Black box machine learning models are currently being used for high-stakes decision making in various parts of society such as healthcare and criminal justice. While tree-based ensemble methods such as random forests typically outperform deep learning models on tabular data sets, their built-in variable importance algorithms are known to be strongly biased toward high-entropy features. It was recently shown that the increasingly popular SHAP (SHapley Additive exPlanations) values suffer from a similar bias. We propose debiased or "shrunk" SHAP scores based on sample splitting which additionally enable the detection of overfitting issues at the feature level.

Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 427-440. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00479-7.pdf
Citations: 0
A family of consistent normally distributed tests for Poissonity
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-06-15 DOI: 10.1007/s10182-023-00478-8
Antonio Di Noia, Marzia Marcheselli, Caterina Pisani, Luca Pratelli

A family of consistent tests, derived from a characterization of the probability generating function, is proposed for assessing Poissonity against a wide class of count distributions, which includes some of the most frequently adopted alternatives to the Poisson distribution. Specifically, the family of test statistics is based on the difference between the plug-in estimator of the Poisson cumulative distribution function and the empirical cumulative distribution function. The test statistics have an intuitive and simple form and are asymptotically normally distributed, allowing a straightforward implementation of the test. The finite sample properties of the test are investigated by means of an extensive simulation study. The test shows satisfactory behaviour compared to other tests with known limit distribution.
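
The basic ingredient named in the abstract, the gap between a plug-in Poisson CDF and the empirical CDF, can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the plug-in estimator is assumed to use the sample mean as the Poisson rate, and the code does not reproduce the paper's actual test statistic, its standardisation, or its limiting distribution.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation of the pmf."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k + 1))


def empirical_cdf(sample, k):
    """Share of observations in `sample` that are <= k."""
    return sum(1 for x in sample if x <= k) / len(sample)


def cdf_gap(sample, k):
    """Plug-in Poisson CDF minus empirical CDF at the point k.

    Assumption: the Poisson rate is estimated by the sample mean.
    """
    lam_hat = sum(sample) / len(sample)
    return poisson_cdf(k, lam_hat) - empirical_cdf(sample, k)


# A sample with no zeros but mean 1: a Poisson(1) law would put
# exp(-1) of its mass at zero, so the gap at k = 0 is positive.
gap_at_zero = cdf_gap([1, 1, 1, 1], 0)
```

A gap close to zero at every k is what one would expect under Poissonity; how such gaps are combined and scaled into a normally distributed statistic is specific to the paper and not shown here.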

Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 209-223. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00478-8.pdf
Citations: 0
Correlation-type goodness-of-fit tests based on independence characterizations
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-05-04 DOI: 10.1007/s10182-023-00475-x
Katarina Halaj, Bojana Milošević, Marko Obradović, M. Dolores Jiménez-Gamero

This paper uses independence-type characterizations to propose a class of test statistics that can be used for testing goodness-of-fit with several classes of null distributions. The resulting tests are consistent against fixed alternatives. Some limiting and small sample properties of the test statistics are explored. In comparison with common universal goodness-of-fit tests, the new tests exhibit better power for most of the alternatives considered, while in comparison with another characterization-based procedure, the new tests provide competitive or comparable power in various simulation settings. The practical usefulness of the proposed tests is demonstrated through several real-data examples.

Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 185-207. Journal Article.
Citations: 0
Conditional feature importance for mixed data
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-04-29 DOI: 10.1007/s10182-023-00477-9
Kristin Blesch, David S. Watson, Marvin N. Wright

Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable's importance before and after adjusting for covariates, i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI, and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibit complex feature dependencies and incorporate both continuous and categorical features (i.e., mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs (that is, synthetic data with statistical properties similar to those of the data to be analysed). Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
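
As a rough illustration of the CPI idea, the sketch below assumes that a fitted model has already been scored twice on the same test samples, once with the original feature and once with that feature replaced by its knockoff. It then summarises the per-sample loss increase and forms a paired t-statistic. The function name and inputs are hypothetical, knockoff generation itself is not shown, and the choice of a t-test is an assumption made for this sketch, not a statement of the authors' exact procedure.

```python
import math


def cpi_t_statistic(loss_original, loss_knockoff):
    """Mean loss increase and paired t-statistic for one feature.

    loss_original[i]: loss on sample i with the real feature values.
    loss_knockoff[i]: loss on sample i with the feature swapped for its knockoff.
    Returns (cpi_estimate, t_statistic); t is NaN when all differences are equal.
    """
    if len(loss_original) != len(loss_knockoff):
        raise ValueError("loss vectors must have the same length")
    deltas = [k - o for o, k in zip(loss_original, loss_knockoff)]
    n = len(deltas)
    if n < 2:
        raise ValueError("at least two samples are needed")
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    t_stat = mean / se if se > 0 else math.nan
    return mean, t_stat


# Hypothetical per-sample losses for a single feature.
cpi, t_stat = cpi_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 5.0, 5.0])
```

A large positive t-statistic suggests that the feature carries predictive information beyond what the remaining features already provide, which is the conditional notion of importance the abstract contrasts with marginal measures.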

Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 259-278. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00477-9.pdf
Citations: 0
Clustering of extreme values: estimation and application
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-03-31 DOI: 10.1007/s10182-023-00474-y
Marta Ferreira

Extreme value theory (EVT) encompasses a set of methods for inferring the risk inherent in various phenomena across the economic, financial, actuarial, environmental, hydrological and climatic sciences, as well as various areas of engineering. In many situations, the clustering of high values may affect the risk that extreme phenomena occur. Examples include extreme temperatures that persist over time and result in drought, persistent intense rain that leads to floods, and stock markets that fall repeatedly with consequent catastrophic losses. The extremal index is an EVT measure of the degree of clustering of extreme values. In many situations, and under certain conditions, it corresponds to the reciprocal of the mean size of high-value clusters. Estimating the extremal index generally entails two sources of uncertainty: the level above which observations are considered high, and the identification of clusters. The literature offers several contributions on estimating the extremal index, including methodologies to overcome these sources of uncertainty. In this work we revisit several existing estimators, apply automatic choice methods for both the threshold and the clustering parameter, and compare the performance of the methods. We end with an application to meteorological data.
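
To make the "reciprocal of the mean cluster size" interpretation concrete, here is a minimal sketch of the classical runs estimator. It is shown purely as an example of the general idea and is not claimed to be one of the estimators the paper compares. The two arguments `threshold` and `run_length` correspond to the two sources of uncertainty named in the abstract; both are fixed by hand here, and the function name is hypothetical.

```python
def runs_extremal_index(series, threshold, run_length):
    """Runs estimator: number of clusters divided by number of exceedances.

    A new cluster starts at an exceedance that follows at least
    `run_length` consecutive non-exceedances (or the start of the series).
    """
    exceed = [x > threshold for x in series]
    n_exceed = sum(exceed)
    if n_exceed == 0:
        raise ValueError("no observations exceed the threshold")
    clusters = 0
    gap = run_length  # treat the series start as preceded by a long quiet spell
    for is_high in exceed:
        if is_high:
            if gap >= run_length:
                clusters += 1
            gap = 0
        else:
            gap += 1
    return clusters / n_exceed


# Five exceedances of the level 1 grouped into three clusters.
series = [0, 5, 6, 0, 0, 7, 0, 8, 0, 0, 0, 9]
theta_hat = runs_extremal_index(series, threshold=1, run_length=2)
```

Here the estimate is 3/5, so the mean cluster size is about 1.67 exceedances. A value of 1 means every exceedance forms its own cluster, while smaller values indicate stronger clustering of extremes.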

Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 101-125. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10064624/pdf/
Citations: 0
A spatial semiparametric M-quantile regression for hedonic price modelling
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-03-30 DOI: 10.1007/s10182-023-00476-w
Francesco Schirripa Spagnolo, Riccardo Borgoni, Antonella Carcagnì, Alessandra Michelangeli, Nicola Salvati

This paper proposes an M-quantile regression approach to address the heterogeneity of the housing market in a modern European city. We show how M-quantile modelling is a rich and flexible tool for empirical market price data analysis, allowing us to obtain a robust estimation of the hedonic price function whilst accounting for different sources of heterogeneity in market prices. The suggested methodology can generally be used to analyse nonlinear interactions between prices and predictors. In particular, we develop a spatial semiparametric M-quantile model to capture both the potential nonlinear effects of the cultural environment on pricing and spatial trends. In both cases, nonlinearity is introduced into the model using appropriate basis functions. We show how the implicit price associated with the variable that measures cultural amenities can be determined in this semiparametric framework. Our findings show that the effect of several housing attributes and urban amenities differs significantly across the response distribution, suggesting that buyers of lower-priced properties behave differently than buyers of higher-priced properties.

Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 159-183. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00476-w.pdf
Citations: 0
Robust estimation of fixed effect parameters and variances of linear mixed models: the minimum density power divergence approach
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-03-29 DOI: 10.1007/s10182-023-00473-z
Giovanni Saraceno, Abhik Ghosh, Ayanendranath Basu, Claudio Agostinelli

Many real-life data sets can be analyzed using linear mixed models (LMMs). Since these are ordinarily based on normality assumptions, under small deviations from the model the inference can be highly unstable when the associated parameters are estimated by classical methods. On the other hand, the density power divergence (DPD) family, which measures the discrepancy between two probability density functions, has been successfully used to build robust estimators that combine high stability with minimal loss in efficiency. Here, we develop the minimum DPD estimator (MDPDE) for independent but non-identically distributed observations for LMMs according to the variance components model. We prove that the theoretical properties hold, including consistency and asymptotic normality of the estimators. The influence function and sensitivity measures are computed to explore the robustness properties. As a data-based choice of the MDPDE tuning parameter $\alpha$ is very important, we propose two candidates as “optimal” choices, where optimality is in the sense of choosing the strongest downweighting that is necessary for the particular data set. We conduct a simulation study comparing the proposed MDPDE, for different values of $\alpha$, with S-estimators, M-estimators and the classical maximum likelihood estimator, considering different levels of contamination. Finally, we illustrate the performance of our proposal on a real-data example.
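
For orientation, the density power divergence between a data density $g$ and a model density $f$ is usually written in the standard form below. The abstract itself does not state the formula, so the notation ($g$, $f$, $d_\alpha$) is an assumption of this note and may differ from the paper's, which works with independent but non-identically distributed observations.

```latex
% Density power divergence with tuning parameter \alpha > 0
d_\alpha(g, f) = \int \left\{ f^{1+\alpha}(x)
    - \left(1 + \frac{1}{\alpha}\right) g(x)\, f^{\alpha}(x)
    + \frac{1}{\alpha}\, g^{1+\alpha}(x) \right\} \mathrm{d}x ,
\qquad \alpha > 0 .

% Limiting case \alpha \to 0: the Kullback-Leibler divergence
d_0(g, f) = \int g(x) \log \frac{g(x)}{f(x)} \, \mathrm{d}x .
```

The limiting case $\alpha \to 0$ corresponds to maximum likelihood, while larger values of $\alpha$ downweight observations that are unlikely under the model more strongly. This is the trade-off between efficiency and robustness that a data-based choice of $\alpha$ is meant to balance.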

Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 127-157. Journal Article. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00473-z.pdf
Citations: 0
Lasso-based variable selection methods in text regression: the case of short texts
IF 1.4 Zone 4 (Mathematics) Q2 STATISTICS & PROBABILITY Pub Date: 2023-03-20 DOI: 10.1007/s10182-023-00472-0
Marzia Freo, Alessandra Luati

Communication through websites is often characterised by short texts made of only a few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, which are widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences when they are used as explanatory variables in regression models. To this end, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts: in one case a concise descriptive text is considered, whereas in the other the text expresses an opinion.

Marzia Freo, Alessandra Luati. Lasso-based variable selection methods in text regression: the case of short texts. Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 69-99. Published 2023-03-20. DOI: 10.1007/s10182-023-00472-0. PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00472-0.pdf
Citations: 0
Correction: Bayesian ridge regression for survival data based on a vine copula-based prior
IF 1.4 · Zone 4, Mathematics · Q2 STATISTICS & PROBABILITY · Pub Date: 2023-02-14 · DOI: 10.1007/s10182-023-00470-2
Hirofumi Michimae, Takeshi Emura
Asta-Advances in Statistical Analysis, vol. 108, no. 3, p. 703.
Citations: 0