Pub Date : 2023-11-15 DOI: 10.1007/s10182-023-00486-8
David Rügamer, Florian Pfisterer, Bernd Bischl, Bettina Grün
In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.
Title: Mixture of experts distributional regression: implementation using robust estimation with adaptive first-order methods. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 351-373. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00486-8.pdf
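To make the model class in this abstract concrete, the following is a minimal sketch of the per-observation negative log-likelihood that a first-order optimizer would minimize. It assumes linear-Gaussian experts and softmax mixing weights driven by a single covariate; these choices, and every function and parameter name, are illustrative assumptions and not the mixdistreg API (which is an R package built on TensorFlow).

```python
import math


def log_normal_pdf(y, mean, sd):
    """Log-density of N(mean, sd^2) evaluated at y."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2


def log_softmax(logits):
    """Numerically stable log of the softmax of a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]


def mixture_nll(y, x, gate_coefs, expert_coefs, expert_sds):
    """Negative log-likelihood of one observation (x, y).

    Hypothetical parameterization:
      gate_coefs[k]   = (intercept, slope) of the mixing-weight logit of expert k
      expert_coefs[k] = (intercept, slope) of the mean of expert k
      expert_sds[k]   = standard deviation of expert k
    """
    log_weights = log_softmax([a + b * x for a, b in gate_coefs])
    terms = [
        lw + log_normal_pdf(y, a + b * x, sd)
        for lw, (a, b), sd in zip(log_weights, expert_coefs, expert_sds)
    ]
    # Log-sum-exp over experts keeps the mixture density stable in log space.
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))
```

Averaging `mixture_nll` over mini-batches gives the kind of objective that stochastic optimizers with adaptive learning rates operate on; the optimizer itself is not shown here.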
Pub Date : 2023-11-03 DOI: 10.1007/s10182-023-00485-9
Patrick Schulze, Simon Wiegrebe, Paul W. Thurner, Christian Heumann, Matthias Aßenmacher
The objective of advanced topic modeling is not only to explore latent topical structures, but also to estimate relationships between the discovered topics and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is instead itself estimated in an unsupervised fashion, usually by common topic models. A frequently used procedure to achieve this is the method of composition, a Monte Carlo sampling technique that performs repeated linear regressions of sampled topic proportions on metadata covariates. In this paper, we propose two modifications of this approach. First, we substantially refine the existing implementation of the method of composition from the R package stm by replacing linear regression with the more appropriate Beta regression. Second, we provide a fundamental enhancement of the entire estimation framework by replacing the current blend of frequentist and Bayesian methods with a fully Bayesian approach. This allows for a more appropriate quantification of uncertainty. We illustrate our improved methodology by investigating relationships between Twitter posts by German parliamentarians and different metadata covariates related to their electoral districts, using the structural topic model to estimate topic proportions.
Title: A Bayesian approach to modeling topic-metadata relationships. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 333-349. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00485-9.pdf
Pub Date : 2023-11-03 DOI: 10.1007/s10182-023-00484-w
Nicoletta D’Angelo, Antonino Abbruzzo, Mauro Ferrante, Giada Adelfio, Marcello Chiodi
This paper proposes a spatial point process model on a linear network to analyse cruise passengers’ stop activities. It identifies and models tourists’ stop intensity at the destination as a function of their main determinants. For this purpose, we consider data collected on cruise passengers through the integration of traditional questionnaire-based survey methods and GPS tracking data in two cities, namely Palermo (Italy) and Dubrovnik (Croatia). Firstly, the density-based spatial clustering of applications with noise algorithm is applied to identify stop locations from GPS tracking data. The influence of individual-related variables and itinerary-related characteristics is considered within a framework of a Gibbs point process model. The proposed model describes spatial stop intensity at the destination, accounting for the geometry of the underlying road network, individual-related variables, contextual-level information, and the spatial interaction amongst stop points. The analysis succeeds in quantifying the influence of both individual-related variables and trip-related characteristics on stop intensity. An interaction parameter allows for measuring the degree of dependence amongst cruise passengers in stop location decisions.
Title: GPS data on tourists: a spatial analysis on road networks. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 3, pp. 477-499. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00484-w.pdf
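The stop-detection step named in this abstract (density-based spatial clustering of applications with noise, DBSCAN) can be sketched in a few lines. The version below is a plain textbook DBSCAN on planar coordinates; the assumption that GPS fixes have already been projected to metres, the parameter values, and the helper `stop_centroids` are illustrative and not taken from the paper.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each 2-D point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (point i included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
        cluster += 1
    return labels


def stop_centroids(points, labels):
    """Mean coordinate of each cluster, keyed by cluster id (noise is skipped)."""
    sums = {}
    for (x, y), lab in zip(points, labels):
        if lab == NOISE:
            continue
        sx, sy, count = sums.get(lab, (0.0, 0.0, 0))
        sums[lab] = (sx + x, sy + y, count + 1)
    return {lab: (sx / c, sy / c) for lab, (sx, sy, c) in sums.items()}
```

In this reading, each cluster centroid is a candidate stop location, and points labelled as noise correspond to fixes recorded while moving. The neighbour search is quadratic in the number of points, which is acceptable for a sketch but not for large tracking datasets.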
Pub Date : 2023-10-31 DOI: 10.1007/s10182-023-00482-y
Paul M. Beaumont, Aaron D. Smallwood
We analyze issues related to estimation and inference for the constrained sum of squares estimator (CSS) of the k-factor Gegenbauer autoregressive moving average (GARMA) model. We present theoretical results for the estimator and show that the parameters that determine the cycle lengths are asymptotically independent, converging at rate T, the sample size, for finite cycles. The remaining parameters lack independence and converge at the standard rate. Analogous with existing literature, some challenges exist for testing the hypothesis of non-cyclical long memory, since the associated parameter lies on the boundary of the parameter space. We present simulation results to explore small sample properties of the estimator, which support most distributional results, while also highlighting areas that merit additional exploration. We demonstrate the applicability of the theory and estimator with an application to IBM trading volume.
Title: Conditional sum of squares estimation of k-factor GARMA models. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 3, pp. 501-543.
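As background for the Gegenbauer filter behind GARMA models, the sketch below computes the power-series coefficients of (1 - 2uz + z^2)^(-d) with the standard three-term Gegenbauer recursion, and converts the parameter u into a cycle length through the frequency arccos(u). It illustrates a single Gegenbauer factor only; it is not the estimator studied in the paper, and the function names are hypothetical.

```python
import math


def gegenbauer_weights(d, u, n_terms):
    """First n_terms coefficients C_j of (1 - 2*u*z + z**2) ** (-d).

    Uses the recursion
      j * C_j = 2*u*(j - 1 + d) * C_{j-1} - (j - 2 + 2*d) * C_{j-2},
    started from C_0 = 1 and C_1 = 2*d*u.
    """
    if n_terms <= 0:
        return []
    weights = [1.0]
    if n_terms > 1:
        weights.append(2.0 * d * u)
    for j in range(2, n_terms):
        c = (2.0 * u * (j - 1 + d) / j) * weights[j - 1] \
            - ((j - 2 + 2.0 * d) / j) * weights[j - 2]
        weights.append(c)
    return weights


def cycle_length(u):
    """Period, in observations, of the cycle at frequency arccos(u)."""
    omega = math.acos(u)
    return math.inf if omega == 0.0 else 2.0 * math.pi / omega
```

For example, `cycle_length(0.0)` corresponds to a frequency of pi/2, that is, a cycle of four observations, while u = 1 gives frequency zero and therefore the non-cyclical long-memory case mentioned in the abstract.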
Pub Date : 2023-10-10 DOI: 10.1007/s10182-023-00483-x
Daniela Marella, Giuseppe Bove
In this paper, measures of interrater absolute agreement for quantitative measurements based on the standard deviation are proposed. Such indices make it possible (i) to overcome the limitations of the intraclass correlation index and (ii) to measure interrater agreement on single targets. Estimators of the proposed measures are introduced and their sampling properties are investigated for normal and non-normal data. Simulated data are employed to demonstrate the accuracy and practical utility of the new indices for assessing agreement. Finally, an application to assess the consistency of measurements performed by radiologists evaluating the tumor size of lung cancer is presented.
Title: Measures of interrater agreement for quantitative data. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 4, pp. 801-821. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00483-x.pdf
Pub Date : 2023-10-05 DOI: 10.1007/s10182-023-00481-z
Ton de Waal, Jacco Daalmans
Non-response is a major problem for anyone collecting and processing data. A commonly used technique to deal with missing data is imputation, where missing values are estimated and filled into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit.
Title: Calibrated imputation for multivariate categorical data. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 3, pp. 545-576. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00481-z.pdf
Pub Date : 2023-08-22 DOI: 10.1007/s10182-023-00479-7
Markus Loecher
Black box machine learning models are currently being used for high-stakes decision making in various parts of society such as healthcare and criminal justice. While tree-based ensemble methods such as random forests typically outperform deep learning models on tabular data sets, their built-in variable importance algorithms are known to be strongly biased toward high-entropy features. It was recently shown that the increasingly popular SHAP (SHapley Additive exPlanations) values suffer from a similar bias. We propose debiased or "shrunk" SHAP scores based on sample splitting which additionally enable the detection of overfitting issues at the feature level.
Title: Debiasing SHAP scores in random forests. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 427-440. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00479-7.pdf
Pub Date : 2023-06-15 DOI: 10.1007/s10182-023-00478-8
Antonio Di Noia, Marzia Marcheselli, Caterina Pisani, Luca Pratelli
A family of consistent tests, derived from a characterization of the probability generating function, is proposed for assessing Poissonity against a wide class of count distributions, which includes some of the most frequently adopted alternatives to the Poisson distribution. Specifically, the family of test statistics is based on the difference between the plug-in estimator of the Poisson cumulative distribution function and the empirical cumulative distribution function. The test statistics have an intuitive and simple form and are asymptotically normally distributed, allowing a straightforward implementation of the test. The finite sample properties of the test are investigated by means of an extensive simulation study. The test shows satisfactory behaviour compared to other tests with known limit distribution.
Title: A family of consistent normally distributed tests for Poissonity. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 209-223. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00478-8.pdf
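The two ingredients named in this abstract, the plug-in estimator of the Poisson cumulative distribution function and the empirical cumulative distribution function, can be computed as in the sketch below. The abstract does not give the exact form of the test statistic or its standardization, so the code stops at the pointwise differences; how they are combined into a statistic is left open, and the function names are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), accumulated term by term."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def cdf_differences(counts):
    """Plug-in Poisson CDF minus empirical CDF at k = 0, ..., max(counts).

    The plug-in estimator replaces the unknown Poisson mean by the sample mean.
    Returns the estimated mean and the list of differences.
    """
    n = len(counts)
    lam_hat = sum(counts) / n
    diffs = []
    for k in range(max(counts) + 1):
        ecdf = sum(1 for c in counts if c <= k) / n
        diffs.append(poisson_cdf(k, lam_hat) - ecdf)
    return lam_hat, diffs
```

Under a Poisson model these differences should fluctuate around zero; systematic departures at particular values of k point to alternatives such as overdispersed or zero-inflated counts.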
Pub Date : 2023-05-04 DOI: 10.1007/s10182-023-00475-x
Katarina Halaj, Bojana Milošević, Marko Obradović, M. Dolores Jiménez-Gamero
This paper uses independence-type characterizations to propose a class of test statistics which can be used for testing goodness-of-fit with several classes of null distributions. The resulting tests are consistent against fixed alternatives. Some limiting and small sample properties of the test statistics are explored. In comparison with common universal goodness-of-fit tests, the new tests exhibit better power for most of the alternatives considered, while in comparison with another characterization-based procedure, the new tests provide competitive or comparable power in various simulation settings. The practical usefulness of the proposed tests is demonstrated through several real-data examples.
Title: Correlation-type goodness-of-fit tests based on independence characterizations. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 1, pp. 185-207.
Pub Date : 2023-04-29 DOI: 10.1007/s10182-023-00477-9
Kristin Blesch, David S. Watson, Marvin N. Wright
Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates, i.e., between marginal and conditional measures. Our work draws attention to this rarely acknowledged, yet crucial distinction and showcases its implications. We find that few methods are available for testing conditional FI and practitioners have hitherto been severely restricted in method application due to mismatched data requirements. Most real-world data exhibits complex feature dependencies and incorporates both continuous and categorical features (i.e., mixed data). Both properties are often neglected by conditional FI measures. To fill this gap, we propose to combine the conditional predictive impact (CPI) framework with sequential knockoff sampling. The CPI enables conditional FI measurement that controls for any feature dependencies by sampling valid knockoffs (that is, synthetic data with similar statistical properties) for the data to be analysed. Sequential knockoffs were deliberately designed to handle mixed data and thus allow us to extend the CPI approach to such datasets. We demonstrate through numerous simulations and a real-world example that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures, whereas marginal FI metrics can result in misleading interpretations. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
Title: Conditional feature importance for mixed data. Journal: Asta-Advances in Statistical Analysis, vol. 108, no. 2, pp. 259-278. Open-access PDF: https://link.springer.com/content/pdf/10.1007/s10182-023-00477-9.pdf
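The testing step of a CPI-style analysis can be illustrated as a paired comparison of per-observation losses: the loss of a fitted model when one feature is replaced by its knockoff, minus the loss on the original data. The sketch below assumes those two loss vectors are already available (the model, the loss function and the knockoff sampler, including sequential knockoffs, are not shown) and returns the mean difference with a t statistic. Function and argument names are hypothetical.

```python
import math


def cpi_t_statistic(loss_original, loss_knockoff):
    """Mean loss increase from swapping in a knockoff, with its t statistic.

    loss_original[i]: loss of the fitted model on observation i, original data.
    loss_knockoff[i]: loss on observation i with one feature replaced by a knockoff.
    Returns (mean difference, standard error, t statistic).
    """
    deltas = [k - o for o, k in zip(loss_original, loss_knockoff)]
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    return mean, se, mean / se
```

A large positive t statistic, compared against a t distribution with n - 1 degrees of freedom in a one-sided test, suggests that the feature carries predictive information beyond what the remaining features provide. The sketch assumes at least two observations and non-constant differences, since the standard error is otherwise zero.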