Bayesian generalized additive model selection including a fast variational option
Pub Date: 2023-12-15 | DOI: 10.1007/s10182-023-00490-y
Virginia X. He, Matt P. Wand
We use Bayesian model selection paradigms, such as group least absolute shrinkage and selection operator priors, to facilitate generalized additive model selection. Our approach allows for the effects of continuous predictors to be categorized as either zero, linear or non-linear. Employment of carefully tailored auxiliary variables results in Gibbsian Markov chain Monte Carlo schemes for practical implementation of the approach. In addition, mean field variational algorithms with closed form updates are obtained. Whilst not as accurate, this fast variational option enhances scalability to very large data sets. A package in the R language aids use in practice.
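The zero/linear/non-linear categorization can be previewed with a frequentist stand-in: mgcv's double-penalty shrinkage smoothers (select = TRUE) can also shrink whole terms away, with effective degrees of freedom near 0 indicating a zero effect, near 1 a linear one, and larger values a non-linear one. A minimal sketch on simulated data follows; it uses mgcv, not the authors' Bayesian package, and is only an analogue of their group-lasso-prior approach.

```r
# Frequentist stand-in for zero/linear/non-linear effect categorization:
# with select = TRUE, mgcv adds a shrinkage penalty that can zero out whole
# smooth terms; the paper instead places group-lasso-type priors on the
# effects and uses Gibbs sampling or variational approximation.
library(mgcv)

set.seed(1)
n  <- 500
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
y  <- 1 + 2 * x1 + sin(2 * pi * x2) + rnorm(n)  # x3 truly has zero effect

fit <- gam(y ~ s(x1) + s(x2) + s(x3), select = TRUE, method = "REML")
round(summary(fit)$edf, 2)  # edf ~ 0: zero; ~ 1: linear; larger: non-linear
```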
{"title":"Bayesian generalized additive model selection including a fast variational option","authors":"Virginia X. He, Matt P. Wand","doi":"10.1007/s10182-023-00490-y","DOIUrl":"10.1007/s10182-023-00490-y","url":null,"abstract":"<div><p>We use Bayesian model selection paradigms, such as group least absolute shrinkage and selection operator priors, to facilitate generalized additive model selection. Our approach allows for the effects of continuous predictors to be categorized as either zero, linear or non-linear. Employment of carefully tailored auxiliary variables results in Gibbsian Markov chain Monte Carlo schemes for practical implementation of the approach. In addition, mean field variational algorithms with closed form updates are obtained. Whilst not as accurate, this fast variational option enhances scalability to very large data sets. A package in the <span>R</span> language aids use in practice.\u0000</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138690278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A note on sufficient dimension reduction with post dimension reduction statistical inference
Pub Date: 2023-12-13 | DOI: 10.1007/s10182-023-00491-x
Kyongwon Kim
Sufficient dimension reduction is a widely used tool for extracting the core information hidden in high-dimensional data when classifying, clustering, and predicting response variables. Various dimension reduction methods and their applications have been introduced over the past decades. Data analysis using sufficient dimension reduction involves two steps: dimension reduction and model estimation. In practice, however, the estimated sufficient predictor is treated as if it were the true predictor variable, and the model development step proceeds with statistical inference such as estimating confidence intervals and performing hypothesis tests. The resulting inference is incomplete, because it accounts only for errors from the model estimation step. Post dimension reduction inference is therefore an important topic: errors from the dimension reduction step must be considered as well. In this paper, we review the fundamentals of sufficient dimension reduction methods. Then, we introduce an intuitive and heuristic approach to the recently developed post dimension reduction statistical inference.
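For readers unfamiliar with the first of the two steps, here is a self-contained base-R sketch of sliced inverse regression, one classical sufficient dimension reduction method, on simulated data. The post dimension reduction inference reviewed in the paper would additionally propagate the sampling variability of the estimated directions.

```r
# Sliced inverse regression (SIR): slice the response, average the
# standardized predictors within slices, and eigen-decompose the weighted
# covariance of the slice means to recover the central subspace directions.
set.seed(1)
n <- 1000; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 2 * X[, 2] + 0.5 * rnorm(n)  # true direction spans (1,2,0,0,0)

sir <- function(X, y, n_slices = 10, d = 1) {
  Z  <- scale(X, center = TRUE, scale = FALSE)
  es <- eigen(cov(X))
  S_inv_sqrt <- es$vectors %*% diag(1 / sqrt(es$values)) %*% t(es$vectors)
  Zs <- Z %*% S_inv_sqrt                            # standardized predictors
  sl <- cut(rank(y, ties.method = "first"), n_slices, labels = FALSE)
  counts <- as.vector(table(sl))
  means  <- rowsum(Zs, sl) / counts                 # slice means
  M <- t(means) %*% (counts / nrow(X) * means)      # weighted cov of means
  B <- S_inv_sqrt %*% eigen(M)$vectors[, 1:d, drop = FALSE]
  sweep(B, 2, sqrt(colSums(B^2)), "/")              # unit-norm directions
}
sir(X, y)  # approximately proportional to (1, 2, 0, 0, 0)
```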
{"title":"A note on sufficient dimension reduction with post dimension reduction statistical inference","authors":"Kyongwon Kim","doi":"10.1007/s10182-023-00491-x","DOIUrl":"https://doi.org/10.1007/s10182-023-00491-x","url":null,"abstract":"<p>Sufficient dimension reduction is a widely used tool to extract core information hidden in high-dimensional data for classifying, clustering, and predicting response variables. Various dimension reduction methods and their applications have been introduced in the past decades. Data analysis using sufficient dimension reduction involves two steps: dimension reduction and model estimation. However, when we implement the two-step modeling process, we consider the estimated sufficient predictor as a true predictor variable and proceed to the model development step, which includes statistical inference such as estimating confidence intervals and performing hypothesis tests. However, the outcome obtained using this method is by no means complete because it contains errors only from the model estimation step. Therefore, post dimension reduction inference is an important topic because it is essential to consider errors from sufficient dimension reduction. In this paper, we review the fundamentals of sufficient dimension reduction methods. Then, we introduce an intuitive and heuristic approach for the recently developed post dimension reduction statistical inference.</p>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138581852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-modified count time series modeling with an application to influenza cases
Pub Date: 2023-11-27 | DOI: 10.1007/s10182-023-00488-6
Marinho G. Andrade, Katiane S. Conceição, Nalini Ravishanker
The past few decades have seen considerable interest in modeling time series of counts, with applications in many domains. Classical and Bayesian modeling have primarily focused on conditional Poisson sampling distributions at each time. There is very little research on modeling time series involving Zero-Modified (i.e., Zero Deflated or Inflated) distributions. This paper aims to fill this gap and develop models for count time series involving Zero-Modified distributions, which belong to the Power Series family and are suitable for time series exhibiting both zero-inflation and zero-deflation. A full Bayesian approach via the Hamiltonian Monte Carlo (HMC) technique enables accurate modeling and inference. The paper illustrates our approach using time series on the number of deaths from the influenza virus in the city of São Paulo, Brazil.
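As a concrete member of the family, the zero-modified Poisson mass function is sketched below in base R. A single parameter pi0 inflates zeros when positive and deflates them when negative, which is exactly the flexibility described above; the paper's models cover the wider power series family and add the time series dynamics and Bayesian (HMC) estimation.

```r
# Zero-modified Poisson (ZMP) probability mass function: mass pi0 is moved
# to (pi0 > 0, zero-inflation) or away from (pi0 < 0, zero-deflation) zero.
dzmp <- function(y, lambda, pi0) {
  base <- dpois(y, lambda)
  ifelse(y == 0, pi0 + (1 - pi0) * base, (1 - pi0) * base)
}

# Validity of deflation requires pi0 >= -exp(-lambda) / (1 - exp(-lambda)),
# so that the probability of zero stays non-negative.
lambda  <- 2
pi0_min <- -exp(-lambda) / (1 - exp(-lambda))
sum(dzmp(0:50, lambda, pi0 = 0.2))            # ~1, zero-inflated case
sum(dzmp(0:50, lambda, pi0 = 0.9 * pi0_min))  # ~1, zero-deflated case
```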
{"title":"Zero-modified count time series modeling with an application to influenza cases","authors":"Marinho G. Andrade, Katiane S. Conceição, Nalini Ravishanker","doi":"10.1007/s10182-023-00488-6","DOIUrl":"10.1007/s10182-023-00488-6","url":null,"abstract":"<div><p>The past few decades have seen considerable interest in modeling time series of counts, with applications in many domains. Classical and Bayesian modeling have primarily focused on conditional Poisson sampling distributions at each time. There is very little research on modeling time series involving Zero-Modified (i.e., Zero Deflated or Inflated) distributions. This paper aims to fill this gap and develop models for count time series involving Zero-Modified distributions, which belong to the Power Series family and are suitable for time series exhibiting both zero-inflation and zero-deflation. A full Bayesian approach via the Hamiltonian Monte Carlo (HMC) technique enables accurate modeling and inference. The paper illustrates our approach using time series on the number of deaths from the influenza virus in the city of São Paulo, Brazil.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138506562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixtures of generalized normal distributions and EGARCH models to analyse returns and volatility of ESG and traditional investments
Pub Date: 2023-11-18 | DOI: 10.1007/s10182-023-00487-7
Pierdomenico Duttilo, Stefano Antonio Gattone, Barbara Iannone
Environmental, social and governance (ESG) criteria are increasingly integrated into the investment process to help overcome global sustainability challenges. Focusing on the reaction to turmoil periods, this work analyses the returns and volatility of several ESG indices from 2016 to 2022 and compares them with their traditional counterparts. The indices cover the following markets: global, the US, Europe and emerging markets. Firstly, a two-component mixture of generalized normal distributions was exploited to objectively detect financial market turmoil periods with a naïve Bayes classifier. Secondly, an EGARCH-in-mean model with exogenous dummy variables was applied to capture the impact of turmoil periods. Results show that both returns and volatility are affected by turmoil periods. The return–risk performance differs by index type and market: the European ESG index is less volatile than its traditional market benchmark, while in the other markets the estimated volatility is approximately the same. Moreover, ESG and non-ESG indices differ in terms of the impact of turmoil periods, the risk premium and the leverage effect.
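The first modeling ingredient is easy to visualize. Below is a base-R sketch of a two-component mixture of generalized normal densities with illustrative (not fitted) parameter values: shape beta = 2 recovers the normal distribution, while smaller shapes give the heavier tails associated with turmoil periods.

```r
# Generalized normal density with location mu, scale alpha, shape beta.
dgnorm <- function(x, mu, alpha, beta) {
  beta / (2 * alpha * gamma(1 / beta)) * exp(-(abs(x - mu) / alpha)^beta)
}
# Two-component mixture with weight w on the first component.
dmix <- function(x, w, mu1, a1, b1, mu2, a2, b2) {
  w * dgnorm(x, mu1, a1, b1) + (1 - w) * dgnorm(x, mu2, a2, b2)
}
# A "calm" component (low spread, beta = 2) mixed with a wider,
# heavier-tailed "turmoil" component (beta < 2); values are illustrative.
curve(dmix(x, w = 0.8, mu1 = 0.05, a1 = 0.8, b1 = 2,
           mu2 = -0.1, a2 = 2.5, b2 = 1.2),
      from = -10, to = 10, ylab = "density",
      main = "Two-component generalized normal mixture")
```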
{"title":"Mixtures of generalized normal distributions and EGARCH models to analyse returns and volatility of ESG and traditional investments","authors":"Pierdomenico Duttilo, Stefano Antonio Gattone, Barbara Iannone","doi":"10.1007/s10182-023-00487-7","DOIUrl":"https://doi.org/10.1007/s10182-023-00487-7","url":null,"abstract":"<p>Environmental, social and governance (ESG) criteria are increasingly integrated into investment process to contribute to overcoming global sustainability challenges. Focusing on the reaction to turmoil periods, this work analyses returns and volatility of several ESG indices and makes a comparison with their traditional counterparts from 2016 to 2022. These indices comprise the following markets: Global, the US, Europe and emerging markets. Firstly, the two-component mixture of generalized normal distribution was exploited to objectively detect financial market turmoil periods with the Naïve Bayes’ classifier. Secondly, the EGARCH-in-mean model with exogenous dummy variables was applied to capture the turmoil period impact. Results show that returns and volatility are both affected by turmoil periods. The return–risk performance differs by index type and market: the European ESG index is less volatile than its traditional market benchmark, while in the other markets, the estimated volatility is approximately the same. Moreover, ESG and non-ESG indices differ in terms of turmoil periods impact, risk premium and leverage effect.</p>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138506566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixture of experts distributional regression: implementation using robust estimation with adaptive first-order methods
Pub Date: 2023-11-15 | DOI: 10.1007/s10182-023-00486-8
David Rügamer, Florian Pfisterer, Bernd Bischl, Bettina Grün
In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.
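A hand-rolled miniature of the core idea — fitting a mixture by directly minimizing its negative log-likelihood with an adaptive first-order optimizer — is sketched below in base R for a two-component normal mixture with full-batch Adam and a numerical gradient. mixdistreg performs the analogous optimization at scale on TensorFlow, with mini-batching, automatic differentiation and far richer families; this is only a toy analogue.

```r
# Fit a two-component normal mixture by minimizing the NLL with Adam.
set.seed(1)
y <- c(rnorm(300, -2, 0.5), rnorm(700, 3, 1))

nll <- function(th) {            # th = (logit w, mu1, log sd1, mu2, log sd2)
  w <- plogis(th[1])
  -sum(log(w * dnorm(y, th[2], exp(th[3])) +
           (1 - w) * dnorm(y, th[4], exp(th[5]))))
}
grad <- function(th, eps = 1e-6)  # central-difference gradient, for brevity
  sapply(seq_along(th), function(j) {
    e <- replace(numeric(length(th)), j, eps)
    (nll(th + e) - nll(th - e)) / (2 * eps)
  })

th <- c(0, -1, 0, 1, 0); m <- v <- numeric(5)
b1 <- 0.9; b2 <- 0.999; lr <- 0.05
for (t in 1:2000) {               # Adam with bias-corrected moments
  g  <- grad(th)
  m  <- b1 * m + (1 - b1) * g
  v  <- b2 * v + (1 - b2) * g^2
  th <- th - lr * (m / (1 - b1^t)) / (sqrt(v / (1 - b2^t)) + 1e-8)
}
c(w = plogis(th[1]), mu1 = th[2], sd1 = exp(th[3]),
  mu2 = th[4], sd2 = exp(th[5]))
```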
{"title":"Mixture of experts distributional regression: implementation using robust estimation with adaptive first-order methods","authors":"David Rügamer, Florian Pfisterer, Bernd Bischl, Bettina Grün","doi":"10.1007/s10182-023-00486-8","DOIUrl":"10.1007/s10182-023-00486-8","url":null,"abstract":"<div><p>In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in <i>mixdistreg</i>, an <span>R</span> software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10182-023-00486-8.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138506564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Bayesian approach to modeling topic-metadata relationships
Pub Date: 2023-11-03 | DOI: 10.1007/s10182-023-00485-9
Patrick Schulze, Simon Wiegrebe, Paul W. Thurner, Christian Heumann, Matthias Aßenmacher
The objective of advanced topic modeling is not only to explore latent topical structures, but also to estimate relationships between the discovered topics and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is itself estimated in an unsupervised fashion, usually by common topic models. A frequently used procedure to achieve this is the method of composition, a Monte Carlo sampling technique performing multiple repeated linear regressions of sampled topic proportions on metadata covariates. In this paper, we propose two modifications of this approach. First, we substantially refine the existing implementation of the method of composition from the R package stm by replacing linear regression with the more appropriate Beta regression. Second, we fundamentally enhance the entire estimation framework by substituting the current blending of frequentist and Bayesian methods with a fully Bayesian approach, which allows for a more appropriate quantification of uncertainty. We illustrate the improved methodology by investigating relationships between Twitter posts by German parliamentarians and metadata covariates related to their electoral districts, using the structural topic model to estimate topic proportions.
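The method of composition itself is compact enough to sketch. Below, simulated draws of a topic's document-level proportions stand in for posterior samples from a fitted topic model (which stm would supply in practice), and each draw is regressed on a metadata covariate with Beta regression via the betareg package; the spread of the coefficient across draws then pools regression and topic-estimation uncertainty. The paper's fully Bayesian enhancement replaces this frequentist-per-draw step.

```r
# Method of composition with Beta regression: regress each sampled vector of
# topic proportions on metadata, then pool the coefficients across draws.
library(betareg)   # betareg() fits Beta regression for (0, 1) responses

set.seed(1)
n_docs <- 200; n_draws <- 50
x       <- rnorm(n_docs)             # a metadata covariate
true_mu <- plogis(-1 + 0.8 * x)      # true mean topic proportion per document
coefs <- replicate(n_draws, {
  # one simulated posterior draw of the topic proportions per document
  theta_draw <- rbeta(n_docs, true_mu * 20, (1 - true_mu) * 20)
  coef(betareg(theta_draw ~ x))["x"]
})
mean(coefs); sd(coefs)   # estimate and uncertainty across composition draws
```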
{"title":"A Bayesian approach to modeling topic-metadata relationships","authors":"Patrick Schulze, Simon Wiegrebe, Paul W. Thurner, Christian Heumann, Matthias Aßenmacher","doi":"10.1007/s10182-023-00485-9","DOIUrl":"10.1007/s10182-023-00485-9","url":null,"abstract":"<div><p>The objective of advanced topic modeling is not only to explore latent topical structures, but also to estimate relationships between the discovered topics and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but instead being estimated itself in an unsupervised fashion, usually by common topic models. A frequently used procedure to achieve this is the <i>method of composition</i>, a Monte Carlo sampling technique performing multiple repeated linear regressions of sampled topic proportions on metadata covariates. In this paper, we propose two modifications of this approach: First, we substantially refine the existing implementation of the method of composition from the <span>R</span> package <span>stm</span> by replacing linear regression with the more appropriate Beta regression. Second, we provide a fundamental enhancement of the entire estimation framework by substituting the current blending of frequentist and Bayesian methods with a fully Bayesian approach. This allows for a more appropriate quantification of uncertainty. We illustrate our improved methodology by investigating relationships between Twitter posts by German parliamentarians and different metadata covariates related to their electoral districts, using the structural topic model to estimate topic proportions.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10182-023-00485-9.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135820119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPS data on tourists: a spatial analysis on road networks
Pub Date: 2023-11-03 | DOI: 10.1007/s10182-023-00484-w
Nicoletta D’Angelo, Antonino Abbruzzo, Mauro Ferrante, Giada Adelfio, Marcello Chiodi
This paper proposes a spatial point process model on a linear network to analyse cruise passengers’ stop activities. It identifies and models tourists’ stop intensity at the destination as a function of their main determinants. For this purpose, we consider data collected on cruise passengers through the integration of traditional questionnaire-based survey methods and GPS tracking data in two cities, namely Palermo (Italy) and Dubrovnik (Croatia). Firstly, the density-based spatial clustering of applications with noise algorithm is applied to identify stop locations from GPS tracking data. The influence of individual-related variables and itinerary-related characteristics is considered within a framework of a Gibbs point process model. The proposed model describes spatial stop intensity at the destination, accounting for the geometry of the underlying road network, individual-related variables, contextual-level information, and the spatial interaction amongst stop points. The analysis succeeds in quantifying the influence of both individual-related variables and trip-related characteristics on stop intensity. An interaction parameter allows for measuring the degree of dependence amongst cruise passengers in stop location decisions.
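Both analysis stages have readily available R implementations, and a toy sketch follows: dbscan extracts stop locations from simulated GPS fixes, and spatstat fits a point process model on its built-in example linear network. The fit shown is an inhomogeneous Poisson model; the paper's Gibbs model adds interaction between stop points and the survey covariates, and the coordinates and network here are stand-ins, not the Palermo or Dubrovnik data.

```r
# Stage 1: DBSCAN on simulated GPS fixes; cluster label 0 marks noise.
library(dbscan)
library(spatstat)

set.seed(1)
gps <- cbind(x = c(rnorm(50, 0.2, 0.01), rnorm(50, 0.6, 0.01)),
             y = c(rnorm(50, 0.4, 0.01), rnorm(50, 0.7, 0.01)))
cl   <- dbscan(gps, eps = 0.03, minPts = 10)
keep <- cl$cluster > 0
stops <- aggregate(as.data.frame(gps[keep, , drop = FALSE]),
                   by = list(cluster = cl$cluster[keep]), FUN = mean)
stops  # one centroid per identified stop location

# Stage 2: point process model on a linear network (toy stand-in for the
# road networks of the two destination cities).
L   <- simplenet            # small example linear network shipped with spatstat
X   <- runiflpp(40, L)      # stand-in stop points on the network
fit <- lppm(X ~ x + y)      # inhomogeneous Poisson model with spatial trend
fit
```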
{"title":"GPS data on tourists: a spatial analysis on road networks","authors":"Nicoletta D’Angelo, Antonino Abbruzzo, Mauro Ferrante, Giada Adelfio, Marcello Chiodi","doi":"10.1007/s10182-023-00484-w","DOIUrl":"10.1007/s10182-023-00484-w","url":null,"abstract":"<div><p>This paper proposes a spatial point process model on a linear network to analyse cruise passengers’ stop activities. It identifies and models tourists’ stop intensity at the destination as a function of their main determinants. For this purpose, we consider data collected on cruise passengers through the integration of traditional questionnaire-based survey methods and GPS tracking data in two cities, namely Palermo (Italy) and Dubrovnik (Croatia). Firstly, the density-based spatial clustering of applications with noise algorithm is applied to identify stop locations from GPS tracking data. The influence of individual-related variables and itinerary-related characteristics is considered within a framework of a Gibbs point process model. The proposed model describes spatial stop intensity at the destination, accounting for the geometry of the underlying road network, individual-related variables, contextual-level information, and the spatial interaction amongst stop points. The analysis succeeds in quantifying the influence of both individual-related variables and trip-related characteristics on stop intensity. An interaction parameter allows for measuring the degree of dependence amongst cruise passengers in stop location decisions.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10182-023-00484-w.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135819226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conditional sum of squares estimation of k-factor GARMA models
Pub Date: 2023-10-31 | DOI: 10.1007/s10182-023-00482-y
Paul M. Beaumont, Aaron D. Smallwood
We analyze issues related to estimation and inference for the conditional sum of squares (CSS) estimator of the k-factor Gegenbauer autoregressive moving average (GARMA) model. We present theoretical results for the estimator and show that the parameters that determine the cycle lengths are asymptotically independent, converging at rate T (the sample size) for finite cycles. The remaining parameters lack independence and converge at the standard rate. As in the existing literature, testing the hypothesis of non-cyclical long memory poses challenges, since the associated parameter lies on the boundary of the parameter space. We present simulation results exploring the small sample properties of the estimator; these support most of the distributional results while also highlighting areas that merit additional exploration. We demonstrate the applicability of the theory and the estimator with an application to IBM trading volume.
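The mechanics of CSS estimation for a single Gegenbauer factor can be sketched in a few lines of base R: build truncated AR(∞) weights from the standard Gegenbauer coefficient recursion, filter the series to obtain residuals, and hand the sum of squares to optim(). The data below are a stand-in AR(1) series rather than a true GARMA process, and the truncation length is illustrative.

```r
# Coefficients C_j(d, u) of (1 - 2uB + B^2)^(-d), via the usual recursion.
gegenbauer_coefs <- function(d, u, m) {
  C <- numeric(m + 1); C[1] <- 1; C[2] <- 2 * d * u
  for (j in 2:m)
    C[j + 1] <- 2 * u * ((d - 1) / j + 1) * C[j] -
                (2 * (d - 1) / j + 1) * C[j - 1]
  C
}

# CSS objective: apply the truncated inverse filter (1 - 2uB + B^2)^(+d),
# whose weights are C_j(-d, u), and sum the squared residuals.
css <- function(par, y, m = 200) {
  e <- stats::filter(y, gegenbauer_coefs(-par[1], par[2], m),
                     method = "convolution", sides = 1)
  sum(e^2, na.rm = TRUE)
}

set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 1000))  # stand-in series
optim(c(0.2, 0.8), css, y = y, method = "L-BFGS-B",
      lower = c(0.01, -0.99), upper = c(0.45, 0.99))$par  # (d, u) estimates
```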
{"title":"Conditional sum of squares estimation of k-factor GARMA models","authors":"Paul M. Beaumont, Aaron D. Smallwood","doi":"10.1007/s10182-023-00482-y","DOIUrl":"10.1007/s10182-023-00482-y","url":null,"abstract":"<div><p>We analyze issues related to estimation and inference for the constrained sum of squares estimator (CSS) of the <i>k</i>-factor Gegenbauer autoregressive moving average (GARMA) model. We present theoretical results for the estimator and show that the parameters that determine the cycle lengths are asymptotically independent, converging at rate <i>T</i>, the sample size, for finite cycles. The remaining parameters lack independence and converge at the standard rate. Analogous with existing literature, some challenges exist for testing the hypothesis of non-cyclical long memory, since the associated parameter lies on the boundary of the parameter space. We present simulation results to explore small sample properties of the estimator, which support most distributional results, while also highlighting areas that merit additional exploration. We demonstrate the applicability of the theory and estimator with an application to IBM trading volume.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135870088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Measures of interrater agreement for quantitative data
Pub Date: 2023-10-10 | DOI: 10.1007/s10182-023-00483-x
Daniela Marella, Giuseppe Bove
In this paper, measures of interrater absolute agreement for quantitative measurements based on the standard deviation are proposed. Such indices allow one (i) to overcome the limits affecting the intraclass correlation index and (ii) to measure interrater agreement on single targets. Estimators of the proposed measures are introduced and their sampling properties are investigated for normal and non-normal data. Simulated data are employed to demonstrate the accuracy and practical utility of the new indices for assessing agreement. Finally, an application assessing the consistency of measurements performed by radiologists evaluating lung cancer tumor size is presented.
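A minimal sketch of the underlying idea, not the paper's exact indices: on a bounded rating scale, the population standard deviation of one target's ratings can be compared with the largest standard deviation the scale allows, giving an agreement score of 1 for identical ratings and 0 for maximal disagreement. The bound (b − a)/2 used below is exact for an even number of raters.

```r
# SD-based single-target agreement on a bounded scale [a, b]: 1 means all
# raters agree exactly, 0 means ratings split between the two endpoints.
agree_sd <- function(x, a, b) {
  s_max <- (b - a) / 2                                 # max population SD
  s_pop <- sd(x) * sqrt((length(x) - 1) / length(x))   # population SD of x
  1 - s_pop / s_max
}
agree_sd(c(7, 7, 8, 7),   a = 0, b = 10)  # high agreement, close to 1
agree_sd(c(0, 10, 0, 10), a = 0, b = 10)  # maximal disagreement -> 0
```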
{"title":"Measures of interrater agreement for quantitative data","authors":"Daniela Marella, Giuseppe Bove","doi":"10.1007/s10182-023-00483-x","DOIUrl":"https://doi.org/10.1007/s10182-023-00483-x","url":null,"abstract":"Abstract In this paper measures of interrater absolute agreement for quantitative measurements based on the standard deviation are proposed. Such indices allow (i) to overcome the limits affecting the intraclass correlation index; (ii) to measure the interrater agreement on single targets. Estimators of the proposed measures are introduced and their sampling properties are investigated for normal and non-normal data. Simulated data are employed to demonstrate the accuracy and practical utility of the new indices for assessing agreement. Finally, an application to assess the consistency of measurements performed by radiologists evaluating tumor size of lung cancer is presented.","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136296350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Calibrated imputation for multivariate categorical data
Pub Date: 2023-10-05 | DOI: 10.1007/s10182-023-00481-z
Ton de Waal, Jacco Daalmans
Non-response is a major problem for anyone collecting and processing data. A commonly used technique to deal with missing data is imputation, where missing values are estimated and filled into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit.
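A toy sketch of the calibration idea with hypothetical data: imputations for a single categorical variable are drawn from estimated imputation probabilities while a running quota keeps the imputed counts consistent with known category totals. The paper's approach is more general (multivariate, with logical edit restrictions); the sequential quota below only conveys the flavor, and any model supplying the probabilities would do.

```r
# Impute a categorical variable so that known category totals are preserved.
set.seed(1)
n_missing  <- 10
categories <- c("A", "B", "C")
known_left <- c(A = 4, B = 3, C = 3)     # totals still to be filled in
probs <- matrix(runif(n_missing * 3), n_missing, 3,
                dimnames = list(NULL, categories))
probs <- probs / rowSums(probs)          # estimated imputation probabilities

imputed <- character(n_missing)
for (i in seq_len(n_missing)) {
  p <- probs[i, ] * (known_left > 0)     # block exhausted categories
  imputed[i] <- sample(categories, 1, prob = p)
  known_left[imputed[i]] <- known_left[imputed[i]] - 1
}
table(imputed)                           # matches the known totals exactly
```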
{"title":"Calibrated imputation for multivariate categorical data","authors":"Ton de Waal, Jacco Daalmans","doi":"10.1007/s10182-023-00481-z","DOIUrl":"10.1007/s10182-023-00481-z","url":null,"abstract":"<div><p>Non-response is a major problem for anyone collecting and processing data. A commonly used technique to deal with missing data is imputation, where missing values are estimated and filled in into the dataset. Imputation can become challenging if the variable to be imputed has to comply with a known total. Even more challenging is the case where several variables in the same dataset need to be imputed and, in addition to known totals, logical restrictions between variables have to be satisfied. In our paper, we develop an approach for a broad class of imputation methods for multivariate categorical data such that previously published totals are preserved while logical restrictions on the data are satisfied. The developed approach can be used in combination with any imputation model that estimates imputation probabilities, i.e. the probability that imputation of a certain category for a variable in a certain unit leads to the correct value for this variable and unit.</p></div>","PeriodicalId":55446,"journal":{"name":"Asta-Advances in Statistical Analysis","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10182-023-00481-z.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135482185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}