Pub Date: 2024-03-21 | DOI: 10.1007/s11222-024-10420-w
Resampling-based confidence intervals and bands for the average treatment effect in observational studies with competing risks
Jasmin Rühl, Sarah Friedrich
The g-formula can be used to estimate the treatment effect while accounting for confounding bias in observational studies. With regard to time-to-event endpoints, possibly subject to competing risks, however, the construction of valid pointwise confidence intervals and time-simultaneous confidence bands for the causal risk difference is complicated. A convenient solution is to approximate the asymptotic distribution of the corresponding stochastic process by means of resampling approaches. In this paper, we consider three different resampling methods: the classical nonparametric bootstrap, the influence function combined with a resampling approach, and a martingale-based bootstrap version, the so-called wild bootstrap. For the latter, three sub-versions based on differing distributions of the underlying random multipliers are examined. We set up a simulation study to compare the accuracy of the different techniques, which reveals that the wild bootstrap should in general be preferred if the sample size is moderate and sufficient data on the event of interest have been accrued. For illustration, the resampling methods are further applied to data on long-term survival in patients with early-stage Hodgkin’s disease.
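To make the multiplier idea concrete, here is a minimal R sketch of a wild bootstrap for a martingale-type statistic: residuals are perturbed by i.i.d. mean-zero, unit-variance multipliers and the statistic is recomputed. The three multiplier laws shown (standard normal, centred Poisson(1), Rademacher) are common choices in the multiplier-bootstrap literature; the paper's exact sub-versions may differ.

```r
## Minimal wild-bootstrap sketch (illustrative only): perturb residuals with
## i.i.d. mean-zero, unit-variance multipliers and recompute the statistic.
set.seed(1)
n <- 200
res <- rnorm(n)                           # stand-in for martingale residuals
multipliers <- function(n, law) {
  switch(law,
         normal     = rnorm(n),
         poisson    = rpois(n, 1) - 1,    # centred Poisson(1)
         rademacher = sample(c(-1, 1), n, replace = TRUE))
}
B <- 2000
stat_star <- replicate(B, sqrt(n) * mean(multipliers(n, "rademacher") * res))
quantile(abs(stat_star), 0.95)            # resampling-based critical value
```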
Pub Date: 2024-03-19 | DOI: 10.1007/s11222-024-10416-6
A constant-per-iteration likelihood ratio test for online changepoint detection for exponential family models
Kes Ward, Gaetano Romano, Idris Eckley, Paul Fearnhead
Online changepoint detection algorithms that are based on (generalised) likelihood-ratio tests have been shown to have excellent statistical properties. However, a simple online implementation is computationally infeasible as, at time T, it involves considering O(T) possible locations for the change. Recently, the FOCuS algorithm has been introduced for detecting changes in mean in Gaussian data; it decreases the per-iteration cost to O(log T). This is possible by using pruning ideas, which reduce the set of changepoint locations that need to be considered at time T to approximately log T. We show that if one wishes to perform the likelihood ratio test for a different one-parameter exponential family model, then exactly the same pruning rule can be used, and again one need only consider approximately log T locations at iteration T. Furthermore, we show how we can adaptively perform the maximisation step of the algorithm so that we need only maximise the test statistic over a small subset of these possible locations. Empirical results show that the resulting online algorithm, which can detect changes under a wide range of models, has a constant per-iteration cost on average.
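For intuition, the statistic being maximised can be written down directly. The brute-force version below scans all O(T) candidate locations for a change in Gaussian mean (pre-change mean zero, unit variance); FOCuS computes the same maximum while pruning the candidate set to roughly log T locations. This is a sketch of the statistic only, not the paper's algorithm.

```r
## Brute-force O(T) scan for a change in mean at time T: the maximised
## 2 * log likelihood ratio is (S_T - S_tau)^2 / (T - tau), with S the
## cumulative sum. FOCuS evaluates the same maximum at O(log T) cost.
lr_stats <- function(x) {
  T   <- length(x)
  cs  <- cumsum(x)
  tau <- seq_len(T - 1)                    # candidate change locations
  (cs[T] - cs[tau])^2 / (T - tau)
}
set.seed(2)
x <- c(rnorm(100), rnorm(50, mean = 1))    # true change at time 100
which.max(lr_stats(x))                     # estimated change location
max(lr_stats(x))                           # compare against a threshold
```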
Pub Date: 2024-03-19 | DOI: 10.1007/s11222-024-10410-y
Improving model choice in classification: an approach based on clustering of covariance matrices
David Rodríguez-Vítores, Carlos Matrán
This work introduces a refinement of the Parsimonious Model for fitting a Gaussian Mixture. The improvement is based on considering clusters of the involved covariance matrices according to a similarity criterion, such as shared Principal Directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. We show that such groupings of covariance matrices can be achieved through simple modifications of the CEM (Classification Expectation Maximization) algorithm. Our approach leads us to propose Gaussian Mixture Models for model-based clustering and discriminant analysis in which covariance matrices are clustered according to a parsimonious criterion, creating intermediate steps between the fourteen widely known parsimonious models. The added versatility not only allows us to obtain models with fewer parameters for fitting the data, but also provides greater interpretability. We show its usefulness for model-based clustering and discriminant analysis, providing algorithms to find approximate solutions satisfying suitable size, shape and orientation constraints, and applying them to both simulated and real data examples.
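For context, the fourteen parsimonious models referenced above are those generated by the eigen-decomposition of the cluster covariance matrices, as implemented for instance in the R package mclust. A baseline fit under those models is sketched below (assuming mclust is installed); the refinement proposed here would additionally cluster the covariance matrices themselves, e.g. by shared principal directions.

```r
## Baseline: fit the standard parsimonious Gaussian mixtures and let BIC
## choose among the fourteen covariance structures. The proposed refinement
## adds intermediate models by clustering the covariance matrices themselves.
library(mclust)
fit <- Mclust(iris[, 1:4], G = 3)   # model-based clustering with 3 components
summary(fit)                        # reports the selected covariance model,
                                    # e.g. "VEV" (equal shape, varying volume
                                    # and orientation across clusters)
```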
{"title":"Improving model choice in classification: an approach based on clustering of covariance matrices","authors":"David Rodríguez-Vítores, Carlos Matrán","doi":"10.1007/s11222-024-10410-y","DOIUrl":"https://doi.org/10.1007/s11222-024-10410-y","url":null,"abstract":"<p>This work introduces a refinement of the Parsimonious Model for fitting a Gaussian Mixture. The improvement is based on the consideration of clusters of the involved covariance matrices according to a criterion, such as sharing Principal Directions. This and other similarity criteria that arise from the spectral decomposition of a matrix are the bases of the Parsimonious Model. We show that such groupings of covariance matrices can be achieved through simple modifications of the CEM (Classification Expectation Maximization) algorithm. Our approach leads to propose Gaussian Mixture Models for model-based clustering and discriminant analysis, in which covariance matrices are clustered according to a parsimonious criterion, creating intermediate steps between the fourteen widely known parsimonious models. The added versatility not only allows us to obtain models with fewer parameters for fitting the data, but also provides greater interpretability. We show its usefulness for model-based clustering and discriminant analysis, providing algorithms to find approximate solutions verifying suitable size, shape and orientation constraints, and applying them to both simulation and real data examples.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140169060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-18 | DOI: 10.1007/s11222-023-10379-0
Functional mixtures-of-experts
Faïcel Chamroukhi, Nhat Thien Pham, Van Hà Hoang, Geoffrey J. McLachlan
We consider the statistical analysis of heterogeneous data for prediction in situations where the observations include functions, typically time series. We extend modeling with mixtures-of-experts (ME), a framework of choice for modeling heterogeneity in data for prediction with vectorial observations, to this functional data analysis context. We first present a new family of ME models, named functional ME (FME), in which the predictors are potentially noisy observations from entire functions. Furthermore, the data-generating process of the predictor and the real response is governed by a hidden discrete variable representing an unknown partition. Second, by imposing sparsity on derivatives of the underlying functional parameters via Lasso-like regularizations, we provide sparse and interpretable functional representations of the FME models, called iFME. We develop dedicated expectation–maximization algorithms for Lasso-like regularized maximum-likelihood parameter estimation to fit the models. The proposed models and algorithms are studied in simulated scenarios and in applications to two real data sets, and the obtained results demonstrate their performance in accurately capturing complex nonlinear relationships and in clustering heterogeneous regression data.
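As background, a (vectorial) mixture-of-experts density combines a softmax gating network with expert-specific regressions; FME lets functional predictors enter both parts through functional coefficients. A minimal sketch of the vectorial backbone, with hypothetical parameter values:

```r
## Mixture-of-experts density: softmax gating plus Gaussian experts, in the
## vectorial setting that FME extends to functional predictors.
me_density <- function(y, x, W, beta, sigma) {
  Z   <- cbind(1, x)                     # n x 2 design with intercept
  eta <- Z %*% W                         # gating scores, n x K
  pik <- exp(eta) / rowSums(exp(eta))    # softmax mixing proportions
  mu  <- Z %*% beta                      # expert-specific means, n x K
  dens <- sapply(seq_along(sigma), function(k) dnorm(y, mu[, k], sigma[k]))
  rowSums(pik * dens)                    # mixture density at each observation
}
set.seed(3)
x <- runif(100)
W    <- matrix(c(0, 0, -2, 4), 2, 2)     # gate: expert 2 favoured for x > 0.5
beta <- matrix(c(0, 1, 3, -1), 2, 2)     # hypothetical expert regression lines
y <- rnorm(100, mean = 1 + 2 * x)        # toy responses
loglik <- sum(log(me_density(y, x, W, beta, sigma = c(1, 1))))
```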
{"title":"Functional mixtures-of-experts","authors":"Faïcel Chamroukhi, Nhat Thien Pham, Van Hà Hoang, Geoffrey J. McLachlan","doi":"10.1007/s11222-023-10379-0","DOIUrl":"https://doi.org/10.1007/s11222-023-10379-0","url":null,"abstract":"<p>We consider the statistical analysis of heterogeneous data for prediction, in situations where the observations include functions, typically time series. We extend the modeling with mixtures-of-experts (ME), as a framework of choice in modeling heterogeneity in data for prediction with vectorial observations, to this functional data analysis context. We first present a new family of ME models, named functional ME (FME), in which the predictors are potentially noisy observations, from entire functions. Furthermore, the data generating process of the predictor and the real response, is governed by a hidden discrete variable representing an unknown partition. Second, by imposing sparsity on derivatives of the underlying functional parameters via Lasso-like regularizations, we provide sparse and interpretable functional representations of the FME models called iFME. We develop dedicated expectation–maximization algorithms for Lasso-like regularized maximum-likelihood parameter estimation strategies to fit the models. The proposed models and algorithms are studied in simulated scenarios and in applications to two real data sets, and the obtained results demonstrate their performance in accurately capturing complex nonlinear relationships and in clustering the heterogeneous regression data.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140152764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-17 | DOI: 10.1007/s11222-024-10396-7
Expectile and M-quantile regression for panel data
Ian Meneghel Danilevicz, Valdério Anselmo Reisen, Pascal Bondon
Linear fixed effect models are a general way to fit panel or longitudinal data with a distinct intercept for each unit. Based on expectile and M-quantile approaches, we propose alternative regression estimation methods for the parameters of linear fixed effect models. The estimation functions are penalized by the least absolute shrinkage and selection operator to reduce the dimensionality of the data. Some asymptotic properties of the estimators are established, and finite-sample investigations are conducted to verify the empirical performance of the estimation methods. The computational implementation of the procedures is discussed, and real economic panel data from the Organisation for Economic Cooperation and Development are analyzed to show the usefulness of the methods in a practical problem.
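The expectile loss underlying one of the two approaches is an asymmetrically weighted squared error, so estimation stays close to least squares while targeting the tails. A minimal unpenalised sketch without fixed effects; the paper's estimators add unit-specific intercepts and the lasso penalty.

```r
## Expectile regression by direct minimisation of the asymmetric squared
## loss: weight tau on positive residuals, 1 - tau on negative ones.
expectile_loss <- function(beta, y, X, tau) {
  r <- y - X %*% beta
  sum(abs(tau - (r < 0)) * r^2)
}
set.seed(4)
n <- 200
X <- cbind(1, rnorm(n))
y <- X %*% c(1, 2) + rnorm(n)
fit <- optim(c(0, 0), expectile_loss, y = y, X = X, tau = 0.8)
fit$par   # slope near 2; intercept shifted to the 0.8-expectile of the noise
```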
{"title":"Expectile and M-quantile regression for panel data","authors":"Ian Meneghel Danilevicz, Valdério Anselmo Reisen, Pascal Bondon","doi":"10.1007/s11222-024-10396-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10396-7","url":null,"abstract":"<p>Linear fixed effect models are a general way to fit panel or longitudinal data with a distinct intercept for each unit. Based on expectile and M-quantile approaches, we propose alternative regression estimation methods to estimate the parameters of linear fixed effect models. The estimation functions are penalized by the least absolute shrinkage and selection operator to reduce the dimensionality of the data. Some asymptotic properties of the estimators are established, and finite sample size investigations are conducted to verify the empirical performances of the estimation methods. The computational implementations of the procedures are discussed, and real economic panel data from the Organisation for Economic Cooperation and Development are analyzed to show the usefulness of the methods in a practical problem.\u0000</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140152765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-16 | DOI: 10.1007/s11222-024-10401-z
Matrix regression heterogeneity analysis
Fengchuan Zhang, Sanguo Zhang, Shi-Ming Li, Mingyang Ren
The development of modern science and technology has facilitated the collection of large amounts of matrix data in fields such as biomedicine. Matrix data modeling has been extensively studied, advancing beyond the naive approach of flattening the matrix into a vector. However, existing matrix modeling methods mainly focus on homogeneous data and fail to handle the data heterogeneity frequently encountered in the biomedical field, where samples from the same study belong to several underlying subgroups and different subgroups follow different models. In this paper, we focus on regression-based heterogeneity analysis. We propose a matrix data heterogeneity analysis framework that combines matrix bilinear sparse decomposition and penalized fusion techniques, enabling data-driven subgroup detection, including determining the number of subgroups and the subgroup membership. A rigorous theoretical analysis is conducted, including asymptotic consistency in terms of subgroup detection, the number of subgroups, and regression coefficients. Numerous numerical studies based on simulated and real data are conducted, showcasing the superior performance of the proposed method in analyzing matrix heterogeneous data.
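To see what a bilinear decomposition buys, consider a rank-one coefficient matrix C = u v', which turns the matrix regression on <X, C> into u' X v and cuts the parameter count from p*q to p + q. Below is a toy alternating-least-squares sketch of this regression component alone; the subgroup-detection machinery (penalized fusion across samples) is not shown.

```r
## Rank-one bilinear matrix regression y_i = u' X_i v + noise, fitted by
## alternating least squares. Illustrates the decomposition only.
set.seed(5)
p <- 8; q <- 6; n <- 200
u_true <- c(1, -1, rep(0, p - 2))
v_true <- c(2, rep(0, q - 1))
X <- array(rnorm(n * p * q), c(n, p, q))
y <- apply(X, 1, function(Xi) as.numeric(t(u_true) %*% Xi %*% v_true)) +
  rnorm(n, sd = 0.1)
v_hat <- rnorm(q)                                      # random start
for (it in 1:25) {
  Zv <- t(apply(X, 1, function(Xi) Xi %*% v_hat))      # n x p design, v fixed
  u_hat <- qr.solve(Zv, y)
  Zu <- t(apply(X, 1, function(Xi) t(Xi) %*% u_hat))   # n x q design, u fixed
  v_hat <- qr.solve(Zu, y)
}
max(abs(outer(u_hat, v_hat) - outer(u_true, v_true)))  # C recovered up to noise
```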
{"title":"Matrix regression heterogeneity analysis","authors":"Fengchuan Zhang, Sanguo Zhang, Shi-Ming Li, Mingyang Ren","doi":"10.1007/s11222-024-10401-z","DOIUrl":"https://doi.org/10.1007/s11222-024-10401-z","url":null,"abstract":"<p>The development of modern science and technology has facilitated the collection of a large amount of matrix data in fields such as biomedicine. Matrix data modeling has been extensively studied, which advances from the naive approach of flattening the matrix into a vector. However, existing matrix modeling methods mainly focus on homogeneous data, failing to handle the data heterogeneity frequently encountered in the biomedical field, where samples from the same study belong to several underlying subgroups, and different subgroups follow different models. In this paper, we focus on regression-based heterogeneity analysis. We propose a matrix data heterogeneity analysis framework, by combining matrix bilinear sparse decomposition and penalized fusion techniques, which enables data-driven subgroup detection, including determining the number of subgroups and subgrouping membership. A rigorous theoretical analysis is conducted, including asymptotic consistency in terms of subgroup detection, the number of subgroups, and regression coefficients. Numerous numerical studies based on simulated and real data have been constructed, showcasing the superior performance of the proposed method in analyzing matrix heterogeneous data.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140152763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-16 | DOI: 10.1007/s11222-024-10407-7
Doubly robust estimation of optimal treatment regimes for survival data using an instrumental variable
Xia Junwen, Zhan Zishu, Zhang Jingxiao
In survival contexts, substantial literature exists on estimating optimal treatment regimes, where treatments are assigned based on personal characteristics to maximize the survival probability. These methods assume that a set of covariates is sufficient to deconfound the treatment-outcome relationship. However, this assumption may fail in observational studies or in randomized trials in which non-adherence occurs. Therefore, we propose a novel approach to estimating optimal treatment regimes when certain confounders are unobservable and a binary instrumental variable is available. Specifically, via a binary instrumental variable, we propose a semiparametric estimator for optimal treatment regimes by maximizing a Kaplan–Meier-like estimator of the survival function. Furthermore, to increase resistance to model misspecification, we construct novel doubly robust estimators. Since the estimators of the survival function are jagged step functions, we incorporate kernel smoothing methods to improve performance. Under appropriate regularity conditions, the asymptotic properties are rigorously established. Moreover, the finite sample performance is evaluated through simulation studies. Finally, we illustrate our method using data from the National Cancer Institute’s prostate, lung, colorectal, and ovarian cancer screening trial.
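The doubly robust construction is easiest to see in its simplest uncensored form, sketched below: the AIPW estimator of an average treatment effect is consistent if either the outcome regressions or the propensity model is correctly specified. The paper extends this principle to censored survival outcomes with an instrumental variable; this sketch covers only the no-unmeasured-confounding backbone.

```r
## AIPW (doubly robust) estimate of the average treatment effect:
## consistent if the outcome models OR the propensity model is correct.
set.seed(6)
n <- 500
x <- rnorm(n)
a <- rbinom(n, 1, plogis(0.5 * x))              # confounded treatment
y <- 1 + 0.5 * x + a * (1 + x) + rnorm(n)       # true ATE = 1
e_hat <- fitted(glm(a ~ x, family = binomial))  # propensity model
m1 <- predict(lm(y ~ x, subset = a == 1), newdata = data.frame(x = x))
m0 <- predict(lm(y ~ x, subset = a == 0), newdata = data.frame(x = x))
mean(m1 - m0 +
     a * (y - m1) / e_hat -
     (1 - a) * (y - m0) / (1 - e_hat))          # close to 1
```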
{"title":"Doubly robust estimation of optimal treatment regimes for survival data using an instrumental variable","authors":"Xia Junwen, Zhan Zishu, Zhang Jingxiao","doi":"10.1007/s11222-024-10407-7","DOIUrl":"https://doi.org/10.1007/s11222-024-10407-7","url":null,"abstract":"<p>In survival contexts, substantial literature exists on estimating optimal treatment regimes, where treatments are assigned based on personal characteristics to maximize the survival probability. These methods assume that a set of covariates is sufficient to deconfound the treatment-outcome relationship. However, this assumption can be limited in observational studies or randomized trials in which non-adherence occurs. Therefore, we propose a novel approach to estimating optimal treatment regimes when certain confounders are unobservable and a binary instrumental variable is available. Specifically, via a binary instrumental variable, we propose a semiparametric estimator for optimal treatment regimes by maximizing a Kaplan–Meier-like estimator of the survival function. Furthermore, to increase resistance to model misspecification, we construct novel doubly robust estimators. Since the estimators of the survival function are jagged, we incorporate kernel smoothing methods to improve performance. Under appropriate regularity conditions, the asymptotic properties are rigorously established. Moreover, the finite sample performance is evaluated through simulation studies. Finally, we illustrate our method using data from the National Cancer Institute’s prostate, lung, colorectal, and ovarian cancer screening trial.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140156495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-03-14 | DOI: 10.1007/s11222-024-10406-8
Quantile ratio regression
Alessio Farcomeni, Marco Geraci
We introduce quantile ratio regression. Our proposed model assumes that the ratio of two arbitrary quantiles of a continuous response distribution is a function of a linear predictor. Thanks to basic quantile properties, estimation can be carried out on the scale of either the response or the link function. The advantage of using the latter becomes tangible when implementing fast optimizers for linear regression on large datasets. We establish the theoretical properties of the estimator and derive an efficient method for obtaining standard errors. The good performance and merit of our methods are illustrated by means of a simulation study and a real data analysis, in which we investigate income inequality in the European Union (EU) using data from a sample of about two million households. We find a significant association between inequality, as measured by quantile ratios, and certain macroeconomic indicators, and we identify countries with outlying income inequality relative to the rest of the EU. An R implementation of the proposed methods is freely available.
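As a back-of-envelope view of the estimand: under a log-linear model, the log of the ratio of two conditional quantiles equals the difference of two quantile-regression fits on log(y). This is not the authors' joint estimator, which models the ratio directly, but it makes the target quantity visible (assuming the quantreg package is installed).

```r
## The log quantile ratio log(q_0.9 / q_0.1) as a linear function of x,
## read off from two separate quantile regressions on the log scale.
library(quantreg)
set.seed(7)
n <- 5000
x <- runif(n)
y <- exp(1 + x + (0.5 + x) * rnorm(n))  # spread (inequality) grows with x
f9 <- rq(log(y) ~ x, tau = 0.9)
f1 <- rq(log(y) ~ x, tau = 0.1)
coef(f9) - coef(f1)   # approx (0.5 + x) * (z_0.9 - z_0.1): positive slope
```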
Pub Date: 2024-03-13 | DOI: 10.1007/s11222-024-10412-w
Robust score matching for compositional data
Janice L. Scealy, Kassel L. Hingee, John T. Kent, Andrew T. A. Wood
The restricted polynomially-tilted pairwise interaction (RPPI) distribution gives a flexible model for compositional data. It is particularly well suited to situations where some of the marginal distributions of the components of a composition are concentrated near zero, possibly with right skewness. This article develops a method of tractable robust estimation for the model by combining two ideas. The first idea is to use score matching estimation after an additive log-ratio transformation; the resulting estimator is automatically insensitive to zeros in the data compositions. The second idea is to incorporate suitable weights in the estimating equations, which additionally makes the estimator resistant to outliers. These properties are confirmed in simulation studies, where we also demonstrate that the new outlier-robust estimator is efficient in high-concentration settings, even when there is no model contamination. An example is given using microbiome data. A user-friendly R package accompanies the article.
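To show why score matching sidesteps the normalising constant, here is a one-dimensional Gaussian toy version of the (unweighted) objective: the sample average of 0.5 l'(y)^2 + l''(y), where l is the log-density. The paper's estimator works on alr-transformed compositions and adds weights for outlier resistance; this sketch illustrates the score matching step only.

```r
## Score matching for N(mu, s2): l'(y) = -(y - mu)/s2 and l''(y) = -1/s2, so
## the objective is mean( 0.5 * ((y - mu)/s2)^2 - 1/s2 ) -- no normalising
## constant is ever evaluated.
sm_obj <- function(par, y) {
  mu <- par[1]
  s2 <- exp(par[2])                        # keep the variance positive
  mean(0.5 * ((y - mu) / s2)^2 - 1 / s2)
}
set.seed(8)
y <- rnorm(300, mean = 2, sd = 1.5)
est <- optim(c(0, 0), sm_obj, y = y)
c(mu = est$par[1], sigma2 = exp(est$par[2]))  # close to (2, 2.25)
```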
Pub Date: 2024-03-12 | DOI: 10.1007/s11222-024-10414-8
Quantile generalized measures of correlation
Xinyu Zhang, Hongwei Shi, Niwen Zhou, Falong Tan, Xu Guo
In this paper, we introduce a quantile Generalized Measure of Correlation (GMC) to describe the nonlinear quantile relationship between a response variable and predictors. The introduced correlation takes values between zero and one; it is zero if and only if the conditional quantile function is equal to the unconditional quantile. We also introduce a quantile partial Generalized Measure of Correlation. Estimators of these correlations are developed. Notably, by adopting machine learning methods, our estimation procedures allow the dimension of the predictors to be very large. Under mild conditions, we establish the estimators’ consistency. For the construction of confidence intervals, we adopt sample splitting and show that the corresponding estimators are asymptotically normal. We also consider a composite quantile GMC that integrates information from different quantile levels. Numerical studies are conducted to illustrate our methods. Moreover, we apply our methods to analyze genome-wide association study data from Carworth Farms White mice.
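A plug-in flavour of such a quantity can be sketched as a quantile analogue of R^2: one minus the ratio of the check loss under a conditional quantile fit to the check loss under the unconditional quantile, in the spirit of Koenker and Machado's R1(tau). The paper's quantile GMC is defined at the population level and estimated with machine-learning fits and sample splitting; the sketch below (assuming quantreg) only conveys the idea.

```r
## Quantile-R2-type plug-in: zero when the conditional tau-quantile adds
## nothing over the unconditional tau-quantile, approaching one otherwise.
library(quantreg)
check <- function(u, tau) u * (tau - (u < 0))   # quantile check loss
set.seed(9)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
tau <- 0.5
r_cond   <- resid(rq(y ~ x, tau = tau))         # conditional quantile fit
r_uncond <- y - quantile(y, tau)                # unconditional quantile
1 - sum(check(r_cond, tau)) / sum(check(r_uncond, tau))   # in [0, 1]
```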
{"title":"Quantile generalized measures of correlation","authors":"Xinyu Zhang, Hongwei Shi, Niwen Zhou, Falong Tan, Xu Guo","doi":"10.1007/s11222-024-10414-8","DOIUrl":"https://doi.org/10.1007/s11222-024-10414-8","url":null,"abstract":"<p>In this paper, we introduce a quantile Generalized Measure of Correlation (GMC) to describe nonlinear quantile relationship between response variable and predictors. The introduced correlation takes values between zero and one. It is zero if and only if the conditional quantile function is equal to the unconditional quantile. We also introduce a quantile partial Generalized Measure of Correlation. Estimators of these correlations are developed. Notably by adopting machine learning methods, our estimation procedures allow the dimension of predictors very large. Under mild conditions, we establish the estimators’ consistency. For construction of confidence interval, we adopt sample splitting and show that the corresponding estimators are asymptotic normal. We also consider composite quantile GMC by integrating information from different quantile levels. Numerical studies are conducted to illustrate our methods. Moreover, we apply our methods to analyze genome-wide association study data from Carworth Farms White mice.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140116788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}