
Computational Statistics & Data Analysis: Latest Articles

Block-wise primal-dual algorithms for large-scale doubly penalized ANOVA modeling
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-02-15 DOI: 10.1016/j.csda.2024.107932
Penghui Fu , Zhiqiang Tan

For multivariate nonparametric regression, doubly penalized ANOVA modeling (DPAM) has recently been proposed, using hierarchical total variations (HTVs) and empirical norms as penalties on the component functions such as main effects and multi-way interactions in a functional ANOVA decomposition of the underlying regression function. The two penalties play complementary roles: the HTV penalty promotes sparsity in the selection of basis functions within each component function, whereas the empirical-norm penalty promotes sparsity in the selection of component functions. To facilitate large-scale training of DPAM using backfitting or block minimization, two suitable primal-dual algorithms are developed, including both batch and stochastic versions, for updating each component function in single-block optimization. Existing applications of primal-dual algorithms are intractable for DPAM with both HTV and empirical-norm penalties. The validity and advantage of the stochastic primal-dual algorithms are demonstrated through extensive numerical experiments, compared with their batch versions and a previous active-set algorithm, in large-scale scenarios.

Citations: 0
Flexible regularized estimation in high-dimensional mixed membership models
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-02-09 DOI: 10.1016/j.csda.2024.107931
Nicholas Marco , Damla Şentürk , Shafali Jeste , Charlotte C. DiStefano , Abigail Dickinson , Donatello Telesca

Mixed membership models are an extension of finite mixture models, where each observation can partially belong to more than one mixture component. A probabilistic framework for mixed membership models of high-dimensional continuous data is proposed with a focus on scalability and interpretability. The novel probabilistic representation of mixed membership is based on convex combinations of dependent multivariate Gaussian random vectors. In this setting, scalability is ensured through approximations of a tensor covariance structure through multivariate eigen-approximations with adaptive regularization imposed through shrinkage priors. Conditional weak posterior consistency is established on an unconstrained model, allowing for a simple posterior sampling scheme while keeping many of the desired theoretical properties of our model. The model is motivated by two biomedical case studies: a case study on functional brain imaging of children with autism spectrum disorder (ASD) and a case study on gene expression data from breast cancer tissue. These applications highlight how the typical assumption made in cluster analysis, that each observation comes from one homogeneous subgroup, may often be restrictive in several applications, leading to unnatural interpretations of data features.

Citations: 0
Parameter estimation and random number generation for student Lévy processes
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-02-08 DOI: 10.1016/j.csda.2024.107933
Shuaiyu Li , Yunpei Wu , Yuzhong Cheng

To address the challenges in estimating parameters of the widely applied Student-Lévy process, the study introduces two distinct methods: a likelihood-based approach and a data-driven approach. A two-step quasi-likelihood-based method is initially proposed, countering the non-closed nature of the Student-Lévy process's distribution function under convolution. This method utilizes the limiting properties observed in high-frequency data, offering estimations via a quasi-likelihood function characterized by asymptotic normality. Additionally, a novel neural-network-based parameter estimation technique is advanced, independent of high-frequency observation assumptions. Utilizing a CNN-LSTM framework, this method effectively processes sparse, local jump-related data, extracts deep features, and maps these to the parameter space using a fully connected neural network. This innovative approach ensures minimal assumption reliance, end-to-end processing, and high scalability, marking a significant advancement in parameter estimation techniques. The efficacy of both methods is substantiated through comprehensive numerical experiments, demonstrating their robust performance in diverse scenarios.

Citations: 0
Robust heavy-tailed versions of generalized linear models with applications in actuarial science
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-02-01 DOI: 10.1016/j.csda.2024.107920
Philippe Gagnon, Yuxi Wang

Generalized linear models (GLMs) form one of the most popular classes of models in statistics. The gamma variant is used, for instance, in actuarial science for the modelling of claim amounts in insurance. A flaw of GLMs is that they are not robust against outliers (i.e., against erroneous or extreme data points). A difference in trends in the bulk of the data and the outliers thus yields skewed inference and predictions. To address this problem, robust methods have been introduced. The most commonly applied robust method is frequentist and consists in an estimator which is derived from a modification of the derivative of the log-likelihood. The objective is to propose an alternative approach which is modelling-based and thus fundamentally different. Such an approach allows for an understanding and interpretation of the modelling, and it can be applied for both frequentist and Bayesian statistical analyses. The proposed approach possesses appealing theoretical and empirical properties.

Citations: 0
Goodness-of-fit test for point processes first-order intensity
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-02-01 DOI: 10.1016/j.csda.2024.107929
M.I. Borrajo , W. González-Manteiga , M.D. Martínez-Miranda

Modelling the first-order intensity function is one of the main aims in point process theory. An appropriate model describes the first-order intensity as a nonparametric function of spatial covariates. A formal testing procedure is presented to assess the goodness-of-fit of this model, assuming an inhomogeneous Poisson point process. The test is based on a quadratic distance between two kernel intensity estimators. The asymptotic normality of the test statistic is proved and a bootstrap procedure to approximate its distribution is suggested. The proposal is illustrated with two applications to real data sets, and an extensive simulation study to evaluate its finite-sample performance.
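
The two ingredients named in this abstract, a kernel intensity estimator and a quadratic distance between two intensity estimates, can be sketched as follows. This is only an illustration under simplifying assumptions: a one-dimensional point pattern, a Gaussian kernel, no edge correction, and a Riemann-sum integral. The function names and the bandwidth values are hypothetical and not taken from the paper, whose estimators depend on spatial covariates.

```python
import math


def kernel_intensity(points, x, bandwidth):
    """Gaussian-kernel estimate of the first-order intensity at location x.

    Assumption: 1-D locations and no edge correction (illustrative only).
    """
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def quadratic_distance(f, g, grid):
    """Riemann-sum approximation of the integral of (f - g)^2 over an evenly spaced grid."""
    step = grid[1] - grid[0]
    return sum((f(x) - g(x)) ** 2 for x in grid) * step


# Usage: compare two estimates of the same pattern obtained with different bandwidths.
pattern = [0.2, 0.5, 0.9]
grid = [0.01 * i for i in range(101)]
distance = quadratic_distance(
    lambda x: kernel_intensity(pattern, x, 0.05),
    lambda x: kernel_intensity(pattern, x, 0.20),
    grid,
)
```

An intensity estimate of this form integrates to the number of observed points over the whole line, which is a quick sanity check on any implementation.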

Citations: 0
Heterogeneous quantile regression for longitudinal data with subgroup structures
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-01-29 DOI: 10.1016/j.csda.2024.107928
Zhaohan Hou, Lei Wang

Subgroup analysis for modeling longitudinal data with heterogeneity across all individuals has drawn attention in modern statistical learning. In this paper, we focus on the heterogeneous quantile regression model and propose to achieve variable selection, heterogeneous subgrouping and parameter estimation simultaneously, by using smoothed generalized estimating equations in conjunction with a multi-directional separation penalty. The proposed method allows individuals to be divided into multiple subgroups for different heterogeneous covariates, such that estimation efficiency can be gained by incorporating the individual correlation structure and sharing information within subgroups. A data-driven procedure based on a modified BIC is applied to estimate the number of subgroups. Theoretical properties of the oracle estimator given the underlying true subpopulation information are first provided, and it is then shown that the proposed estimator is equivalent to the oracle estimator under some conditions. The finite-sample performance of the proposed estimators is studied through simulations, and an application to an AIDS dataset is also presented.

Citations: 0
A unified framework of analyzing missing data and variable selection using regularized likelihood
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-01-26 DOI: 10.1016/j.csda.2024.107919
Yuan Bian , Grace Y. Yi , Wenqing He

Missing data arise commonly in applications, and research on this topic has received extensive attention in the past few decades. Various inference methods have been developed under different missing data mechanisms, including missing at random and missing not at random. The assessment of a feasible missing data mechanism is, however, difficult due to the lack of validation data. The problem is further complicated by the presence of spurious variables in covariates. Focusing on missingness in the response variable, a unified modeling scheme is proposed by utilizing the parametric generalized additive model to characterize various types of missing data processes. Taking the generalized linear model to describe the dependence of the response on the associated covariates, concurrent estimation and variable selection procedures are developed using regularized likelihood, and the asymptotic properties of the resultant estimators are rigorously established. The proposed methods are appealing in their flexibility and generality; they circumvent the need to assume a particular missing data mechanism, as required by most available methods. Empirical studies demonstrate that the proposed methods result in satisfactory performance in finite sample settings. Extensions to accommodate missingness in both the response and covariates are also discussed.

Citations: 0
A simple approach for local and global variable importance in nonlinear regression models
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-01-22 DOI: 10.1016/j.csda.2023.107914
Emily T. Winn-Nuñez , Maryclare Griffin , Lorin Crawford

The ability to interpret machine learning models has become increasingly important as their usage in data science continues to rise. Most current interpretability methods are optimized to work on either (i) a global scale, where the goal is to rank features based on their contributions to overall variation in an observed population, or (ii) the local level, which aims to detail how important a feature is to a particular individual in the data set. In this work, a new operator called the “GlObal And Local Score” (GOALS) is proposed: a simple post hoc approach to simultaneously assess local and global feature variable importance in nonlinear models. Motivated by problems in biomedicine, the approach is demonstrated using Gaussian process regression, where the task of understanding how genetic markers are associated with disease progression both within individuals and across populations is of high interest. Detailed simulations and real data analyses illustrate the flexible and efficient utility of GOALS over state-of-the-art variable importance strategies.

Citations: 0
Post-clustering difference testing: Valid inference and practical considerations with applications to ecological and biological data
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-01-17 DOI: 10.1016/j.csda.2023.107916
Benjamin Hivert , Denis Agniel , Rodolphe Thiébaut , Boris P. Hejblum

Clustering belongs to the family of unsupervised analysis methods that group samples into homogeneous and separate subgroups of observations, also called clusters. To interpret the clusters, statistical hypothesis testing is often used to infer the variables that significantly separate the estimated clusters from each other. However, the hypotheses used in this inference are data-driven, because they are derived from the clustering results. This double use of the data causes traditional hypothesis tests to fail to control the Type I error rate, particularly because of uncertainty in the clustering process and the potential artificial differences it could create. Three novel statistical hypothesis tests are introduced, each designed to account for the clustering process. These tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations. The proposed tests were applied in two distinct contexts, animal ecology and immunology, demonstrating the relevance of the results with real datasets.
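
The failure described in this abstract, where clustering and then testing on the same data inflates the Type I error rate, can be reproduced with a small simulation. The sketch below is not one of the three proposed tests. It assumes a single standard normal variable with no true subgroups, uses a split at the sample median as a stand-in for a clustering algorithm, and applies a Welch t statistic with the usual 1.96 cutoff. All names are illustrative.

```python
import random
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic for the difference in means of a and b."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (var_a / len(a) + var_b / len(b)) ** 0.5


def naive_rejection_rate(n_reps=200, n=100, seed=0):
    """Share of null datasets in which the naive post-clustering test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        # One homogeneous population: there is no true cluster structure.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Assumption: a median split stands in for the clustering step.
        cut = statistics.median(x)
        low = [v for v in x if v <= cut]
        high = [v for v in x if v > cut]
        # Naive test that ignores how the two groups were formed.
        if abs(welch_t(low, high)) > 1.96:
            rejections += 1
    return rejections / n_reps


rate = naive_rejection_rate()
```

Although the data contain no real groups, the naive test rejects in far more than the nominal 5% of replicates, because the groups were constructed to differ on the very variable being tested.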

Citations: 0
Integrated subgroup identification from multi-source data
IF 1.8 Zone 3 (Mathematics) Q1 Mathematics Pub Date: 2024-01-11 DOI: 10.1016/j.csda.2024.107918
Lihui Shao, Jiaqi Wu, Weiping Zhang, Yu Chen

Subgroup identification is crucial in dealing with heterogeneous populations and has wide applications in areas such as clinical trials and market segmentation. With the prevalence of multi-source data, there is a practical need to identify subgroups based on multi-source data. This paper proposes a working-independence pseudo-loglikelihood and integrates the parameters of each source into a pairwise fusion penalty for simultaneous parameter estimation and subgroup identification. To implement the proposed method, an alternating direction method of multipliers (ADMM) algorithm is derived. Furthermore, the weak oracle properties of the parameter estimators are established, showing that the latent subgroups can be consistently identified. Finally, numerical simulations and an analysis of a randomized trial on reduced nicotine standards for cigarettes are conducted to evaluate the performance of the proposed method.
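
For readers unfamiliar with pairwise fusion penalties, the generic form of such an objective can be sketched as follows. This is a textbook-style illustration, not the paper's exact objective: the symbols $\theta_i$ (subject-specific parameters), $\ell_i$ (a per-subject log-likelihood contribution), $p_\lambda$ (a sparsity-inducing penalty with tuning parameter $\lambda$), $\delta_{ij}$, and $\rho$ are assumptions introduced here, and the multi-source structure of the paper is omitted.

```latex
% Generic pairwise-fusion objective (illustrative; notation is assumed)
\min_{\theta_1,\dots,\theta_n}\;
  -\sum_{i=1}^{n} \ell_i(\theta_i)
  \;+\; \sum_{1 \le i < j \le n} p_\lambda\!\left(\lVert \theta_i - \theta_j \rVert\right)

% ADMM-style splitting: introduce \delta_{ij} = \theta_i - \theta_j as constraints
\min_{\theta,\,\delta}\;
  -\sum_{i=1}^{n} \ell_i(\theta_i)
  \;+\; \sum_{i<j} p_\lambda\!\left(\lVert \delta_{ij} \rVert\right)
\quad \text{subject to} \quad \delta_{ij} = \theta_i - \theta_j ,\; i<j

% Augmented Lagrangian with multipliers \nu_{ij} and step parameter \rho > 0
L_\rho(\theta,\delta,\nu) =
  -\sum_{i=1}^{n} \ell_i(\theta_i)
  + \sum_{i<j} p_\lambda\!\left(\lVert \delta_{ij} \rVert\right)
  + \sum_{i<j} \nu_{ij}^{\top}\!\left(\theta_i - \theta_j - \delta_{ij}\right)
  + \frac{\rho}{2} \sum_{i<j} \lVert \theta_i - \theta_j - \delta_{ij} \rVert^2
```

In formulations of this kind, pairs with $\hat\delta_{ij} = 0$ share a common parameter value, so the estimated subgroups are read off from which differences are shrunk exactly to zero.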

Citations: 0