
Computational Statistics & Data Analysis — Latest Publications

Variable selection of Kolmogorov-Smirnov maximization with a penalized surrogate loss
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-03-13 · DOI: 10.1016/j.csda.2024.107944
Xiefang Lin, Fang Fang

The Kolmogorov-Smirnov (KS) statistic is quite popular in many areas as the major performance evaluation criterion for binary classification due to its explicit business intention. Fang and Chen (2019) proposed a novel DMKS method that directly maximizes the KS statistic and compares favorably with popular existing methods. However, DMKS did not consider the critical problem of variable selection, since the special form of KS makes it very challenging to establish the asymptotic distribution of the DMKS estimator, which is most likely nonstandard. This intractable issue is handled by introducing a surrogate loss function, which leads to a √n-consistent estimator of the true parameter up to a multiplicative scalar. A nonconcave penalty function is then combined to achieve variable selection consistency and asymptotic normality with the oracle property. Results of empirical studies confirm the theoretical results and show the advantages of the proposed SKS (Surrogated Kolmogorov-Smirnov) method compared to the original DMKS method without variable selection.
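As a concrete illustration of the KS criterion described above, the sketch below computes the empirical KS statistic of a linear score on simulated data, together with a sigmoid-smoothed version of the objective. The data-generating model, smoothing bandwidth, and the particular smooth approximation are assumptions for illustration only; the paper's surrogate loss and SKS penalty are not reproduced here.

```python
import numpy as np

def ks_statistic(scores, labels):
    """Empirical KS: largest gap between the score CDFs of the two classes,
    i.e. max over thresholds of |TPR - FPR|."""
    order = np.argsort(scores)
    y = labels[order].astype(float)
    n1, n0 = y.sum(), len(y) - y.sum()
    tpr = np.cumsum(y[::-1])[::-1] / n1          # P(score >= threshold | y = 1)
    fpr = np.cumsum((1.0 - y)[::-1])[::-1] / n0  # P(score >= threshold | y = 0)
    return np.max(np.abs(tpr - fpr))

def smoothed_ks(beta, X, y, thresholds, h=0.1):
    """A generic sigmoid-smoothed KS objective in beta (not necessarily the paper's surrogate)."""
    s = X @ beta
    soft = 1.0 / (1.0 + np.exp(-(s[:, None] - thresholds[None, :]) / h))  # smooth 1{s >= t}
    tpr = soft[y == 1].mean(axis=0)
    fpr = soft[y == 0].mean(axis=0)
    return np.max(np.abs(tpr - fpr))

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))
y = (X @ np.array([1.5, -1.0, 0.0, 0.0]) + rng.normal(size=n) > 0).astype(int)
beta = np.array([1.0, -0.7, 0.1, 0.0])
print(ks_statistic(X @ beta, y), smoothed_ks(beta, X, y, np.linspace(-4, 4, 81)))
```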

Citations: 0
Pairwise share ratio interpretations of compositional regression models
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-03-05 · DOI: 10.1016/j.csda.2024.107945
Lukas Dargel, Christine Thomas-Agnan

The interpretation of regression models with compositional vectors as response and/or explanatory variables has been approached from different perspectives. The initial approaches are performed in coordinate space subsequent to applying a log-ratio transformation to the compositional vectors. Given that these models exhibit non-linearity concerning classical operations within real space, an alternative approach has been proposed. This approach relies on infinitesimal increments or derivatives, interpreted within a simplex framework. Consequently, it offers interpretations of elasticities or semi-elasticities in the original space of shares which are independent of any log-ratio transformations. Some functions of these elasticities or semi-elasticities turn out to be constant throughout the sample observations, making them natural parameters for interpreting CoDa models. These parameters are linked to relative variations of pairwise share ratios of the response and/or of the explanatory variables. Approximations of share ratio variations are derived and linked to these natural parameters. A real dataset on the French presidential election is utilized to illustrate each type of interpretation in detail.
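For readers unfamiliar with compositional data, the short sketch below builds the two objects the abstract refers to: a centered log-ratio (CLR) coordinate representation and the pairwise share ratios whose relative variations carry the interpretation. The Dirichlet-simulated three-part compositions are a stand-in for real data such as vote shares; the elasticity and semi-elasticity formulas themselves are not reproduced.

```python
import numpy as np

def clr(shares):
    """Centered log-ratio transform of compositions (rows sum to 1, strictly positive)."""
    log_s = np.log(shares)
    return log_s - log_s.mean(axis=1, keepdims=True)

def pairwise_log_ratios(shares):
    """All pairwise log share ratios log(s_i / s_j), i < j, per observation."""
    p = shares.shape[1]
    idx = [(i, j) for i in range(p) for j in range(i + 1, p)]
    ratios = np.column_stack([np.log(shares[:, i] / shares[:, j]) for i, j in idx])
    return ratios, idx

rng = np.random.default_rng(0)
shares = rng.dirichlet(np.ones(3), size=5)   # toy 3-part compositions (e.g. vote shares)
Z = clr(shares)                              # coordinate-space representation used by log-ratio models
R, idx = pairwise_log_ratios(shares)         # the ratios whose relative variations the paper interprets
print(Z)
print(R, idx)
```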

Citations: 0
Factor selection in screening experiments by aggregation over random models
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-24 · DOI: 10.1016/j.csda.2024.107940
Rakhi Singh, John Stufken

Screening experiments are useful for identifying a small number of truly important factors from a large number of potentially important factors. The Gauss-Dantzig Selector (GDS) is often the preferred analysis method for screening experiments. Just considering main-effects models can result in erroneous conclusions, but including interaction terms, even if restricted to two-factor interactions, increases the number of model terms dramatically and challenges the GDS analysis. A new analysis method, called Gauss-Dantzig Selector Aggregation over Random Models (GDS-ARM), which performs a GDS analysis on multiple models that include only some randomly selected interactions, is proposed. Results from these different analyses are then aggregated to identify the important factors. The proposed method is discussed, the appropriate choices for the tuning parameters are suggested, and the performance of the method is studied on real and simulated data.
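A minimal sketch of the aggregation idea is given below, with the Lasso standing in for the Gauss-Dantzig Selector (which is not implemented here) and with the number of random models, interactions per model, and penalty level chosen arbitrarily for illustration: each random model contains all main effects plus a random subset of two-factor interactions, and factor selection frequencies are aggregated across models.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, k = 20, 10                                   # toy screening design: 20 runs, 10 two-level factors
X = rng.choice([-1.0, 1.0], size=(n, k))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 1.2 * X[:, 0] * X[:, 3] + rng.normal(scale=0.5, size=n)

pairs = list(combinations(range(k), 2))
n_models, n_int = 200, 8                        # number of random models / interactions per model (assumed)
freq = np.zeros(k)
for _ in range(n_models):
    chosen = [pairs[i] for i in rng.choice(len(pairs), size=n_int, replace=False)]
    Z = np.column_stack([X] + [X[:, a] * X[:, b] for a, b in chosen])
    coef = Lasso(alpha=0.1).fit(Z, y).coef_     # Lasso stands in for the Gauss-Dantzig Selector
    active = set(np.flatnonzero(np.abs(coef[:k]) > 1e-8))
    for (a, b), c in zip(chosen, coef[k:]):
        if abs(c) > 1e-8:                       # a factor also counts as active via its interactions
            active.update((a, b))
    freq[sorted(active)] += 1.0
print(freq / n_models)                          # aggregated selection frequencies; large values flag factors 0 and 3
```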

Citations: 0
Sequential estimation for mixture of regression models for heterogeneous population
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-23 · DOI: 10.1016/j.csda.2024.107942
Na You, Hongsheng Dai, Xueqin Wang, Qingyun Yu

Heterogeneity among patients commonly exists in clinical studies and leads to challenges in medical research. It is widely accepted that there exist various sub-types in the population and they are distinct from each other. The approach of identifying the sub-types and thus tailoring disease prevention and treatment is known as precision medicine. The mixture model is a classical statistical model to cluster the heterogeneous population into homogeneous sub-populations. However, for the highly heterogeneous population with multiple components, its parameter estimation and clustering results may be ambiguous due to the dependence of the EM algorithm on the initial values. For sub-typing purposes, the finite mixture of regression models with concomitant variables is considered and a novel statistical method is proposed to identify the main components with large proportions in the mixture sequentially. Compared to existing typical statistical inferences, the new method not only requires no pre-specification on the number of components for model fitting, but also provides more reliable parameter estimation and clustering results. Simulation studies demonstrated the superiority of the proposed method. Real data analysis on the drug response prediction illustrated its reliability in the parameter estimation and capability to identify the important subgroup.
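For context, the sketch below implements the classical EM algorithm for a two-component mixture of linear regressions, the baseline whose sensitivity to initial values motivates the proposed sequential procedure (which is not reproduced here); the data-generating model, starting values, and iteration count are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-2, 2, n)
z = rng.random(n) < 0.6                          # latent component labels
y = np.where(z, 1.0 + 2.0 * x, -1.0 - 1.5 * x) + rng.normal(scale=0.4, size=n)
X = np.column_stack([np.ones(n), x])

# initial values; the ambiguity mentioned in the abstract stems from this choice
beta = np.array([[0.5, 1.0], [-0.5, -1.0]])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])
for _ in range(100):
    # E-step: responsibilities of each component for each observation
    dens = np.column_stack([pi[j] * norm.pdf(y, X @ beta[j], sigma[j]) for j in range(2)])
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted least squares per component, then update scales and mixing weights
    for j in range(2):
        w = r[:, j]
        beta[j] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        resid = y - X @ beta[j]
        sigma[j] = np.sqrt((w * resid ** 2).sum() / w.sum())
    pi = r.mean(axis=0)
print(beta, sigma, pi)
```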

Citations: 0
Inference on order restricted means of inverse Gaussian populations under heteroscedasticity
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-23 · DOI: 10.1016/j.csda.2024.107943
Anjana Mondal, Somesh Kumar

The hypothesis testing problem of homogeneity of k (≥ 2) inverse Gaussian means against ordered alternatives is studied when nuisance or scale-like parameters are unknown and unequal. The maximum likelihood estimators (MLEs) of the means and scale-like parameters are obtained when the means satisfy some simple order restrictions and the scale-like parameters are unknown and unequal, and an iterative algorithm is proposed for finding these estimators. It is proved that, under a specific condition, the proposed algorithm converges uniquely to the true MLEs. A likelihood ratio test and two simultaneous tests are proposed. Further, an algorithm for finding the MLEs of the parameters is given when the means are equal but unknown. Using these estimators, the likelihood ratio test against ordered alternative means is developed, and an asymptotic likelihood ratio test is obtained from its asymptotic distribution. Since the asymptotic test does not perform well for small samples, a parametric bootstrap likelihood ratio test (PB LRT) is proposed, and the asymptotic validity of the bootstrap procedure is shown. Using the Box-type approximation method, test statistics are developed for the two-sample problem of equality of means when the scale-like parameters are heterogeneous. Based on these, two PB-based heuristic tests are proposed; their asymptotic null distributions are derived and the accuracy of the PB calibration is established. Two asymptotic tests are also proposed using the asymptotic null distributions. To obtain the critical points and test statistics of the three PB tests and two asymptotic tests, an 'R' package is developed and shared on GitHub. Applications of the tests are illustrated using real data.
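To make the parametric-bootstrap idea concrete, the sketch below applies the generic PB recipe to a two-sample test of equal inverse Gaussian means with unequal scale-like parameters, using a simple Wald-type statistic and a crude inverse-variance-weighted plug-in for the common mean under the null. This is not the paper's restricted MLE, LRT, or Box-type statistic; sample sizes, parameter values, and the number of bootstrap replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def ig_mle(x):
    """Inverse Gaussian MLEs: mu_hat = xbar, lambda_hat = n / sum(1/x_i - 1/xbar)."""
    mu = x.mean()
    lam = len(x) / np.sum(1.0 / x - 1.0 / mu)
    return mu, lam

def wald_stat(x1, x2):
    """Standardized difference of means; Var of IG(mu, lam) is mu^3 / lam."""
    m1, l1 = ig_mle(x1)
    m2, l2 = ig_mle(x2)
    v = m1 ** 3 / (l1 * len(x1)) + m2 ** 3 / (l2 * len(x2))
    return (m1 - m2) / np.sqrt(v)

x1 = rng.wald(2.0, 3.0, size=30)
x2 = rng.wald(2.0, 1.0, size=40)                 # same mean, different scale-like parameter
t_obs = wald_stat(x1, x2)

# crude plug-in for the common mean under H0 (NOT the exact restricted MLE of the paper)
m1, l1 = ig_mle(x1)
m2, l2 = ig_mle(x2)
w1, w2 = len(x1) * l1 / m1 ** 3, len(x2) * l2 / m2 ** 3
mu0 = (w1 * m1 + w2 * m2) / (w1 + w2)

B = 2000
t_boot = np.array([wald_stat(rng.wald(mu0, l1, size=len(x1)),
                             rng.wald(mu0, l2, size=len(x2))) for _ in range(B)])
print(np.mean(np.abs(t_boot) >= abs(t_obs)))     # bootstrap p-value
```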

Citations: 0
A stochastic process representation for time warping functions
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-20 · DOI: 10.1016/j.csda.2024.107941
Yijia Ma, Xinyu Zhou, Wei Wu

The time warping function provides a mathematical representation to measure phase variability in functional data. Recent studies have developed various approaches to estimate optimal warping between functions. However, a principled, linear, generative representation of time warping functions is still under-explored. This is highly challenging because the warping functions are non-linear in the conventional L² space. To address this problem, a new linear warping space is defined and a stochastic process representation is proposed to characterize time warping functions. The key is to define an inner-product structure on the time warping space, followed by a transformation which maps the warping functions into a sub-space of the L² space. With certain constraints on the warping functions, this transformation is an isometric isomorphism. In the transformed space, the L² basis in the Hilbert space is adopted for representation, which can easily be utilized to generate time warping functions by using different types of stochastic processes. The effectiveness of this representation is demonstrated through its use as a new penalty in penalized function registration, accompanied by an efficient gradient method to minimize the cost function. The new penalized method is illustrated through simulations that properly characterize nonuniform and correlated constraints in the time domain. Furthermore, this representation is utilized to develop a boxplot for warping functions, which can estimate templates and identify warping outliers. Finally, this representation is applied to a Covid-19 dataset to construct boxplots and identify states with outlying growth patterns.
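The sketch below shows one common way to turn draws of a stochastic process into valid time warping functions: sample coefficients on a finite L² (Fourier-type) basis, exponentiate the resulting function, and normalize its cumulative integral so the output is increasing with γ(0) = 0 and γ(1) = 1. The basis, variance decay, and this particular monotonizing transform are illustrative assumptions and not the paper's isometric isomorphism.

```python
import numpy as np

def random_warping(n_grid=201, n_basis=5, sigma=1.0, seed=0):
    """Draw a random warping gamma: [0, 1] -> [0, 1], increasing, with gamma(0)=0, gamma(1)=1."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid)
    # a random function built from coefficients on a Fourier-type basis
    f = np.zeros(n_grid)
    for k in range(1, n_basis + 1):
        a, b = rng.normal(scale=sigma / k, size=2)
        f += a * np.sin(2 * np.pi * k * t) + b * np.cos(2 * np.pi * k * t)
    # map to an increasing function via a normalized cumulative integral of exp(f)
    g = np.exp(f)
    gamma = np.concatenate(([0.0], np.cumsum((g[1:] + g[:-1]) / 2) * np.diff(t)))
    return t, gamma / gamma[-1]

t, gamma = random_warping()
print(gamma[0], gamma[-1], np.all(np.diff(gamma) > 0))   # 0.0, 1.0, True
```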

Citations: 0
Block-wise primal-dual algorithms for large-scale doubly penalized ANOVA modeling
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-15 · DOI: 10.1016/j.csda.2024.107932
Penghui Fu, Zhiqiang Tan

For multivariate nonparametric regression, doubly penalized ANOVA modeling (DPAM) has recently been proposed, using hierarchical total variations (HTVs) and empirical norms as penalties on the component functions such as main effects and multi-way interactions in a functional ANOVA decomposition of the underlying regression function. The two penalties play complementary roles: the HTV penalty promotes sparsity in the selection of basis functions within each component function, whereas the empirical-norm penalty promotes sparsity in the selection of component functions. To facilitate large-scale training of DPAM using backfitting or block minimization, two suitable primal-dual algorithms are developed, including both batch and stochastic versions, for updating each component function in single-block optimization. Existing applications of primal-dual algorithms are intractable for DPAM with both HTV and empirical-norm penalties. The validity and advantage of the stochastic primal-dual algorithms are demonstrated through extensive numerical experiments, compared with their batch versions and a previous active-set algorithm, in large-scale scenarios.
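As a self-contained illustration of the primal-dual machinery (though not of the block-wise DPAM algorithm itself), the sketch below applies a standard Chambolle-Pock-type primal-dual iteration to a one-dimensional total-variation-penalized least-squares problem; the penalty level, step sizes, and iteration count are arbitrary choices.

```python
import numpy as np

def tv_denoise_primal_dual(b, lam=1.0, n_iter=500):
    """Solve min_x 0.5*||x - b||^2 + lam*||Dx||_1 (D = forward differences)
    with a Chambolle-Pock primal-dual iteration."""
    n = len(b)
    x = b.copy()
    x_bar = x.copy()
    y = np.zeros(n - 1)                 # dual variable attached to Dx
    tau = sigma = 0.99 / 2.0            # ensures tau * sigma * ||D||^2 < 1, since ||D||^2 <= 4
    for _ in range(n_iter):
        # dual step: gradient ascent, then projection onto the l_inf ball of radius lam
        y = np.clip(y + sigma * np.diff(x_bar), -lam, lam)
        # primal step: prox of 0.5*||x - b||^2 evaluated at x - tau * D^T y
        dty = np.zeros(n)
        dty[:-1] -= y
        dty[1:] += y
        x_new = (x - tau * dty + tau * b) / (1.0 + tau)
        x_bar = 2.0 * x_new - x         # extrapolation step
        x = x_new
    return x

rng = np.random.default_rng(0)
signal = np.repeat([0.0, 2.0, -1.0], 50)
noisy = signal + rng.normal(scale=0.4, size=signal.size)
denoised = tv_denoise_primal_dual(noisy, lam=2.0)
```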

Citations: 0
Flexible regularized estimation in high-dimensional mixed membership models
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-09 · DOI: 10.1016/j.csda.2024.107931
Nicholas Marco, Damla Şentürk, Shafali Jeste, Charlotte C. DiStefano, Abigail Dickinson, Donatello Telesca

Mixed membership models are an extension of finite mixture models, where each observation can partially belong to more than one mixture component. A probabilistic framework for mixed membership models of high-dimensional continuous data is proposed with a focus on scalability and interpretability. The novel probabilistic representation of mixed membership is based on convex combinations of dependent multivariate Gaussian random vectors. In this setting, scalability is ensured through approximations of a tensor covariance structure through multivariate eigen-approximations with adaptive regularization imposed through shrinkage priors. Conditional weak posterior consistency is established on an unconstrained model, allowing for a simple posterior sampling scheme while keeping many of the desired theoretical properties of our model. The model is motivated by two biomedical case studies: a case study on functional brain imaging of children with autism spectrum disorder (ASD) and a case study on gene expression data from breast cancer tissue. These applications highlight how the typical assumption made in cluster analysis, that each observation comes from one homogeneous subgroup, may often be restrictive in several applications, leading to unnatural interpretations of data features.
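The generative idea, in which each observation is a convex combination of dependent multivariate Gaussian component draws, can be sketched as below; the Dirichlet distribution for the membership weights, the shared covariance, and all dimensions are assumptions for illustration, and the paper's eigen-approximations and shrinkage priors are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
K, p, n = 3, 5, 200                               # components, dimension, sample size (assumed)
means = 3.0 * rng.normal(size=(K, p))
cov = 0.5 * np.eye(p)

# membership weights on the simplex (Dirichlet is an illustrative choice, not the paper's prior)
lam = rng.dirichlet(0.5 * np.ones(K), size=n)     # shape (n, K); rows sum to 1

# one draw from each Gaussian component per observation, then a convex combination of them
Z = rng.multivariate_normal(np.zeros(p), cov, size=(n, K)) + means   # shape (n, K, p)
X = np.einsum('nk,nkp->np', lam, Z)               # observations partially belonging to several components
print(X.shape)                                    # (200, 5)
```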

Citations: 0
Parameter estimation and random number generation for Student Lévy processes
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-08 · DOI: 10.1016/j.csda.2024.107933
Shuaiyu Li, Yunpei Wu, Yuzhong Cheng

To address the challenges in estimating parameters of the widely applied Student-Lévy process, the study introduces two distinct methods: a likelihood-based approach and a data-driven approach. A two-step quasi-likelihood-based method is initially proposed, countering the non-closed nature of the Student-Lévy process's distribution function under convolution. This method utilizes the limiting properties observed in high-frequency data, offering estimations via a quasi-likelihood function characterized by asymptotic normality. Additionally, a novel neural-network-based parameter estimation technique is advanced, independent of high-frequency observation assumptions. Utilizing a CNN-LSTM framework, this method effectively processes sparse, local jump-related data, extracts deep features, and maps these to the parameter space using a fully connected neural network. This innovative approach ensures minimal assumption reliance, end-to-end processing, and high scalability, marking a significant advancement in parameter estimation techniques. The efficacy of both methods is substantiated through comprehensive numerical experiments, demonstrating their robust performance in diverse scenarios.
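A plausible skeleton of the data-driven estimator described above is sketched in PyTorch below: convolutional layers extract local, jump-related features from an increment sequence, an LSTM summarizes them, and a fully connected head maps the summary to the parameter space. All layer sizes, the number of output parameters, and the use of the last hidden state are assumptions; the paper's exact architecture and training scheme are not specified in the abstract.

```python
import torch
import torch.nn as nn

class CNNLSTMEstimator(nn.Module):
    """Map an increment sequence of shape (batch, 1, T) to a parameter vector."""

    def __init__(self, n_params=3, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(                     # local feature extraction
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(                     # fully connected map to the parameter space
            nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, n_params)
        )

    def forward(self, x):                              # x: (batch, 1, T)
        h = self.conv(x)                               # (batch, 32, T)
        h, _ = self.lstm(h.transpose(1, 2))            # (batch, T, hidden)
        return self.head(h[:, -1, :])                  # summarize the path, output parameter estimates

model = CNNLSTMEstimator()
theta_hat = model(torch.randn(8, 1, 500))              # dummy batch of 8 simulated paths
print(theta_hat.shape)                                  # torch.Size([8, 3])
```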

Citations: 0
Robust heavy-tailed versions of generalized linear models with applications in actuarial science
IF 1.8 · CAS Tier 3 (Mathematics) · Q1 Mathematics · Pub Date: 2024-02-01 · DOI: 10.1016/j.csda.2024.107920
Philippe Gagnon, Yuxi Wang

Generalized linear models (GLMs) form one of the most popular classes of models in statistics. The gamma variant is used, for instance, in actuarial science for the modelling of claim amounts in insurance. A flaw of GLMs is that they are not robust against outliers (i.e., against erroneous or extreme data points). A difference in trends in the bulk of the data and the outliers thus yields skewed inference and predictions. To address this problem, robust methods have been introduced. The most commonly applied robust method is frequentist and consists in an estimator which is derived from a modification of the derivative of the log-likelihood. The objective is to propose an alternative approach which is modelling-based and thus fundamentally different. Such an approach allows for an understanding and interpretation of the modelling, and it can be applied for both frequentist and Bayesian statistical analyses. The proposed approach possesses appealing theoretical and empirical properties.
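To see the flaw the abstract targets, the sketch below fits a standard (non-robust) gamma GLM with a log link to simulated claim amounts and then contaminates a few observations; comparing the two fits shows how outliers skew the estimates. The simulation setup is arbitrary, statsmodels is used only as a convenient GLM implementation, and the proposed robust heavy-tailed variant is not implemented here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
mu = np.exp(X @ np.array([1.0, 0.4, -0.3]))      # true log-link mean structure
y = rng.gamma(shape=2.0, scale=mu / 2.0)         # gamma "claim amounts" with mean mu

clean_fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

y_contaminated = y.copy()
y_contaminated[:5] *= 50.0                       # a handful of gross outliers (erroneous or extreme claims)
dirty_fit = sm.GLM(y_contaminated, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(clean_fit.params)                          # close to the true coefficients [1.0, 0.4, -0.3]
print(dirty_fit.params)                          # pulled away by the outliers
```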

Citations: 0