Latest Articles in Computational Statistics & Data Analysis

Measure selection for functional linear model
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-08 | DOI: 10.1016/j.csda.2025.108270
Su I Iao, Hans-Georg Müller
Advancements in modern science have led to an increased prevalence of functional data, which are usually viewed as elements of the space of square-integrable functions L2. Core methods in functional data analysis, such as functional principal component analysis, are typically grounded in the Hilbert structure of L2 and rely on inner products based on integrals with respect to the Lebesgue measure over a fixed domain. A more flexible framework is proposed, where the measure can be arbitrary, allowing natural extensions to unbounded domains and prompting the question of optimal measure choice. Specifically, a novel functional linear model is introduced that incorporates a data-adaptive choice of the measure that defines the space, alongside an enhanced functional principal component analysis. Selecting a good measure can improve the model’s predictive performance, especially when the underlying processes are not well represented under the default Lebesgue measure. Simulations, as well as applications to COVID-19 data and the National Health and Nutrition Examination Survey data, show that the proposed approach consistently outperforms the conventional functional linear model.
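To make the role of the measure concrete, here is a minimal numerical sketch (not the authors' method): an inner product ⟨f, g⟩ = ∫ f g dμ computed on a grid with a user-chosen measure density. With an exponential weight, a constant function has finite norm on an unbounded domain, whereas under the Lebesgue measure it would not be square-integrable. The grid, weight, and functions are illustrative assumptions only.

```python
import math

def inner_product(f, g, weight, grid):
    """Trapezoidal approximation of <f, g> = integral of f*g d(mu),
    where mu has density `weight` with respect to Lebesgue measure."""
    vals = [f(t) * g(t) * weight(t) for t in grid]
    h = grid[1] - grid[0]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# A constant function on [0, inf) is not square-integrable under the
# Lebesgue measure, but it is under the exponential measure e^{-t} dt.
grid = [i * 0.01 for i in range(5001)]   # truncation of [0, inf) at 50
one = lambda t: 1.0
exp_weight = lambda t: math.exp(-t)

norm_sq = inner_product(one, one, exp_weight, grid)  # ~ integral of e^{-t}
```

Swapping `exp_weight` for another density is precisely what turns the measure into a tunable modeling choice.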
Citations: 0
Kernel density estimation with a Markov chain Monte Carlo sample
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-04 | DOI: 10.1016/j.csda.2025.108271
Hang J. Kim, Steven N. MacEachern, Young Min Kim, Yoonsuh Jung
Bayesian inference relies on the posterior distribution, which is often estimated with a Markov chain Monte Carlo sampler. The sampler produces a dependent stream of variates from the limiting distribution of the Markov chain, the posterior distribution. When one wishes to display the estimated posterior density, a natural choice is the histogram. However, abundant literature has shown that the kernel density estimator is more accurate than the histogram in terms of mean integrated squared error for an i.i.d. sample. With this as motivation, a kernel density estimation method is proposed that is appropriate for the dependence in the Markov chain Monte Carlo output. To account for the dependence, the cross-validation criterion used to select the bandwidth in standard kernel density estimation approaches is modified. A data-driven adjustment to the biased cross-validation method is suggested by introducing the integrated autocorrelation time of the kernel. The convergence of the modified bandwidth to the optimal bandwidth is shown by adapting theorems from the time series literature. Simulation studies show that the proposed method finds the bandwidth close to the optimal value, while standard methods lead to smaller bandwidths under Markov chain samples and hence to undersmoothed density estimates. A study with real data shows that the proposed method has a considerably smaller integrated mean squared error than standard methods. The R package KDEmcmc to implement the suggested algorithm is available on the Comprehensive R Archive Network.
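As a rough illustration of the bandwidth issue (not the estimator proposed in the paper), the sketch below generates a hypothetical AR(1) chain as a stand-in for MCMC output, estimates its integrated autocorrelation time τ, and inflates a Silverman-type bandwidth by replacing the sample size n with an effective size n/τ:

```python
import random, math

random.seed(7)

# Dependent sample: AR(1) chain x_t = 0.9 x_{t-1} + eps_t, a made-up
# stand-in for MCMC output (not the paper's examples).
n, phi = 5000, 0.9
x = [0.0]
for _ in range(n - 1):
    x.append(phi * x[-1] + random.gauss(0.0, 1.0))

mean = sum(x) / n
var = sum((v - mean) ** 2 for v in x) / n
sd = math.sqrt(var)

def autocorr(lag):
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / (n * var)

# Integrated autocorrelation time tau = 1 + 2 * sum of autocorrelations,
# truncated at the first non-positive term (a common heuristic).
tau = 1.0
for lag in range(1, 200):
    r = autocorr(lag)
    if r <= 0:
        break
    tau += 2.0 * r

n_eff = n / tau
h_naive = 1.06 * sd * n ** -0.2       # Silverman's rule with raw n
h_adj = 1.06 * sd * n_eff ** -0.2     # same rule with effective n
```

Because τ > 1 for a positively autocorrelated chain, the adjusted bandwidth is larger, counteracting the undersmoothing the abstract describes.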
Citations: 0
Overview of normal-reference tests for high-dimensional means with implementation in the R package ‘HDNRA’
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-09-03 | DOI: 10.1016/j.csda.2025.108269
Pengfei Wang, Tianming Zhu, Jin-Ting Zhang
The challenge of testing for equal mean vectors in high-dimensional data poses significant difficulties in statistical inference. Much of the existing literature introduces methods that often rely on stringent regularity conditions for the underlying covariance matrices, enabling asymptotic normality of test statistics. However, this can lead to complications in controlling test size. To address these issues, a new set of tests has emerged, leveraging the normal-reference approach to improve reliability. The latest normal-reference methods for testing equality of mean vectors in high-dimensional samples, potentially with differing covariance structures, are reviewed. The theoretical underpinnings of these tests are revisited, providing a new unified justification for the validity of centralized L2-norm-based normal-reference tests (NRTs) by deriving the convergence rate of the distance between the null distribution of the test statistic and its corresponding normal-reference distribution. To facilitate practical application, an R package, HDNRA, is introduced, implementing these NRTs and extending beyond the two-sample problem to accommodate general linear hypothesis testing (GLHT). The package, designed with user-friendliness in mind, achieves efficient computation through a core implemented in C++ using Rcpp, OpenMP, and RcppArmadillo. Examples with real datasets are included, showcasing the application of various tests and providing insights into their practical utility.
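The statistic these tests build on can be sketched as follows. This is a schematic centralized L2-norm-type two-sample statistic with made-up data, not the HDNRA implementation, and the normal-reference calibration of its null distribution is omitted:

```python
def l2_norm_statistic(xs, ys):
    """T = (n1*n2/(n1+n2)) * ||xbar - ybar||^2 for p-dimensional samples
    given as lists of equal-length observation vectors."""
    n1, n2, p = len(xs), len(ys), len(xs[0])
    xbar = [sum(row[j] for row in xs) / n1 for j in range(p)]
    ybar = [sum(row[j] for row in ys) / n2 for j in range(p)]
    dist_sq = sum((a - b) ** 2 for a, b in zip(xbar, ybar))
    return n1 * n2 / (n1 + n2) * dist_sq

# Toy data: dimension p = 4 exceeds the group sizes, the regime
# these high-dimensional tests target.
group1 = [[0.1, -0.2, 0.0, 0.3], [-0.1, 0.2, 0.1, 0.1], [0.0, 0.0, -0.1, 0.2]]
group2 = [[1.1, 0.8, 0.9, 1.2], [0.9, 1.2, 1.1, 0.8], [1.0, 1.0, 1.0, 1.0]]

t_far = l2_norm_statistic(group1, group2)   # well-separated means
t_same = l2_norm_statistic(group1, group1)  # identical groups -> 0
```

In a normal-reference test, this statistic would then be compared against a reference distribution matched to its first moments rather than relying on asymptotic normality.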
Citations: 0
High-dimensional subgroup functional quantile regression with panel and dependent data
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-08-27 | DOI: 10.1016/j.csda.2025.108268
Xiao-Ge Yu, Han-Ying Liang
High-dimensional additive functional partial linear single-index quantile regression with high-dimensional parameters under subgroup panel data is investigated. Using a spline-based approach, we construct oracle estimators of the unknown parameter and functions, and discuss their consistency, rates of convergence, and asymptotic normality under α-mixing assumptions. A penalized estimation method using the SCAD technique is introduced to estimate the additive functions and parameter, enabling variable selection and automatic identification of the number of groups. Hypothesis testing for the parameter is also considered, and the asymptotic distributions of the restricted estimators and the test statistic are derived under both the null and local alternative hypotheses. Simulation studies and real data analysis are conducted to verify the validity of the proposed methods and applications.
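The building block of any quantile regression is the check loss ρ_τ(u) = u(τ − 1{u<0}). A toy sketch (far simpler than the paper's penalized spline estimator) shows that minimizing it over a constant recovers the sample quantile:

```python
def check_loss(u, tau):
    """Quantile regression check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def fit_constant_quantile(data, tau, grid):
    """Minimize total check loss over candidate constants by grid search."""
    return min(grid, key=lambda c: sum(check_loss(x - c, tau) for x in data))

data = [1.0, 2.0, 3.0, 4.0, 100.0]        # toy sample with a heavy right tail
grid = [i * 0.1 for i in range(1001)]     # candidate constants in [0, 100]
median_fit = fit_constant_quantile(data, 0.5, grid)
```

For τ = 0.5 the minimizer is the sample median (3.0 here), illustrating why the check loss makes quantile regression robust to the outlying value 100.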
Citations: 0
Discovering causal structures in corrupted data: frugality in anchored Gaussian DAG models
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-08-18 | DOI: 10.1016/j.csda.2025.108267
Joonho Shin, Junhyoung Chung, Seyong Hwang, Gunwoong Park
This study focuses on the recovery of anchored Gaussian directed acyclic graphical (DAG) models to address the challenge of discovering causal or directed relationships among variables in datasets that are either intentionally masked or contaminated due to measurement errors. A main contribution is to relax the existing restrictive identifiability conditions for anchored Gaussian DAG models by introducing the anchored-frugality assumption. This assumption posits that the true graph is the most frugal among those satisfying the possible distributions of the latent and observed variables, thereby making the true Markov equivalence class (MEC) identifiable. The validity of the anchored-frugality assumption is justified using both graph theory and probability theory. Another main contribution is the development of the anchored-SP and frugal-PC algorithms. Specifically, the anchored-SP algorithm finds the most frugal graph among all possible graphs satisfying the Markov condition, while the frugal-PC algorithm finds the most frugal graph among a restricted subset of candidate graphs. Hence, the frugal-PC algorithm is more computationally feasible, while it requires an additional frugality-faithfulness assumption for soundness. Various simulations support the theoretical findings of this study and demonstrate the practical effectiveness of the proposed algorithm against state-of-the-art algorithms such as ACI, PC, and MMHC. Furthermore, the applications of the proposed algorithm to protein signaling data and breast cancer data illustrate its effectiveness in uncovering relationships among proteins and among cancer-related cell nuclei characteristics.
Citations: 0
Empirical likelihood based Bayesian variable selection
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-08-13 | DOI: 10.1016/j.csda.2025.108258
Yichen Cheng, Yichuan Zhao
Empirical likelihood is a popular nonparametric statistical tool that does not require any distributional assumptions. The possibility of conducting variable selection via Bayesian empirical likelihood is studied both theoretically and empirically. Theoretically, it is shown that when the prior distribution satisfies certain mild conditions, the corresponding Bayesian empirical likelihood estimators are posterior consistent and variable selection consistent. As special cases, the priors of Bayesian empirical likelihood LASSO and SCAD satisfy these conditions, and the resulting estimators can thus identify the non-zero elements of the parameters with probability approaching 1. In addition, it is easy to verify that those conditions are met for other widely used priors such as ridge, elastic net and adaptive LASSO. Empirical likelihood depends on a parameter that needs to be obtained by numerically solving a non-linear equation. Thus, there exists no conjugate prior for the posterior distribution, which causes slow convergence of the MCMC sampling algorithm in some cases. To solve this problem, an approximation distribution is used as the proposal to enhance the acceptance rate and, therefore, facilitate faster computation. The computational results demonstrate quick convergence for the examples used in the paper. Both simulations and real data analyses are performed to illustrate the advantages of the proposed methods.
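The non-linear equation mentioned above can be made concrete in a bare-bones sketch: profiling the empirical likelihood at a hypothesized mean μ requires solving Σᵢ dᵢ/(1 + λ dᵢ) = 0 for the Lagrange multiplier λ, where dᵢ = xᵢ − μ; the implied weights 1/(n(1 + λ dᵢ)) then sum to one and reproduce μ. Bisection and the data below are illustrative assumptions, not the authors' sampler:

```python
def el_weights(data, mu, tol=1e-12):
    """Empirical likelihood weights for the mean constraint sum(w_i x_i) = mu.
    Solves sum_i d_i / (1 + lam * d_i) = 0 by bisection, with d_i = x_i - mu."""
    d = [x - mu for x in data]
    n = len(d)
    # Feasibility requires 1 + lam * d_i > 0 for every i.
    lo = -1.0 / max(d) + 1e-9
    hi = -1.0 / min(d) - 1e-9

    def g(lam):
        return sum(di / (1.0 + lam * di) for di in d)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:        # g is strictly decreasing in lam
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [1.0 / (n * (1.0 + lam * di)) for di in d]

data = [0.3, 1.2, 2.5, 0.8, 1.9, 3.1]   # toy sample
w = el_weights(data, mu=1.0)            # tilt toward a hypothesized mean of 1
```

The resulting weights are positive, sum to one, and give the data a weighted mean of exactly the hypothesized μ, which is the profiling step a Bayesian empirical likelihood posterior evaluates at every MCMC draw.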
Citations: 0
Estimation of semiparametric probit model based on case-cohort interval-censored failure time data
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-08-10 | DOI: 10.1016/j.csda.2025.108266
Mingyue Du, Ricong Zeng
The estimation of the semiparametric probit model is discussed for the situation where one observes interval-censored failure time data arising from case-cohort studies. The probit model has recently attracted some attention for regression analysis of failure time data, partly due to the popularity of the normal distribution and its similarity to linear models. Although some methods have been developed in the literature for its estimation, no established approach seems to exist for case-cohort interval-censored data. To address this, a pseudo-maximum likelihood method is proposed and, furthermore, an EM algorithm is developed for its implementation. The resulting estimators of regression parameters are shown to be consistent and asymptotically normal. To assess the empirical performance of the proposed method, a simulation study is conducted and indicates that it works well in practical situations. In addition, it is applied to a set of real data arising from an AIDS clinical trial that motivated this study.
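As a stripped-down illustration of the EM idea for probit models (not the paper's algorithm, which handles case-cohort interval-censored data), consider an intercept-only probit with fully observed binary outcomes: the latent variables are normal, the E-step imputes their conditional means, and the fixed point satisfies Φ(β) = ȳ.

```python
import math

def phi(x):   # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_em(y, iters=200):
    """EM for the intercept-only probit model y_i = 1{z_i > 0},
    z_i ~ N(beta, 1). E-step imputes E[z_i | y_i]; M-step averages."""
    n, beta = len(y), 0.0
    for _ in range(iters):
        ez = [beta + phi(beta) / Phi(beta) if yi == 1
              else beta - phi(beta) / (1.0 - Phi(beta))
              for yi in y]
        beta = sum(ez) / n
    return beta

y = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # toy data with 70% successes
beta_hat = probit_em(y)              # converges to Phi(beta) = 0.7
```

The truncated-normal conditional means in the E-step are the standard formulas E[z | z > 0] = β + φ(β)/Φ(β) and E[z | z ≤ 0] = β − φ(β)/(1 − Φ(β)); with censored data the same imputation idea applies, just with more complicated conditional expectations.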
Citations: 0
Estimating a smooth covariance for functional data
IF 1.6 | CAS Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-08-06 | DOI: 10.1016/j.csda.2025.108255
Uche Mbaka, James Owen Ramsay, Michelle Carey
Functional data analysis frequently involves estimating a smooth covariance function based on observed data. This estimation is essential for understanding interactions among functions and constitutes a fundamental aspect of numerous advanced methodologies, including functional principal component analysis. Two approaches for estimating smooth covariance functions in the presence of measurement errors are introduced. The first method employs a low-rank approximation of the covariance matrix, while the second ensures positive definiteness via a Cholesky decomposition. Both approaches employ the use of penalized regression to produce smooth covariance estimates and have been validated through comprehensive simulation studies. The practical application of these methods is demonstrated through the examination of average weekly milk yields in dairy cows as well as egg-laying patterns of Mediterranean fruit flies.
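The trick behind the second approach can be sketched in a few lines: parameterizing the covariance through a lower-triangular factor L and forming C = L Lᵀ guarantees positive semi-definiteness by construction. The 3 × 3 factor below is a made-up stand-in for a smoothed estimate, not the paper's functional setting:

```python
def chol_to_cov(L):
    """Build C = L @ L.T from a lower-triangular factor given as a list of
    rows; any such C is symmetric positive semi-definite by construction."""
    p = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(p)) for j in range(p)]
            for i in range(p)]

def quad_form(C, v):
    """v^T C v; non-negative for every v whenever C = L L^T."""
    p = len(v)
    return sum(v[i] * C[i][j] * v[j] for i in range(p) for j in range(p))

# A hypothetical smoothed lower-triangular factor (3 x 3).
L = [[1.0, 0.0, 0.0],
     [0.8, 0.6, 0.0],
     [0.5, 0.4, 0.7]]
C = chol_to_cov(L)
```

Optimizing over the entries of L (with a roughness penalty on the implied surface) lets an estimator search freely without ever leaving the cone of valid covariances, which is the design choice the abstract attributes to the Cholesky-based method.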
Cited by: 0
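The positive-definiteness concern raised in this abstract can be illustrated with a minimal sketch. This is not the paper's penalized-regression procedure: it only shows the generic low-rank idea, where truncating the eigendecomposition of a raw sample covariance and clipping negative eigenvalues guarantees a positive semi-definite estimate. All grid sizes, ranks, and noise levels below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)          # common observation grid
n = 200                            # number of curves

# Simulate smooth curves from two principal components plus white noise.
scores = rng.normal(size=(n, 2)) * np.array([2.0, 1.0])
phi = np.stack([np.sqrt(2) * np.sin(np.pi * t),
                np.sqrt(2) * np.sin(2 * np.pi * t)])
X = scores @ phi + 0.5 * rng.normal(size=(n, t.size))

S = np.cov(X, rowvar=False)        # raw covariance; measurement error inflates the diagonal

# Low-rank reconstruction: keep the leading eigenpairs and clip any
# negative eigenvalues to zero, which makes C_hat positive semi-definite
# by construction.
w, V = np.linalg.eigh(S)           # eigenvalues in ascending order
k = 2
w_top = np.clip(w[-k:], 0, None)
C_hat = (V[:, -k:] * w_top) @ V[:, -k:].T

print(np.linalg.eigvalsh(C_hat).min())   # no meaningfully negative eigenvalues
```

A smoothing penalty on the eigenfunctions (as in the paper) would additionally regularize the rows and columns of `C_hat`; the truncation step above only handles rank and positivity.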
Variable selection in AUC-optimizing classification
IF 1.6 CAS Tier 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-08-05 DOI: 10.1016/j.csda.2025.108256
Hyungwoo Kim, Seung Jun Shin
Optimizing the receiver operating characteristic (ROC) curve is a popular way to evaluate a binary classifier under imbalanced scenarios frequently encountered in practice. A practical approach to constructing a linear binary classifier is presented by simultaneously optimizing the area under the ROC curve (AUC) and selecting informative variables in high dimensions. In particular, the smoothly clipped absolute deviation (SCAD) penalty is employed, and its oracle property is established, which enables the development of a consistent BIC-type information criterion that greatly facilitates the tuning procedure. Both simulated and real data analyses demonstrate the promising performance of the proposed method in terms of AUC optimization and variable selection.
Cited by: 0
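The idea of simultaneously optimizing the AUC and selecting variables can be sketched in a few lines. This is not the paper's SCAD procedure: it maximizes a smoothed pairwise surrogate of the AUC by gradient ascent and uses a soft-thresholding (lasso-style) proximal step as a convex stand-in for the nonconvex SCAD penalty. The data, step size, and penalty level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 150, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [1.5, -1.0]        # only two informative variables
y = (X @ beta_true + 0.5 * rng.normal(size=n) > 0).astype(int)

pos, neg = X[y == 1], X[y == 0]
D = (pos[:, None, :] - neg[None, :, :]).reshape(-1, p)  # all positive-negative differences

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

w = np.zeros(p)
lam, lr = 0.02, 0.5
for _ in range(300):
    s = sigmoid(D @ w)
    grad = D.T @ (s * (1 - s)) / len(D)   # ascend the smoothed pairwise AUC
    w = w + lr * grad
    # Proximal soft-threshold: sparsity-inducing stand-in for SCAD.
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

# Empirical AUC: fraction of positive-negative pairs ranked correctly.
auc = np.mean((pos @ w)[:, None] > (neg @ w)[None, :])
print(auc, np.nonzero(w)[0])
```

SCAD itself would leave large coefficients unpenalized (the oracle property the paper establishes), whereas the lasso step above shrinks everything uniformly; the pairwise-surrogate structure of the objective is the part shared with the paper's approach.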
Random effects misspecification and its consequences for prediction in generalized linear mixed models
IF 1.6 CAS Tier 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-07-29 DOI: 10.1016/j.csda.2025.108254
Quan Vu, Francis K.C. Hui, Samuel Muller, A.H. Welsh
When fitting generalized linear mixed models, choosing the random effects distribution is an important decision. As random effects are unobserved, misspecification of their distribution is a real possibility. Thus, the consequences of random effects misspecification for point prediction and prediction inference of random effects in generalized linear mixed models need to be investigated. A combination of theory, simulation, and a real application is used to explore the effect of using the common normality assumption for the random effects distribution when the correct specification is a mixture of normal distributions, focusing on the impacts on point prediction, mean squared prediction errors, and prediction intervals. Results show that the level of shrinkage for the predicted random effects can differ greatly under the two random effect distributions, and so is susceptible to misspecification. Also, the unconditional mean squared prediction errors for the random effects are almost always larger under the misspecified normal random effects distribution, while results for the mean squared prediction errors conditional on the random effects are more complicated but remain generally larger under the misspecified distribution (especially when the true random effect is close to the mean of one of the component distributions in the true mixture distribution). Results for prediction intervals indicate that the overall coverage probability is, in contrast, not greatly impacted by misspecification. It is concluded that misspecifying the random effects distribution can affect prediction of random effects, and greater caution is recommended when adopting the normality assumption in generalized linear mixed models.
Cited by: 0