
Computational Statistics & Data Analysis: Latest Articles

Empirical Bayes Poisson matrix completion
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-05-06 | DOI: 10.1016/j.csda.2024.107976
Xiao Li , Takeru Matsuda , Fumiyasu Komaki

An empirical Bayes method for the Poisson matrix denoising and completion problems is proposed, and a corresponding algorithm called EBPM (Empirical Bayes Poisson Matrix) is developed. This approach is motivated by the non-central singular value shrinkage prior, which was used for the estimation of the mean matrix parameter of a matrix-variate normal distribution. Numerical experiments show that the EBPM algorithm outperforms the common nuclear norm penalized method in both matrix denoising and completion. The EBPM algorithm is highly efficient and does not require heuristic parameter tuning, as opposed to the nuclear norm penalized method, in which the regularization parameter should be selected. The EBPM algorithm also performs better than others in real-data applications.
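For orientation, the nuclear norm penalized baseline mentioned in the abstract is commonly written along the following lines. This is a generic formulation and is not taken from the paper; the symbols $X$ (observed counts), $\Omega$ (set of observed entries), $M$ (Poisson mean matrix), $\sigma_k$ (singular values) and $\lambda$ (regularization parameter) are assumptions introduced here for illustration.

```latex
\hat{M}_{\lambda}
  = \operatorname*{arg\,min}_{M > 0}\;
    \sum_{(i,j) \in \Omega} \bigl( M_{ij} - X_{ij} \log M_{ij} \bigr)
    + \lambda \, \lVert M \rVert_{*},
\qquad
\lVert M \rVert_{*} = \sum_{k} \sigma_{k}(M).
```

The first term is the negative Poisson log-likelihood up to a constant, and the estimate depends on $\lambda$, which is the tuning step the abstract says EBPM avoids.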

Computational Statistics & Data Analysis, Volume 197, Article 107976. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167947324000604/pdfft?md5=1823ebfe249fd22a2c430281b6468d2f&pid=1-s2.0-S0167947324000604-main.pdf
Citations: 0

Transfer learning via random forests: A one-shot federated approach
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-05-06 | DOI: 10.1016/j.csda.2024.107975
Pengcheng Xiang , Ling Zhou , Lu Tang

A one-shot federated transfer learning method using random forests (FTRF) is developed to improve the prediction accuracy at a target data site by leveraging information from auxiliary sites. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as the model trained on the target data alone regardless of possible data heterogeneity, which includes imbalanced and non-IID data distributions across sites and model mis-specification. FTRF has the ability to evaluate the similarity between the target and auxiliary sites, enabling the target site to autonomously select more similar site information to enhance its predictive performance. To ensure communication efficiency, FTRF adopts the model averaging idea that requires a single round of communication between the target and the auxiliary sites. Only fitted models from auxiliary sites are sent to the target site. Unlike traditional model averaging, FTRF incorporates predicted outcomes from other sites and the original variables when estimating model averaging weights, resulting in a variable-dependent weighting to better utilize models from auxiliary sites to improve prediction. Five real-world data examples show that FTRF reduces the prediction error by 2-40% compared to methods not utilizing auxiliary information.

Computational Statistics & Data Analysis, Volume 197, Article 107975.
Citations: 0

FDR control for linear log-contrast models with high-dimensional compositional covariates
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-05-03 | DOI: 10.1016/j.csda.2024.107973
Panxu Yuan, Changhan Jin, Gaorong Li

Linear log-contrast models have been widely used to describe the relationship between the response of interest and the compositional covariates, in which one central task is to identify the significant compositional covariates while controlling the false discovery rate (FDR) at a nominal level. To achieve this goal, a new FDR control method is proposed for linear log-contrast models with high-dimensional compositional covariates. An appealing feature of the proposed method is that it completely bypasses the traditional p-values and utilizes only the symmetry property of the test statistic for the unimportant compositional covariates to give an upper bound of the FDR. Under some regularity conditions, the FDR can be asymptotically controlled at the nominal level for the proposed method in theory, and the theoretical power is also proven to approach one as the sample size tends to infinity. The finite-sample performance of the proposed method is evaluated through extensive simulation studies, and applications to microbiome compositional datasets are also provided.
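As a reminder of the setting, a linear log-contrast model and a generic symmetry-based estimate of the false discovery proportion can be written as below. The notation ($y_i$, $x_{ij}$, $\beta_j$, $W_j$, threshold $t$, nominal level $q$) is assumed here for illustration, and the estimator shown is the standard form used by symmetry-based procedures in general; the paper's own test statistic and bound may differ.

```latex
y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i,
\qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0,

\widehat{\mathrm{FDP}}(t)
  = \frac{\#\{ j : W_j \le -t \}}{\max\bigl( \#\{ j : W_j \ge t \},\, 1 \bigr)},
\qquad
\hat{t} = \min\bigl\{ t > 0 : \widehat{\mathrm{FDP}}(t) \le q \bigr\}.
```

If the statistics $W_j$ of unimportant covariates are symmetric about zero, the count of large negative values approximates the number of false positives among the large positive values, which is why no p-values are needed.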

Computational Statistics & Data Analysis, Volume 197, Article 107973.
Citations: 0

Bayesian simultaneous factorization and prediction using multi-omic data
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-30 | DOI: 10.1016/j.csda.2024.107974
Sarah Samorodnitsky , Chris H. Wendt , Eric F. Lock

Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including “blockwise” missingness. It is shown via simulation that BSFP is competitive in recovering latent variation structure, and the importance of accounting for uncertainty in the estimated factorization within the predictive model is demonstrated. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.
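The link between normal priors on factor matrices and a nuclear norm penalty rests on a well-known variational identity, sketched below for a single matrix. The symbols $S$, $U$, $V$, $X$ and $\lambda$ are assumptions introduced here; BSF's structured objective, with joint and individual components across data sources, is more elaborate than this single-matrix case.

```latex
\lVert S \rVert_{*}
  = \min_{U, V \,:\, U V^{\top} = S}\;
    \tfrac{1}{2} \bigl( \lVert U \rVert_F^{2} + \lVert V \rVert_F^{2} \bigr),

\min_{U, V}\; \tfrac{1}{2} \lVert X - U V^{\top} \rVert_F^{2}
  + \tfrac{\lambda}{2} \bigl( \lVert U \rVert_F^{2} + \lVert V \rVert_F^{2} \bigr)
\;\Longleftrightarrow\;
\min_{S}\; \tfrac{1}{2} \lVert X - S \rVert_F^{2} + \lambda \lVert S \rVert_{*}.
```

The left-hand problem is a posterior mode under Gaussian noise with independent zero-mean normal priors on the entries of $U$ and $V$ (provided the factors have enough columns), so solving the nuclear norm problem recovers that mode and selects the rank through singular value thresholding.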

Computational Statistics & Data Analysis, Volume 197, Article 107974.
Citations: 0

CR-Lasso: Robust cellwise regularized sparse regression
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-30 | DOI: 10.1016/j.csda.2024.107971
Peng Su , Garth Tarr , Samuel Muller , Suojin Wang

Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. A robust Lasso-type cellwise regularization procedure, coined CR-Lasso, is proposed; it performs feature selection in the presence of cellwise outliers by minimising a regression loss and a cell deviation measure simultaneously. The evaluation of this approach involves simulation studies that compare its selection and prediction performance with several sparse regression methods. The results demonstrate that CR-Lasso is competitive within the considered settings. The effectiveness of the proposed method is further illustrated through an analysis of a bone mineral density dataset.

Computational Statistics & Data Analysis, Volume 197, Article 107971. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167947324000550/pdfft?md5=7f097cb47b472d8dfd0dc105cc9fcafa&pid=1-s2.0-S0167947324000550-main.pdf
Citations: 0

Bayesian estimation of large-scale simulation models with Gaussian process regression surrogates
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-23 | DOI: 10.1016/j.csda.2024.107972
Sylvain Barde

Large-scale, computationally expensive simulation models pose a particular challenge when it comes to estimating their parameters from empirical data. Most simulation models do not possess closed-form expressions for their likelihood function, requiring the use of simulation-based inference, such as simulated method of moments, indirect inference, likelihood-free inference or approximate Bayesian computation. However, given the high computational requirements of large-scale models, it is often difficult to run these estimation methods, as they require more simulated runs than can feasibly be carried out. The aim is to address the problem by providing a full Bayesian estimation framework where the true but intractable likelihood function of the simulation model is replaced by one generated by a surrogate model trained on the limited simulated data. This is provided by a Linear Model of Coregionalization, where each latent variable is a sparse variational Gaussian process, chosen for its desirable convergence and consistency properties. The effectiveness of the approach is tested using both a simulated Bayesian computing analysis on a known data generating process, and an empirical application in which the free parameters of a computationally demanding agent-based model are estimated on US macroeconomic data.
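To make the cost argument concrete, here is a minimal sketch of approximate Bayesian computation by rejection, one of the simulation-based methods the abstract lists. It is not the surrogate approach of the paper: the toy simulator, the uniform prior on [-5, 5], the tolerance `eps` and all function names are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_summary(theta, rng, n=20):
    """Toy stand-in for an expensive simulator: mean of n draws from N(theta, 1)."""
    return statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))


def abc_rejection(observed, n_draws, eps, rng):
    """Keep prior draws whose simulated summary lands within eps of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw from the (assumed) uniform prior
        if abs(simulate_summary(theta, rng) - observed) < eps:
            accepted.append(theta)
    return accepted


rng = random.Random(0)
posterior = abc_rejection(observed=2.0, n_draws=2000, eps=0.3, rng=rng)
print(len(posterior), statistics.fmean(posterior))
```

Every prior draw costs one full simulator run and most draws are rejected, so the number of runs grows quickly as the tolerance shrinks or the parameter dimension rises. That is the budget problem a likelihood surrogate trained on a limited set of runs is meant to sidestep.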

Computational Statistics & Data Analysis, Volume 196, Article 107972. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S0167947324000562/pdfft?md5=b53b8e5e84e9796eca1b2069b126ea59&pid=1-s2.0-S0167947324000562-main.pdf
Citations: 0

Heterogeneous Treatment Effect-based Random Forest: HTERF
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-16 | DOI: 10.1016/j.csda.2024.107970
Bérénice-Alexia Jocteur , Véronique Maume-Deschamps , Pierre Ribereau

Estimates of causal effects are needed to answer what-if questions about shifts in policy, such as new treatments in pharmacology or new pricing strategies for business owners. A new non-parametric approach based on random forests (HTERF) is proposed to estimate the heterogeneous treatment effect. Within the potential outcome framework and under unconfoundedness, HTERF is shown to be pointwise almost surely consistent for the true treatment effect. Interpretability results are also presented.
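For readers less familiar with the terminology, the target quantity and the role of unconfoundedness can be stated as follows. The notation ($Y(1)$, $Y(0)$ for potential outcomes, $W$ for the binary treatment indicator, $X$ for covariates) is the standard one and is assumed here rather than quoted from the paper.

```latex
\tau(x) = \mathbb{E}\bigl[\, Y(1) - Y(0) \mid X = x \,\bigr],

\bigl( Y(0), Y(1) \bigr) \perp\!\!\!\perp W \mid X
\;\Longrightarrow\;
\tau(x) = \mathbb{E}[\, Y \mid X = x,\, W = 1 \,] - \mathbb{E}[\, Y \mid X = x,\, W = 0 \,].
```

Unconfoundedness, together with overlap (both treatment arms occur with positive probability at each $x$), turns the unobservable contrast into a difference of two regression functions, which is what makes estimation from observed data possible.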

Computational Statistics & Data Analysis, Volume 196, Article 107970.
Citations: 0

Variable selection using data splitting and projection for principal fitted component models in high dimension
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-15 | DOI: 10.1016/j.csda.2024.107960
Seungchul Baek , Hoyoung Park , Junyong Park

Sufficient dimension reduction (SDR) is an effective way to detect nonlinear relationships between the response variable and covariates by reducing the dimensionality of the covariates without information loss. The principal fitted component (PFC) model is a way to implement SDR using some class of basis functions; however, the PFC model is not efficient when there are many irrelevant or noisy covariates. There have been a few studies on the selection of variables in the PFC model via penalized regression or sequential likelihood ratio tests. A novel variable selection technique for the PFC model is proposed by incorporating recent developments in multiple testing, such as mirror statistics and random data splitting. It is highlighted how a mirror statistic is constructed in the PFC model using the idea of projecting coefficients onto the other space generated from data splitting. The proposed method is superior to some existing methods in terms of false discovery rate (FDR) control and applicability to high-dimensional cases. In particular, the proposed method outperforms other methods as the number of covariates grows, which is appealing in high-dimensional data analysis. Simulation studies and analyses of real data sets have been conducted to show the finite-sample performance and the gain that it yields compared to existing methods.
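The mirror-statistic idea can be sketched in a few lines. This shows the generic data-splitting form, where coefficient estimates from two halves of the data are combined by sign agreement and magnitude; it does not reproduce the projection-based construction the paper develops for the PFC model. The coefficient values and the function names are hypothetical.

```python
def mirror_statistics(beta_half1, beta_half2):
    """Mirror statistic per covariate: sign(b1 * b2) * (|b1| + |b2|)."""
    stats = []
    for b1, b2 in zip(beta_half1, beta_half2):
        product = b1 * b2
        sign = (product > 0) - (product < 0)  # -1, 0 or +1
        stats.append(sign * (abs(b1) + abs(b2)))
    return stats


def select_with_fdr(stats, q):
    """Select indices with stats >= t, for the smallest t whose estimated FDP is <= q."""
    for t in sorted({abs(m) for m in stats if m != 0}):
        n_negative = sum(1 for m in stats if m <= -t)  # proxy for false positives
        n_selected = sum(1 for m in stats if m >= t)
        if n_selected > 0 and n_negative / n_selected <= q:
            return [j for j, m in enumerate(stats) if m >= t]
    return []


# Hypothetical coefficient estimates from the two halves of a random split.
beta_half1 = [2.0, 1.5, 0.1, -0.2, 0.05, -1.8]
beta_half2 = [1.8, 1.7, -0.15, 0.1, 0.02, -2.1]

stats = mirror_statistics(beta_half1, beta_half2)
selected = select_with_fdr(stats, q=0.2)
print(selected)  # [0, 1, 5]
```

Relevant covariates tend to get estimates with the same sign and large magnitude in both halves, so their mirror statistics are large and positive. For irrelevant covariates the sign agreement is essentially a coin flip, which gives the symmetry about zero that the threshold rule exploits.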

Computational Statistics & Data Analysis, Volume 196, Article 107960.
Citations: 0

Bayesian taut splines for estimating the number of modes
IF 1.8 | Zone 3, Mathematics | Q3, Computer Science, Interdisciplinary Applications | Pub Date: 2024-04-15 | DOI: 10.1016/j.csda.2024.107961
José E. Chacón , Javier Fernández Serrano

The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts.
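The remark about the subjective and uncertain nature of modes can be illustrated with a plain kernel density estimate, where the mode count changes with the bandwidth. This is a deliberately naive sketch, not the Bayesian taut spline method of the paper; the toy sample, the evaluation grid and the function names are assumptions.

```python
import math


def kde(x, data, bandwidth):
    """Gaussian kernel density estimate evaluated at x."""
    norm = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm


def count_modes(data, bandwidth, grid):
    """Count strict local maxima of the estimated density over the grid."""
    dens = [kde(x, data, bandwidth) for x in grid]
    return sum(
        1 for i in range(1, len(dens) - 1) if dens[i - 1] < dens[i] > dens[i + 1]
    )


# Toy sample with two well-separated clusters.
data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
grid = [i / 10 for i in range(-40, 41)]

print(count_modes(data, bandwidth=0.5, grid=grid))  # 2
print(count_modes(data, bandwidth=3.0, grid=grid))  # 1
```

The same sample yields two modes with a narrow kernel and one with a wide kernel, so a hard mode count is only as trustworthy as the smoothing choice behind it. That sensitivity is one motivation for approaches that report uncertainty over the number of modes.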

Citations: 0
Bayesian imaging inverse problem with SA-Roundtrip prior via HMC-pCN sampler
IF 1.8 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-04-10 DOI: 10.1016/j.csda.2024.107930
Jiayu Qian , Yuanyuan Liu , Jingya Yang , Qingping Zhou

Bayesian inference with deep generative prior has received considerable interest for solving imaging inverse problems in many scientific and engineering fields. The selection of the prior distribution is learned from, and therefore an important representation learning of, available prior measurements. The SA-Roundtrip, a novel deep generative prior, is introduced to enable controlled sampling generation and identify the data's intrinsic dimension. This prior incorporates a self-attention structure within a bidirectional generative adversarial network. Subsequently, Bayesian inference is applied to the posterior distribution in the low-dimensional latent space using the Hamiltonian Monte Carlo with preconditioned Crank-Nicolson (HMC-pCN) algorithm, which is proven to be ergodic under specific conditions. Experiments conducted on computed tomography (CT) reconstruction with the MNIST and TomoPhantom datasets reveal that the proposed method outperforms state-of-the-art comparisons, consistently yielding a robust and superior point estimator along with precise uncertainty quantification.

Citations: 0