
Journal of Machine Learning Research: Latest Articles

Dynamic Bayesian Learning for Spatiotemporal Mechanistic Models.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Sudipto Banerjee, Xiang Chen, Ian Frankenburg, Daniel Zhou

We develop an approach for Bayesian learning of spatiotemporal dynamical mechanistic models. Such learning consists of statistical emulation of the mechanistic system that can efficiently interpolate the output of the system from arbitrary inputs. The emulated learner can then be used to train the system from noisy data, which is achieved by melding information from observed data with the emulated mechanistic system. This joint melding of mechanistic systems employs hierarchical state-space models with Gaussian process regression. Assuming the dynamical system is controlled by a finite collection of inputs, Gaussian process regression learns the effect of these parameters through a number of training runs, driving the stochastic innovations of the spatiotemporal state-space component. This enables efficient modeling of the dynamics over space and time. This article details exact inference with analytically accessible posterior distributions in hierarchical matrix-variate Normal and Wishart models in designing the emulator. This step obviates expensive iterative algorithms such as Markov chain Monte Carlo or variational approximations. We also show how the approach applies to large-scale emulation by designing a dynamic Bayesian transfer learning framework. Inference on η proceeds using Markov chain Monte Carlo as a post-emulation step using the emulator as a regression component. We demonstrate this framework by solving inverse problems arising in the analysis of ordinary and partial nonlinear differential equations and, in addition, by applying it to a black-box computer model generating spatiotemporal dynamics across a graphical model.
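To make the idea of emulation concrete, the following is a minimal sketch of a Gaussian process emulator that interpolates the output of a toy mechanistic model from a few training runs. It is not the hierarchical matrix-variate state-space model of the paper: the simulator (closed-form logistic growth), the squared-exponential kernel, the length scale, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def simulator(theta, t=1.0, y0=0.1):
    """Toy mechanistic model (assumed): logistic growth at time t with rate theta."""
    return 1.0 / (1.0 + (1.0 / y0 - 1.0) * np.exp(-theta * t))

# A handful of training runs of the simulator at chosen input values.
theta_train = np.linspace(0.5, 3.0, 8)
y_train = simulator(theta_train)

def emulate(theta_new, jitter=1e-8):
    """GP posterior mean and variance at new inputs, using centered outputs."""
    mean_y = y_train.mean()
    K = rbf_kernel(theta_train, theta_train) + jitter * np.eye(len(theta_train))
    K_star = rbf_kernel(theta_new, theta_train)
    alpha = np.linalg.solve(K, y_train - mean_y)
    mean = mean_y + K_star @ alpha
    # Prior variance is 1 for this kernel; subtract the explained part.
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

# Compare the emulator with the simulator at inputs it has not seen.
theta_test = np.array([0.8, 1.7, 2.6])
pred_mean, pred_var = emulate(theta_test)
max_abs_error = np.max(np.abs(pred_mean - simulator(theta_test)))
```

Once fitted, `emulate` stands in for the simulator at new inputs, which is what makes a later inverse-problem step cheap; the posterior variance indicates where more training runs would help.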

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12676262/pdf/
Citations: 0
Bayesian Multi-Group Gaussian Process Models for Heterogeneous Group-Structured Data.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Didong Li, Andrew Jones, Sudipto Banerjee, Barbara Engelhardt

Gaussian processes are pervasive in functional data analysis, machine learning, and spatial statistics for modeling complex dependencies. Scientific data are often heterogeneous in their inputs and contain multiple known discrete groups of samples; thus, it is desirable to leverage the similarity among groups while accounting for heterogeneity across groups. We propose multi-group Gaussian processes (MGGPs) defined over $\mathbb{R}^p \times \mathcal{C}$, where $\mathcal{C}$ is a finite set representing the group label, by developing general classes of valid (positive definite) covariance functions on such domains. MGGPs are able to accurately recover relationships between the groups and efficiently share strength across samples from all groups during inference, while capturing distinct group-specific behaviors in the conditional posterior distributions. We demonstrate inference in MGGPs through simulation experiments, and we apply our proposed MGGP regression framework to gene expression data to illustrate the behavior and enhanced inferential capabilities of multi-group Gaussian processes by jointly modeling continuous and categorical variables.
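One simple way to obtain a valid covariance function on a domain that pairs a continuous input with a group label is a separable construction: multiply a kernel on the continuous input by a positive semidefinite matrix indexed by the group labels. The sketch below illustrates that construction only; the paper's classes of covariance functions may be more general, and the kernel choice, the matrix `B`, and the function names are assumptions.

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel between rows of x1 and rows of x2."""
    sq = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / length_scale**2)

def multigroup_kernel(x1, c1, x2, c2, group_cov):
    """Separable covariance: k((x, c), (x', c')) = k_x(x, x') * B[c, c']."""
    return rbf(x1, x2) * group_cov[np.ix_(c1, c2)]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # continuous inputs in R^2
groups = rng.integers(0, 3, size=30)  # group labels in {0, 1, 2}

# Between-group covariance built as L L^T so it is positive semidefinite.
L = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.6, 0.0],
              [0.2, 0.3, 0.9]])
B = L @ L.T

K = multigroup_kernel(X, groups, X, groups, B)
min_eig = np.linalg.eigvalsh(K).min()  # non-negative up to rounding error
```

The elementwise product of two positive semidefinite matrices is positive semidefinite (Schur product theorem), which is why `K` is a valid covariance matrix. The off-diagonal entries of `B` control how much strength is shared between groups.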

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12463451/pdf/
Citations: 0
Asymptotic Inference for Multi-Stage Stationary Treatment Policy with Variable Selection.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Daiqi Gao, Yufeng Liu, Donglin Zeng

Dynamic treatment regimes or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies in practice, namely multi-stage stationary treatment policies, prescribes treatment assignment probabilities using the same decision function across stages, where the decision is based on the same set of features consisting of time-evolving variables (e.g., routinely collected disease biomarkers). Although there has been extensive literature on constructing valid inference for the value function associated with dynamic treatment policies, little work has focused on the policies themselves, especially in the presence of high-dimensional features. We aim to fill the gap in this work. Specifically, we first obtain the multi-stage stationary treatment policy by minimizing the negative augmented inverse probability weighted estimator of the value function to increase asymptotic efficiency. An $L_1$ penalty is applied on the policy parameters to select important features. We then construct one-step improvements of the policy parameter estimators for valid inference. Theoretically, we show that the improved estimators are asymptotically normal, even if nuisance parameters are estimated at a slow convergence rate and the dimension of the features increases with the sample size. Our numerical studies demonstrate that the proposed method estimates a sparse policy with a near-optimal value function and conducts valid inference for the policy parameters.

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12987690/pdf/
Citations: 0
Bayesian Data Sketching for Varying Coefficient Regression Models.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Rajarshi Guhaniyogi, Laura Baracaldo, Sudipto Banerjee

Varying coefficient models are popular for estimating nonlinear regression functions in functional data models. Their Bayesian variants have received limited attention in large data applications, primarily due to prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We introduce Bayesian data sketching for varying coefficient models to obviate computational challenges presented by large sample sizes. To address the challenges of analyzing large data, we compress the functional response vector and predictor matrix by a random linear transformation to achieve dimension reduction and conduct inference on the compressed data. Our approach distinguishes itself from several existing methods for analyzing large functional data in that it requires neither the development of new models or algorithms, nor any specialized computational hardware while delivering fully model-based Bayesian inference. Well-established methods and algorithms for varying coefficient regression models can be applied to the compressed data. We establish posterior contraction rates for estimating the varying coefficients and predicting the outcome at new locations with the randomly compressed data model. We use simulation experiments and analyze remotely sensed vegetation data to empirically illustrate the inferential and computational efficiency of our approach.
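The compression step can be illustrated with a short sketch: a random matrix with far fewer rows than observations multiplies both the response vector and the predictor matrix, and a standard fit is then run on the compressed pair. For brevity this example uses a plain linear model and least squares in place of a varying coefficient model and posterior computation; the Gaussian sketching matrix, the dimensions, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 5000, 5, 500  # observations, predictors, sketch size (m << n)

# Simulated full data from a linear model with known coefficients.
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Random linear compression: Gaussian entries with variance 1/m.
Phi = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
y_sketch = Phi @ y   # compressed response, length m
X_sketch = Phi @ X   # compressed predictor matrix, m x p

# The same off-the-shelf solver is applied to full and compressed data.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch, *_ = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)

max_error_sketch = np.max(np.abs(beta_sketch - beta_true))
```

The fit on the compressed data works with `m` rows instead of `n`, so any existing algorithm can be reused unchanged at a fraction of the cost, at the price of some loss in precision relative to `beta_full`.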

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12666391/pdf/
Citations: 0
DisC²o-HD: Distributed causal inference with covariates shift for analyzing real-world high-dimensional data.
IF 4.3 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Jiayi Tong, Jie Hu, George Hripcsak, Yang Ning, Yong Chen

High-dimensional healthcare data, such as electronic health records (EHR) data and claims data, present two primary challenges due to the large number of variables and the need to consolidate data from multiple clinical sites. The third key challenge is the potential existence of heterogeneity in terms of covariate shift. In this paper, we propose a distributed learning algorithm accounting for covariate shift to estimate the average treatment effect (ATE) for high-dimensional data, named DisC²o-HD. Leveraging the surrogate likelihood method, our method calibrates the estimates of the propensity score and outcome models to approximately attain the desired covariate balancing property, while accounting for the covariate shift across multiple clinical sites. We show that our distributed covariate balancing propensity score estimator can approximate the pooled estimator, which is obtained by pooling the data from multiple sites together. The proposed estimator remains consistent if either the propensity score model or the outcome regression model is correctly specified. The semiparametric efficiency bound is achieved when both the propensity score and the outcome models are correctly specified. We conduct simulation studies to demonstrate the performance of the proposed algorithm; additionally, we apply the algorithm to a real-world data set to demonstrate its readiness for implementation and its validity.
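The double robustness property stated above can be illustrated with the standard augmented inverse probability weighted (AIPW) estimator of the ATE on a single simulated site. This is not the distributed DisC²o-HD algorithm: the data-generating process, the deliberately misspecified models, and the function name `aipw_ate` are assumptions made for the illustration.

```python
import numpy as np

def aipw_ate(y, a, propensity, mu1, mu0):
    """AIPW estimate of the ATE from outcome y, binary treatment a,
    propensity scores, and outcome-model predictions mu1 and mu0."""
    term1 = mu1 + a * (y - mu1) / propensity
    term0 = mu0 + (1 - a) * (y - mu0) / (1 - propensity)
    return np.mean(term1 - term0)

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-0.5 * x))  # true propensity score
a = rng.binomial(1, e_true)
y = 2.0 * a + x + rng.normal(size=n)     # true ATE is 2 by construction

# Correct propensity score, misspecified outcome model (all zeros).
ate_wrong_outcome = aipw_ate(y, a, e_true, np.zeros(n), np.zeros(n))

# Misspecified propensity score (constant 0.5), correct outcome model.
ate_wrong_propensity = aipw_ate(y, a, np.full(n, 0.5), 2.0 + x, x)
```

Both estimates land close to the true effect of 2 even though one of the two nuisance models is wrong in each case, which is the sense in which the estimator "remains consistent if either model is correctly specified".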

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12269483/pdf/
Citations: 0
Efficient and Robust Semi-supervised Estimation of Average Treatment Effect with Partially Annotated Treatment and Response.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Jue Hou, Rajarshi Mukherjee, Tianxi Cai

A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. In many studies, neither treatment information nor response can be accurately captured by readily available EHR features; both require labor-intensive manual chart review for precise annotation, which limits the number of available gold-standard labels on these key variables. We considered average treatment effect (ATE) estimation when 1) exact treatment and outcome variables are only observed together in a small labeled subset and 2) noisy surrogates of treatment and outcome, such as relevant prescription and diagnosis codes, along with potential confounders are observed for all subjects. We derived the efficient influence function for ATE and used it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We showed that our SMMAL ATE estimator is semiparametrically efficient with B-spline regression under low-dimensional smooth models. We developed adaptive sparsity/model doubly robust estimation under high-dimensional logistic propensity score and outcome regression models. Results from simulation studies demonstrated the validity of our SMMAL method and its superiority over supervised and unsupervised benchmarks. We applied SMMAL to the assessment of targeted therapies for metastatic colorectal cancer in comparison to chemotherapy.

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12671556/pdf/
Citations: 0
Bayesian Sparse Gaussian Mixture Model for Clustering in High Dimensions.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Dapeng Yao, Fangzheng Xie, Yanxun Xu

We study the sparse high-dimensional Gaussian mixture model when the number of clusters is allowed to grow with the sample size. A minimax lower bound for parameter estimation is established, and we show that a constrained maximum likelihood estimator achieves the minimax lower bound. However, this optimization-based estimator is computationally intractable because the objective function is highly nonconvex and the feasible set involves discrete structures. To address the computational challenge, we propose a computationally tractable Bayesian approach to estimate high-dimensional Gaussian mixtures whose cluster centers exhibit sparsity using a continuous spike-and-slab prior. We further prove that the posterior contraction rate of the proposed Bayesian method is minimax optimal. The mis-clustering rate is obtained as a by-product using tools from matrix perturbation theory. The proposed Bayesian sparse Gaussian mixture model does not require pre-specifying the number of clusters, which can be adaptively estimated. The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing data set.
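A continuous spike-and-slab prior replaces the point mass at zero with a narrow continuous density (the spike) mixed with a wide one (the slab). The sketch below uses two Laplace densities to show how such a mixture separates near-zero coordinates from large ones. The Laplace form, the mixing weight, and the two scales are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def laplace_pdf(x, scale):
    """Density of a zero-centered Laplace distribution."""
    return np.exp(-np.abs(x) / scale) / (2.0 * scale)

def spike_slab_pdf(x, weight=0.2, spike_scale=0.05, slab_scale=2.0):
    """Mixture density: (1 - weight) * spike + weight * slab."""
    return ((1.0 - weight) * laplace_pdf(x, spike_scale)
            + weight * laplace_pdf(x, slab_scale))

def slab_probability(x, weight=0.2, spike_scale=0.05, slab_scale=2.0):
    """Conditional probability that a value x was drawn from the slab."""
    slab = weight * laplace_pdf(x, slab_scale)
    spike = (1.0 - weight) * laplace_pdf(x, spike_scale)
    return slab / (slab + spike)

p_near_zero = slab_probability(0.0)  # value at the origin: mostly spike
p_large = slab_probability(1.5)      # value far from zero: mostly slab
```

Because both components are continuous, the prior density is available in closed form everywhere, which avoids a search over discrete inclusion patterns while still concentrating most coordinates of a cluster center near zero.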

Journal of Machine Learning Research, vol. 26 (2025). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12965251/pdf/
Citations: 0
Efficient and Robust Transfer Learning of Optimal Individualized Treatment Regimes with Right-Censored Survival Data.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Pan Zhao, Julie Josse, Shu Yang

An individualized treatment regime (ITR) is a decision rule that assigns treatments based on patients' characteristics. The value function of an ITR is the expected outcome in a counterfactual world had this ITR been implemented. Recently, there has been increasing interest in combining heterogeneous data sources, such as leveraging the complementary features of randomized controlled trial (RCT) data and a large observational study (OS). Usually, a covariate shift exists between the source and target population, rendering the source-optimal ITR not optimal for the target population. We present an efficient and robust transfer learning framework for estimating the optimal ITR with right-censored survival data that generalizes well to the target population. The value function accommodates a broad class of functionals of survival distributions, including survival probabilities and restricted mean survival times (RMSTs). We propose a doubly robust estimator of the value function, and the optimal ITR is learned by maximizing the value function within a pre-specified class of ITRs. We establish the cubic rate of convergence for the estimated parameter indexing the optimal ITR, and show that the proposed optimal value estimator is consistent and asymptotically normal even with flexible machine learning methods for nuisance parameter estimation. We evaluate the empirical performance of the proposed method by simulation studies and a real data application of sodium bicarbonate therapy for patients with severe metabolic acidaemia in the intensive care unit (ICU), combining an RCT and an observational study with heterogeneity.
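As a concrete instance of one functional the value function can target, the restricted mean survival time up to a horizon tau is the area under the survival curve on [0, tau]. The sketch below computes it from a Kaplan-Meier estimate on right-censored data. It is a plain nonparametric calculation, not the doubly robust transfer-learning estimator of the paper, and the function names and toy data are assumptions.

```python
import numpy as np

def kaplan_meier(times, events):
    """Distinct event times and the KM survival estimate just after each.
    events is 1 for an observed event and 0 for a censored observation."""
    event_times = np.unique(times[events == 1])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & (events == 1))
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def rmst(times, events, tau):
    """Area under the KM step function on [0, tau]."""
    event_times, surv = kaplan_meier(times, events)
    keep = event_times < tau
    grid = np.concatenate(([0.0], event_times[keep], [tau]))
    heights = np.concatenate(([1.0], surv[keep]))
    return float(np.sum(heights * np.diff(grid)))

# Toy data: the subject with time 2.0 is censored (event indicator 0).
toy_times = np.array([1.0, 2.0, 3.0, 4.0])
toy_events = np.array([1, 0, 1, 1])
toy_rmst = rmst(toy_times, toy_events, tau=3.5)
```

For the toy data the survival curve is 1 on [0, 1), 0.75 on [1, 3), and 0.375 on [3, 3.5], so the area is 1 + 1.5 + 0.1875 = 2.6875. Without censoring, the same function reduces to the sample mean of min(T, tau).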

Journal of Machine Learning Research, vol. 26. Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12974684/pdf/
Citations: 0
Directed Cyclic Graphs for Simultaneous Discovery of Time-Lagged and Instantaneous Causality from Longitudinal Data Using Instrumental Variables.
IF 5.2 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2025-01-01
Wei Jin, Yang Ni, Amanda B Spence, Leah H Rubin, Yanxun Xu

We consider the problem of causal discovery from longitudinal observational data. We develop a novel framework that simultaneously discovers time-lagged causality and possibly cyclic instantaneous causality. Under common causal discovery assumptions, combined with additional instrumental information typically available in longitudinal data, we prove the proposed model is generally identifiable. To the best of our knowledge, this is the first causal identification theory for directed graphs with general cyclic patterns that achieves unique causal identifiability. Structural learning is carried out in a fully Bayesian fashion. Through extensive simulations and an application to the Women's Interagency HIV Study, we demonstrate the identifiability, utility, and superiority of the proposed model over state-of-the-art alternative methods.

Journal of Machine Learning Research, vol. 26. Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12700356/pdf/
Citations: 0
Flexible Bayesian Product Mixture Models for Vector Autoregressions.
IF 4.3 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2024-04-01
Suprateek Kundu, Joshua Lukemire

Bayesian non-parametric methods based on Dirichlet process mixtures have seen tremendous success in various domains and are appealing because they can borrow information by clustering samples that share identical parameters. However, such methods can face hurdles in heterogeneous settings where objects are expected to cluster only along a subset of axes or where clusters of samples share only a subset of identical parameters. We overcome such limitations by developing a novel class of products of Dirichlet process location-scale mixtures that enables independent clustering at multiple scales, which results in varying levels of information sharing across samples. First, we develop the approach for independent multivariate data. Subsequently, we generalize it to multivariate time-series data under the framework of multi-subject Vector Autoregressive (VAR) models, which is our primary focus and which goes beyond parametric single-subject VAR models. We establish posterior consistency and develop efficient posterior computation for implementation. Extensive numerical studies involving VAR models show distinct advantages over competing methods in terms of estimation, clustering, and feature selection accuracy. Our resting-state fMRI analysis from the Human Connectome Project reveals biologically interpretable connectivity differences between distinct intelligence groups, while an air pollution application illustrates superior forecasting accuracy compared to alternative methods.

Journal of Machine Learning Research, vol. 25. Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11646655/pdf/
Citations: 0