Journal of Machine Learning Research: Latest Articles

Inference for Gaussian Processes with Matérn Covariogram on Compact Riemannian Manifolds.
IF 6; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-03-01
Didong Li, Wenpin Tang, Sudipto Banerjee

Gaussian processes are widely employed as versatile modelling and predictive tools in spatial statistics, functional data analysis, computer modelling and diverse applications of machine learning. They have been widely studied over Euclidean spaces, where they are specified using covariance functions or covariograms for modelling complex dependencies. There is a growing literature on Gaussian processes over Riemannian manifolds in order to develop richer and more flexible inferential frameworks for non-Euclidean data. While numerical approximations through graph representations have been well studied for the Matérn covariogram and heat kernel, the behaviour of asymptotic inference on the parameters of the covariogram has received relatively scant attention. We focus on asymptotic behaviour for Gaussian processes constructed over compact Riemannian manifolds. Building upon a recently introduced Matérn covariogram on a compact Riemannian manifold, we employ formal notions and conditions for the equivalence of two Matérn Gaussian random measures on compact manifolds to derive the parameter that is identifiable, also known as the microergodic parameter, and formally establish the consistency of the maximum likelihood estimate and the asymptotic optimality of the best linear unbiased predictor. The circle is studied as a specific example of compact Riemannian manifolds with numerical experiments to illustrate and corroborate the theory.
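
To make the circle example concrete, here is a minimal sketch of a Matérn-type covariogram on the unit circle built from a truncated spectral series. The function name `matern_circle`, the parameter names `sigma2`, `alpha` and `nu`, the truncation level `n_terms`, and the specific spectral weights are assumptions made for illustration; the abstract does not state the paper's exact parameterization.

```python
import numpy as np

def matern_circle(theta, sigma2=1.0, alpha=1.0, nu=1.5, n_terms=200):
    """Truncated spectral Matern-type covariogram on the unit circle.

    theta   : angular difference(s) in radians, any array shape.
    Assumed form (illustrative): weights (alpha^2 + n^2)^-(nu + 1/2) on the
    Laplacian eigenvalues n^2, normalised so that the value at 0 is sigma2.
    """
    theta = np.asarray(theta, dtype=float)
    n = np.arange(-n_terms, n_terms + 1)
    weights = (alpha**2 + n**2) ** (-(nu + 0.5))
    # Sum over frequencies along the last axis of the outer product.
    series = (weights * np.cos(np.multiply.outer(theta, n))).sum(axis=-1)
    return sigma2 * series / weights.sum()

# Covariance matrix for 50 equally spaced points on the circle.
angles = np.linspace(0.0, 2 * np.pi, 50, endpoint=False)
K = matern_circle(angles[:, None] - angles[None, :], sigma2=2.0)
```

Because every spectral weight is positive, the resulting matrix is positive semidefinite, so it can serve as a Gaussian process covariance in small numerical experiments of the kind the abstract mentions.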

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10361735/pdf/
Citations: 0
Dimension-Grouped Mixed Membership Models for Multivariate Categorical Data.
IF 4.3; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-02-01
Yuqi Gu, Elena A Erosheva, Gongjun Xu, David B Dunson

Mixed Membership Models (MMMs) are a popular family of latent structure models for complex multivariate data. Instead of forcing each subject to belong to a single cluster, MMMs incorporate a vector of subject-specific weights characterizing partial membership across clusters. With this flexibility come challenges in uniquely identifying, estimating, and interpreting the parameters. In this article, we propose a new class of Dimension-Grouped MMMs (Gro-M³s) for multivariate categorical data, which improve parsimony and interpretability. In Gro-M³s, observed variables are partitioned into groups such that the latent membership is constant for variables within a group but can differ across groups. Traditional latent class models are obtained when all variables are in one group, while traditional MMMs are obtained when each variable is in its own group. The new model corresponds to a novel decomposition of probability tensors. Theoretically, we derive transparent identifiability conditions for both the unknown grouping structure and model parameters in general settings. Methodologically, we propose a Bayesian approach for Dirichlet Gro-M³s to infer the variable grouping structure and estimate model parameters. Simulation results demonstrate good computational performance and empirically confirm the identifiability results. We illustrate the new methodology through applications to a functional disability survey dataset and a personality test dataset.
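
The grouping idea can be sketched as a small generative simulation. This is only an illustration of the structure the abstract describes (one latent membership draw per variable group, shared by all variables in that group); the function name `simulate_gro_m3`, its arguments, and the flat Dirichlet prior on the item parameters are assumptions, not the authors' implementation.

```python
import numpy as np

def simulate_gro_m3(n_subjects, groups, n_categories, n_profiles,
                    dirichlet_alpha, rng):
    """Simulate categorical data with group-level latent memberships.

    groups[j] is the (hypothetical) group index of variable j.
    """
    groups = np.asarray(groups)
    n_vars = groups.size
    n_groups = groups.max() + 1
    # lam[j, k] is the categorical distribution of variable j under profile k.
    lam = rng.dirichlet(np.ones(n_categories), size=(n_vars, n_profiles))
    # pi[i] holds subject i's partial-membership weights across profiles.
    pi = rng.dirichlet(np.full(n_profiles, dirichlet_alpha), size=n_subjects)
    z = np.empty((n_subjects, n_groups), dtype=int)
    y = np.empty((n_subjects, n_vars), dtype=int)
    for i in range(n_subjects):
        # One latent profile per group, drawn from the subject's weights.
        z[i] = rng.choice(n_profiles, size=n_groups, p=pi[i])
        for j in range(n_vars):
            # Variables in the same group share the same latent profile.
            y[i, j] = rng.choice(n_categories, p=lam[j, z[i, groups[j]]])
    return y, z, pi
```

Passing `groups=[0, 0, 0, 0, 0]` puts all variables in one group (the latent class special case), while `groups=[0, 1, 2, 3, 4]` gives each variable its own group (the traditional MMM special case).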

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12000818/pdf/
Citations: 0
Bayesian Data Selection.
IF 6; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Eli N Weinstein, Jeffrey W Miller

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic (such as a subset of variables) that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, statistically and computationally. We propose a novel score for performing data selection, the "Stein volume criterion (SVC)", that does not require fitting a nonparametric model. The SVC takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and establish consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.
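
The kernelized Stein discrepancy that the SVC builds on can be illustrated in one dimension. The sketch below is not the SVC itself: it only computes a V-statistic estimate of the squared discrepancy for a model specified through its score function, using a Gaussian (RBF) kernel. The function name `ksd_vstat` and the fixed `bandwidth` are assumptions for illustration.

```python
import numpy as np

def ksd_vstat(x, score, bandwidth=1.0):
    """V-statistic estimate of the squared kernelized Stein discrepancy (1-D).

    x     : 1-D array of samples.
    score : function returning d/dx log p(x) for the model density p.
    """
    x = np.asarray(x, dtype=float)
    s = score(x)
    d = x[:, None] - x[None, :]            # d[i, j] = x_i - x_j
    h2 = bandwidth**2
    k = np.exp(-d**2 / (2 * h2))           # RBF kernel k(x_i, x_j)
    dk_dx = -d / h2 * k                    # derivative in the first argument
    dk_dy = d / h2 * k                     # derivative in the second argument
    d2k = (1 / h2 - d**2 / h2**2) * k      # mixed second derivative
    u = (s[:, None] * s[None, :] * k
         + s[:, None] * dk_dy
         + s[None, :] * dk_dx
         + d2k)
    return u.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=300)
good = ksd_vstat(x, lambda t: -t)          # score of N(0, 1): matches the data
bad = ksd_vstat(x, lambda t: -(t - 3.0))   # score of N(3, 1): mismatched
```

The discrepancy needs only the model's score function, so an unnormalized density suffices; a well-matched model gives a value near zero, and a mismatched one gives a clearly larger value.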

Journal of Machine Learning Research, vol. 24 (23). PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10194814/pdf/
Citations: 0
Semi-Supervised Off-Policy Reinforcement Learning and Value Estimation for Dynamic Treatment Regimes.
IF 5.2; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Aaron Sonabend-W, Nilanjana Laha, Ashwin N Ananthakrishnan, Tianxi Cai, Rajarshi Mukherjee

Reinforcement learning (RL) has shown great promise in estimating dynamic treatment regimes which take into account patient heterogeneity. However, health-outcome information, used as the reward for RL methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small-sized labeled data set with actual outcomes observed and a large unlabeled data set with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to dynamic treatment regimes brings interesting challenges: 1) Feature distribution for Q-learning is unknown as it includes previous outcomes. 2) The surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative of the optimal policy or value function. We provide theoretical results for our Q function and value function estimators to understand the degree of efficiency gained from SSL. Our method is at least as efficient as the supervised approach, and robust to bias from mis-specification of the imputation models.

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843220/pdf/
Citations: 0
Model-Based Causal Discovery for Zero-Inflated Count Data.
IF 5.2; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Junsouk Choi, Yang Ni

Zero-inflated count data arise in a wide range of scientific areas such as social science, biology, and genomics. Very few causal discovery approaches can adequately account for excessive zeros as well as various features of multivariate count data such as overdispersion. In this paper, we propose a new zero-inflated generalized hypergeometric directed acyclic graph (ZiG-DAG) model for inference of causal structure from purely observational zero-inflated count data. The proposed ZiG-DAGs exploit a broad family of generalized hypergeometric probability distributions and are useful for modeling various types of zero-inflated count data with great flexibility. In addition, ZiG-DAGs allow for both linear and nonlinear causal relationships. We prove that the causal structure is identifiable for the proposed ZiG-DAGs via a general proof technique for count data, which is applicable beyond the proposed model for investigating causal identifiability. Score-based algorithms are developed for causal structure learning. Extensive synthetic experiments as well as a real dataset with known ground truth demonstrate the superior performance of the proposed method against state-of-the-art alternative methods in discovering causal structure from observational zero-inflated count data. An application of reverse-engineering a gene regulatory network from a single-cell RNA-sequencing dataset illustrates the utility of ZiG-DAGs in practice.
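
For readers unfamiliar with zero inflation, a minimal sketch of a zero-inflated count distribution follows. It uses the zero-inflated Poisson as the simplest stand-in; the paper's ZiG-DAG works with a broader generalized hypergeometric family, which is not reproduced here. The function name `zip_pmf` and the parameter names `lam` and `pi_zero` are assumptions.

```python
import math

def zip_pmf(k, lam, pi_zero):
    """Probability of count k under a zero-inflated Poisson.

    With probability pi_zero the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with rate lam.
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    return pi_zero * (k == 0) + (1 - pi_zero) * poisson

# Excess zeros: P(0) is larger than under a plain Poisson with the same rate.
p_zero_inflated = zip_pmf(0, lam=2.0, pi_zero=0.3)
p_zero_poisson = math.exp(-2.0)
```

Setting `pi_zero=0` recovers the ordinary Poisson, which makes the extra mass at zero easy to see by comparison.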

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12337821/pdf/
Citations: 0
DART: Distance Assisted Recursive Testing.
IF 4.3; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Xuechan Li, Anthony D Sung, Jichun Xie

Multiple testing is a commonly used tool in modern data science. Sometimes, the hypotheses are embedded in a space; the distances between the hypotheses reflect their co-null/co-alternative patterns. Properly incorporating the distance information in testing will boost testing power. Hence, we developed a new multiple testing framework named Distance Assisted Recursive Testing (DART). DART combines artificial intelligence (AI) and statistical modeling in two stages. The first stage uses AI models to construct an aggregation tree that reflects the distance information. The second stage uses statistical models to embed the testing on the tree and control the false discovery rate. Theoretical analysis and numerical experiments demonstrated that DART generates valid, robust, and powerful results. We applied DART to a clinical trial in an allogeneic stem cell transplantation study to identify the gut microbiota whose abundance was impacted by post-transplant care.

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11636646/pdf/
Citations: 0
Inference for a Large Directed Acyclic Graph with Unspecified Interventions.
IF 6; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Chunlin Li, Xiaotong Shen, Wei Pan

Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. Also, we show that the distribution of a data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to infer gene regulatory networks. The R implementation is available at https://github.com/chunlinli/intdag.

Journal of Machine Learning Research, vol. 24. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10497226/pdf/
Citations: 0
Fair Data Representation for Machine Learning at the Pareto Frontier.
IF 4.3; Zone 3 (Computer Science); Q1 Automation & Control Systems; Pub Date: 2023-01-01
Shizhou Xu, Thomas Strohmer

As machine learning powered decision-making becomes increasingly important in our daily lives, it is imperative to strive for fairness in the underlying data processing. We propose a pre-processing algorithm for fair data representation via which L²(ℙ)-objective supervised learning yields estimates of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal affine transport to approach the post-processing Wasserstein barycenter characterization of the optimal fair L²-objective supervised learning via a pre-processing data deformation. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to their barycenter characterize the Pareto frontier between L²-loss and total Wasserstein distance among the marginals. Numerical simulations underscore the advantages: (1) the pre-processing step is composable with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.
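
The affine-transport idea can be sketched in one dimension, where it reduces to a per-group shift and rescale. This is a simplified illustration under a Gaussian approximation, not the paper's algorithm: for one-dimensional Gaussians, the 2-Wasserstein barycenter has the weighted average of the group means as its mean and the weighted average of the group standard deviations as its standard deviation. The function name `affine_to_barycenter` is an assumption.

```python
import numpy as np

def affine_to_barycenter(x, group):
    """Map a 1-D feature, group by group, onto a common barycenter.

    x     : 1-D array of feature values.
    group : 1-D array of sensitive-group labels, same length as x.
    """
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)
    labels, counts = np.unique(group, return_counts=True)
    weights = counts / counts.sum()
    means = np.array([x[group == g].mean() for g in labels])
    stds = np.array([x[group == g].std() for g in labels])
    bar_mean = weights @ means   # barycenter mean (1-D Gaussian case)
    bar_std = weights @ stds     # barycenter standard deviation
    out = np.empty_like(x)
    for g, m, s in zip(labels, means, stds):
        mask = group == g
        # Affine map: shift to the barycenter mean and rescale the spread.
        out[mask] = bar_mean + (bar_std / s) * (x[mask] - m)
    return out
```

After the map, every group has the same empirical mean and standard deviation, so these two moments of the transformed feature no longer reveal group membership.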

{"title":"Fair Data Representation for Machine Learning at the Pareto Frontier.","authors":"Shizhou Xu, Thomas Strohmer","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>As machine learning powered decision-making becomes increasingly important in our daily lives, it is imperative to strive for fairness in the underlying data processing. We propose a pre-processing algorithm for fair data representation via which <math> <mrow><msup><mi>L</mi> <mn>2</mn></msup> <mo>(</mo> <mtext>ℙ</mtext> <mo>)</mo></mrow> </math> -objective supervised learning results in estimations of the Pareto frontier between prediction error and statistical disparity. Particularly, the present work applies the optimal affine transport to approach the post-processing Wasserstein barycenter characterization of the optimal fair <math> <mrow><msup><mi>L</mi> <mn>2</mn></msup> </mrow> </math> -objective supervised learning via a pre-processing data deformation. Furthermore, we show that the Wasserstein geodesics from learning outcome marginals to their barycenter characterizes the Pareto frontier between <math> <mrow><msup><mi>L</mi> <mn>2</mn></msup> </mrow> </math> -loss and total Wasserstein distance among the marginals. 
Numerical simulations underscore the advantages: (1) the pre-processing step is compositive with arbitrary conditional expectation estimation supervised learning methods and unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; (3) the optimal affine maps are computationally efficient even for high-dimensional data.</p>","PeriodicalId":50161,"journal":{"name":"Journal of Machine Learning Research","volume":"24 ","pages":""},"PeriodicalIF":4.3,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11494318/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142512129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction.
IF 4.3 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2023-01-01
Jue Hou, Zijian Guo, Tianxi Cai

Risk modeling with electronic health records (EHR) data is challenging because the disease outcome is not directly observed and the predictors are high-dimensional. In this paper, we develop a surrogate assisted semi-supervised learning approach, leveraging small labeled data with annotated outcomes and extensive unlabeled data of outcome surrogates and high-dimensional predictors. We propose to impute the unobserved outcomes by constructing a sparse imputation model with outcome surrogates and high-dimensional predictors. We further conduct a one-step bias correction to enable interval estimation for the risk prediction. Our inference procedure is valid even if both the imputation and risk prediction models are misspecified. Our novel way of utilizing unlabeled data enables high-dimensional statistical inference in the challenging setting of a dense risk prediction model. We present an extensive simulation study to demonstrate the superiority of our approach compared to existing supervised methods. We apply the method to genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.
Citations: 0
Minimax Estimation for Personalized Federated Learning: An Alternative between FedAvg and Local Training?
IF 4.3 Zone 3 Computer Science Q1 AUTOMATION & CONTROL SYSTEMS Pub Date: 2023-01-01
Shuxiao Chen, Qinqing Zheng, Qi Long, Weijie J Su

A widely recognized difficulty in federated learning arises from the statistical heterogeneity among clients: local datasets often originate from distinct yet not entirely unrelated probability distributions, and personalization is, therefore, necessary to achieve optimal results from each individual's perspective. In this paper, we show how the excess risks of personalized federated learning using a smooth, strongly convex loss depend on data heterogeneity from a minimax point of view, with a focus on the FedAvg algorithm (McMahan et al., 2017) and pure local training (i.e., clients solve empirical risk minimization problems on their local datasets without any communication). Our main result reveals an approximate alternative between these two baseline algorithms for federated learning: the former algorithm is minimax rate optimal over a collection of instances when data heterogeneity is small, whereas the latter is minimax rate optimal when data heterogeneity is large, and the threshold is sharp up to a constant. As an implication, our results show that from a worst-case point of view, a dichotomous strategy that makes a choice between the two baseline algorithms is rate-optimal. Another implication is that the popular strategy of FedAvg followed by local fine-tuning is also minimax optimal under additional regularity conditions. Our analysis relies on a new notion of algorithmic stability that takes into account the nature of federated learning.
Citations: 0