
arXiv - STAT - Machine Learning: Latest Publications

Fair CoVariance Neural Networks
Pub Date: 2024-09-13 | DOI: arxiv-2409.08558
Andrea Cavallo, Madeline Navarro, Santiago Segarra, Elvin Isufi
Covariance-based data processing is widespread across signal processing and machine learning applications due to its ability to model data interconnectivities and dependencies. However, harmful biases in the data may become encoded in the sample covariance matrix and cause data-driven methods to treat different subpopulations unfairly. Existing works such as fair principal component analysis (PCA) mitigate these effects, but remain unstable in low sample regimes, which in turn may jeopardize the fairness goal. To address both biases and instability, we propose Fair coVariance Neural Networks (FVNNs), which perform graph convolutions on the covariance matrix for both fair and accurate predictions. Our FVNNs provide a flexible model compatible with several existing bias mitigation techniques. In particular, FVNNs allow for mitigating the bias in two ways: first, they operate on fair covariance estimates that remove biases from their principal components; second, they are trained in an end-to-end fashion via a fairness regularizer in the loss function so that the model parameters are tailored to solve the task directly in a fair manner. We prove that FVNNs are intrinsically fairer than analogous PCA approaches thanks to their stability in low sample regimes. We validate the robustness and fairness of our model on synthetic and real-world data, showcasing the flexibility of FVNNs along with the tradeoff between fair and accurate performance.
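As a rough illustration of the two ingredients the abstract describes -- a graph convolution that uses the covariance matrix as its shift operator, and a fairness regularizer added to the training loss -- here is a minimal PyTorch sketch. The layer structure, the names, and the particular group-gap penalty are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CovarianceFilterLayer(nn.Module):
    """Graph convolution with the covariance matrix C as shift operator:
    y = relu(sum_k h_k C^k x). A hypothetical building block, not the
    paper's exact architecture."""

    def __init__(self, order: int):
        super().__init__()
        self.h = nn.Parameter(0.1 * torch.randn(order + 1))  # filter taps

    def forward(self, C: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        out = self.h[0] * x
        z = x
        for k in range(1, len(self.h)):
            z = C @ z                      # apply the shift once more
            out = out + self.h[k] * z
        return torch.relu(out)

def fair_loss(pred, target, group, base_loss, lam=1.0):
    """Task loss plus a penalty on the gap between per-group losses --
    one simple instance of 'a fairness regularizer in the loss function'."""
    gap = (base_loss(pred[group == 0], target[group == 0])
           - base_loss(pred[group == 1], target[group == 1])).abs()
    return base_loss(pred, target) + lam * gap
```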
{"title":"Fair CoVariance Neural Networks","authors":"Andrea Cavallo, Madeline Navarro, Santiago Segarra, Elvin Isufi","doi":"arxiv-2409.08558","DOIUrl":"https://doi.org/arxiv-2409.08558","url":null,"abstract":"Covariance-based data processing is widespread across signal processing and\u0000machine learning applications due to its ability to model data\u0000interconnectivities and dependencies. However, harmful biases in the data may\u0000become encoded in the sample covariance matrix and cause data-driven methods to\u0000treat different subpopulations unfairly. Existing works such as fair principal\u0000component analysis (PCA) mitigate these effects, but remain unstable in low\u0000sample regimes, which in turn may jeopardize the fairness goal. To address both\u0000biases and instability, we propose Fair coVariance Neural Networks (FVNNs),\u0000which perform graph convolutions on the covariance matrix for both fair and\u0000accurate predictions. Our FVNNs provide a flexible model compatible with\u0000several existing bias mitigation techniques. In particular, FVNNs allow for\u0000mitigating the bias in two ways: first, they operate on fair covariance\u0000estimates that remove biases from their principal components; second, they are\u0000trained in an end-to-end fashion via a fairness regularizer in the loss\u0000function so that the model parameters are tailored to solve the task directly\u0000in a fair manner. We prove that FVNNs are intrinsically fairer than analogous\u0000PCA approaches thanks to their stability in low sample regimes. We validate the\u0000robustness and fairness of our model on synthetic and real-world data,\u0000showcasing the flexibility of FVNNs along with the tradeoff between fair and\u0000accurate performance.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning
Pub Date: 2024-09-12 | DOI: arxiv-2409.08419
Ahmet Kapkiç, Pratanu Mandal, Shu Wan, Paras Sheth, Abhinav Gorantla, Yoonhyuk Choi, Huan Liu, K. Selçuk Candan
While witnessing the exceptional success of machine learning (ML) technologies in many applications, users are starting to notice a critical shortcoming of ML: correlation is a poor substitute for causation. The conventional way to discover causal relationships is to use randomized controlled experiments (RCT); in many situations, however, these are impractical or sometimes unethical. Causal learning from observational data offers a promising alternative. While being relatively recent, causal learning aims to go far beyond conventional machine learning, yet several major challenges remain. Unfortunately, advances are hampered due to the lack of unified benchmark datasets, algorithms, metrics, and evaluation service interfaces for causal learning. In this paper, we introduce CausalBench, a transparent, fair, and easy-to-use evaluation platform, aiming to (a) enable the advancement of research in causal learning by facilitating scientific collaboration in novel algorithms, datasets, and metrics and (b) promote scientific objectivity, reproducibility, fairness, and awareness of bias in causal learning research. CausalBench provides services for benchmarking data, algorithms, models, and metrics, impacting the needs of a broad range of scientific and engineering disciplines.
{"title":"Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning","authors":"Ahmet Kapkiç, Pratanu Mandal, Shu Wan, Paras Sheth, Abhinav Gorantla, Yoonhyuk Choi, Huan Liu, K. Selçuk Candan","doi":"arxiv-2409.08419","DOIUrl":"https://doi.org/arxiv-2409.08419","url":null,"abstract":"While witnessing the exceptional success of machine learning (ML)\u0000technologies in many applications, users are starting to notice a critical\u0000shortcoming of ML: correlation is a poor substitute for causation. The\u0000conventional way to discover causal relationships is to use randomized\u0000controlled experiments (RCT); in many situations, however, these are\u0000impractical or sometimes unethical. Causal learning from observational data\u0000offers a promising alternative. While being relatively recent, causal learning\u0000aims to go far beyond conventional machine learning, yet several major\u0000challenges remain. Unfortunately, advances are hampered due to the lack of\u0000unified benchmark datasets, algorithms, metrics, and evaluation service\u0000interfaces for causal learning. In this paper, we introduce {em CausalBench},\u0000a transparent, fair, and easy-to-use evaluation platform, aiming to (a) enable\u0000the advancement of research in causal learning by facilitating scientific\u0000collaboration in novel algorithms, datasets, and metrics and (b) promote\u0000scientific objectivity, reproducibility, fairness, and awareness of bias in\u0000causal learning research. CausalBench provides services for benchmarking data,\u0000algorithms, models, and metrics, impacting the needs of a broad of scientific\u0000and engineering disciplines.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Theoretical guarantees in KL for Diffusion Flow Matching
Pub Date: 2024-09-12 | DOI: arxiv-2409.08311
Marta Gentiloni Silveri, Giovanni Conforti, Alain Durmus
Flow Matching (FM) (also referred to as stochastic interpolants or rectified flows) stands out as a class of generative models that aims to bridge in finite time the target distribution $\nu^\star$ with an auxiliary distribution $\mu$, leveraging a fixed coupling $\pi$ and a bridge which can either be deterministic or stochastic. These two ingredients define a path measure which can then be approximated by learning the drift of its Markovian projection. The main contribution of this paper is to provide relatively mild assumptions on $\nu^\star$, $\mu$ and $\pi$ to obtain non-asymptotic guarantees for Diffusion Flow Matching (DFM) models using as bridge the conditional distribution associated with the Brownian motion. More precisely, we establish bounds on the Kullback-Leibler divergence between the target distribution and the one generated by such DFM models under moment conditions on the score of $\nu^\star$, $\mu$ and $\pi$, and a standard $L^2$-drift-approximation error assumption.
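To fix ideas, here is the Brownian-bridge flow-matching setup and the schematic shape of such a guarantee, written in our own notation; the paper's precise statement, assumptions, and constants differ.

```latex
% Conditionally on (X_0, X_1) ~ \pi, let (X_t) be a Brownian bridge from
% X_0 at t = 0 to X_1 at t = 1. The drift of the Markovian projection is
% learned by least-squares regression:
b^\star(t, x) = \mathbb{E}\!\left[\frac{X_1 - X_t}{1 - t} \,\middle|\, X_t = x\right],
\qquad
\hat{b} \in \arg\min_{b} \; \mathbb{E}\left\| b(t, X_t) - \frac{X_1 - X_t}{1 - t} \right\|^2 .
% Schematic shape of the resulting bound (constants and exact terms omitted):
\mathrm{KL}\!\left(\nu^\star \,\middle\|\, \hat{\nu}\right)
\;\lesssim\; \varepsilon_{L^2}^2 \;+\; \text{(moment and discretization terms)},
% where \varepsilon_{L^2} is the L^2 drift-approximation error.
```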
{"title":"Theoretical guarantees in KL for Diffusion Flow Matching","authors":"Marta Gentiloni Silveri, Giovanni Conforti, Alain Durmus","doi":"arxiv-2409.08311","DOIUrl":"https://doi.org/arxiv-2409.08311","url":null,"abstract":"Flow Matching (FM) (also referred to as stochastic interpolants or rectified\u0000flows) stands out as a class of generative models that aims to bridge in finite\u0000time the target distribution $nu^star$ with an auxiliary distribution $mu$,\u0000leveraging a fixed coupling $pi$ and a bridge which can either be\u0000deterministic or stochastic. These two ingredients define a path measure which\u0000can then be approximated by learning the drift of its Markovian projection. The\u0000main contribution of this paper is to provide relatively mild assumptions on\u0000$nu^star$, $mu$ and $pi$ to obtain non-asymptotics guarantees for Diffusion\u0000Flow Matching (DFM) models using as bridge the conditional distribution\u0000associated with the Brownian motion. More precisely, we establish bounds on the\u0000Kullback-Leibler divergence between the target distribution and the one\u0000generated by such DFM models under moment conditions on the score of\u0000$nu^star$, $mu$ and $pi$, and a standard $L^2$-drift-approximation error\u0000assumption.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Localized Schrödinger Bridge Sampler
Pub Date: 2024-09-12 | DOI: arxiv-2409.07968
Georg A. Gottwald, Sebastian Reich
We consider the generative problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. In this paper, we build on previous work combining Schrödinger bridges and Langevin dynamics. A key bottleneck of this approach is the exponential dependence of the required training samples on the dimension, $d$, of the ambient state space. We propose a localization strategy which exploits conditional independence of conditional expectation values. Localization thus replaces a single high-dimensional Schrödinger bridge problem by $d$ low-dimensional Schrödinger bridge problems over the available training samples. As for the original approach, the localized sampler is stable and geometrically ergodic. The sampler also naturally extends to conditional sampling and to Bayesian inference. We demonstrate the performance of our proposed scheme through experiments on a Gaussian problem with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem.
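A minimal sketch of the localization step only, assuming some low-dimensional Schrödinger bridge routine is available; the solver, the neighborhood structure, and all names are placeholders rather than the authors' construction.

```python
import numpy as np

def localized_bridges(samples: np.ndarray, neighborhoods, solve_bridge):
    """Replace one d-dimensional Schrödinger bridge problem by d
    low-dimensional ones: coordinate i is bridged together with its
    conditioning neighborhood only, exploiting assumed conditional
    independence. `solve_bridge` is a stand-in for any low-dimensional
    Schrödinger bridge / Langevin sampler fitted to the given columns."""
    d = samples.shape[1]
    local_models = []
    for i in range(d):
        cols = [i] + sorted(neighborhoods[i])   # coordinate + its neighbors
        local_models.append(solve_bridge(samples[:, cols]))
    return local_models
```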
{"title":"Localized Schrödinger Bridge Sampler","authors":"Georg A. Gottwald, Sebastian Reich","doi":"arxiv-2409.07968","DOIUrl":"https://doi.org/arxiv-2409.07968","url":null,"abstract":"We consider the generative problem of sampling from an unknown distribution\u0000for which only a sufficiently large number of training samples are available.\u0000In this paper, we build on previous work combining Schr\"odinger bridges and\u0000Langevin dynamics. A key bottleneck of this approach is the exponential\u0000dependence of the required training samples on the dimension, $d$, of the\u0000ambient state space. We propose a localization strategy which exploits\u0000conditional independence of conditional expectation values. Localization thus\u0000replaces a single high-dimensional Schr\"odinger bridge problem by $d$\u0000low-dimensional Schr\"odinger bridge problems over the available training\u0000samples. As for the original approach, the localized sampler is stable and\u0000geometric ergodic. The sampler also naturally extends to conditional sampling\u0000and to Bayesian inference. We demonstrate the performance of our proposed\u0000scheme through experiments on a Gaussian problem with increasing dimensions and\u0000on a stochastic subgrid-scale parametrization conditional sampling problem.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dataset-Free Weight-Initialization on Restricted Boltzmann Machine
Pub Date: 2024-09-12 | DOI: arxiv-2409.07708
Muneki Yasuda, Ryosuke Maeno, Chako Takahashi
In feed-forward neural networks, dataset-free weight-initialization methods such as the LeCun, Xavier (or Glorot), and He initializations have been developed. These methods randomly determine the initial values of weight parameters based on specific distributions (e.g., Gaussian or uniform distributions) without using training datasets. To the best of the authors' knowledge, such a dataset-free weight-initialization method is yet to be developed for restricted Boltzmann machines (RBMs), which are probabilistic neural networks consisting of two layers. In this study, we derive a dataset-free weight-initialization method for Bernoulli--Bernoulli RBMs based on a statistical mechanical analysis. In the proposed weight-initialization method, the weight parameters are drawn from a Gaussian distribution with zero mean. The standard deviation of the Gaussian distribution is optimized based on our hypothesis that a standard deviation providing a larger layer correlation (LC) between the two layers improves the learning efficiency. The expression of the LC is derived based on a statistical mechanical analysis. The optimal value of the standard deviation corresponds to the maximum point of the LC. The proposed weight-initialization method is identical to Xavier initialization in a specific case (i.e., when the sizes of the two layers are the same, the random variables of the layers are $\{-1,1\}$-binary, and all bias parameters are zero).
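A sketch of the resulting initializer. The paper's closed-form optimal standard deviation is not reproduced here; as a stand-in we use the Xavier-style value, which, per the abstract, coincides with the optimum in the stated special case (equal layer sizes, $\{-1,1\}$-binary units, zero biases).

```python
import numpy as np

def init_rbm_weights(n_visible: int, n_hidden: int, sigma=None, rng=None):
    """Dataset-free Gaussian initialization for a Bernoulli-Bernoulli RBM.
    sigma should be the maximizer of the paper's layer-correlation (LC)
    criterion; we default to the Xavier value sqrt(2 / (n_v + n_h)) as a
    placeholder for that optimum."""
    rng = rng or np.random.default_rng()
    if sigma is None:
        sigma = np.sqrt(2.0 / (n_visible + n_hidden))
    W = rng.normal(0.0, sigma, size=(n_visible, n_hidden))  # zero-mean Gaussian
    b = np.zeros(n_visible)   # visible biases
    c = np.zeros(n_hidden)    # hidden biases
    return W, b, c
```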
{"title":"Dataset-Free Weight-Initialization on Restricted Boltzmann Machine","authors":"Muneki Yasuda, Ryosuke Maeno, Chako Takahashi","doi":"arxiv-2409.07708","DOIUrl":"https://doi.org/arxiv-2409.07708","url":null,"abstract":"In feed-forward neural networks, dataset-free weight-initialization method\u0000such as LeCun, Xavier (or Glorot), and He initializations have been developed.\u0000These methods randomly determine the initial values of weight parameters based\u0000on specific distributions (e.g., Gaussian or uniform distributions) without\u0000using training datasets. To the best of the authors' knowledge, such a\u0000dataset-free weight-initialization method is yet to be developed for restricted\u0000Boltzmann machines (RBMs), which are probabilistic neural networks consisting\u0000of two layers, In this study, we derive a dataset-free weight-initialization\u0000method for Bernoulli--Bernoulli RBMs based on a statistical mechanical\u0000analysis. In the proposed weight-initialization method, the weight parameters\u0000are drawn from a Gaussian distribution with zero mean. The standard deviation\u0000of the Gaussian distribution is optimized based on our hypothesis which is that\u0000a standard deviation providing a larger layer correlation (LC) between the two\u0000layers improves the learning efficiency. The expression of the LC is derived\u0000based on a statistical mechanical analysis. The optimal value of the standard\u0000deviation corresponds to the maximum point of the LC. The proposed\u0000weight-initialization method is identical to Xavier initialization in a\u0000specific case (i.e., in the case the sizes of the two layers are the same, the\u0000random variables of the layers are ${-1,1}$-binary, and all bias parameters\u0000are zero).","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Federated One-Shot Ensemble Clustering
Pub Date: 2024-09-12 | DOI: arxiv-2409.08396
Rui Duan, Xin Xiong, Jueyi Liu, Katherine P. Liao, Tianxi Cai
Cluster analysis across multiple institutions poses significant challenges due to data-sharing restrictions. To overcome these limitations, we introduce the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution tailored for multi-site analyses under such constraints. FONT requires only a single round of communication between sites and ensures privacy by exchanging only fitted model parameters and class labels. The algorithm combines locally fitted clustering models into a data-adaptive ensemble, making it broadly applicable to various clustering techniques and robust to differences in cluster proportions across sites. Our theoretical analysis validates the effectiveness of the data-adaptive weights learned by FONT, and simulation studies demonstrate its superior performance compared to existing benchmark methods. We applied FONT to identify subgroups of patients with rheumatoid arthritis across two health systems, revealing improved consistency of patient clusters across sites, while locally fitted clusters proved less transferable. FONT is particularly well-suited for real-world applications with stringent communication and privacy constraints, offering a scalable and practical solution for multi-site clustering.
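A deliberately simplified sketch of the one-shot protocol, with k-means centers playing the role of the shared model parameters. The data-adaptive ensemble weights and cross-site cluster label alignment of the actual algorithm are replaced here by uniform weights and distance-based soft votes.

```python
import numpy as np
from sklearn.cluster import KMeans

def font_sketch(local_datasets, n_clusters: int, seed: int = 0):
    """Single communication round: every site fits locally and shares only
    fitted parameters (cluster centers). Any site can then ensemble all
    shared models on its own data."""
    shared_centers = [
        KMeans(n_clusters, n_init=10, random_state=seed).fit(X).cluster_centers_
        for X in local_datasets
    ]

    def ensemble_assign(X: np.ndarray) -> np.ndarray:
        votes = np.zeros((len(X), n_clusters))
        for centers in shared_centers:
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
            soft = np.exp(-dist)
            votes += soft / soft.sum(axis=1, keepdims=True)  # uniform weights
        return votes.argmax(axis=1)

    return ensemble_assign
```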
{"title":"Federated One-Shot Ensemble Clustering","authors":"Rui Duan, Xin Xiong, Jueyi Liu, Katherine P. Liao, Tianxi Cai","doi":"arxiv-2409.08396","DOIUrl":"https://doi.org/arxiv-2409.08396","url":null,"abstract":"Cluster analysis across multiple institutions poses significant challenges\u0000due to data-sharing restrictions. To overcome these limitations, we introduce\u0000the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution\u0000tailored for multi-site analyses under such constraints. FONT requires only a\u0000single round of communication between sites and ensures privacy by exchanging\u0000only fitted model parameters and class labels. The algorithm combines locally\u0000fitted clustering models into a data-adaptive ensemble, making it broadly\u0000applicable to various clustering techniques and robust to differences in\u0000cluster proportions across sites. Our theoretical analysis validates the\u0000effectiveness of the data-adaptive weights learned by FONT, and simulation\u0000studies demonstrate its superior performance compared to existing benchmark\u0000methods. We applied FONT to identify subgroups of patients with rheumatoid\u0000arthritis across two health systems, revealing improved consistency of patient\u0000clusters across sites, while locally fitted clusters proved less transferable.\u0000FONT is particularly well-suited for real-world applications with stringent\u0000communication and privacy constraints, offering a scalable and practical\u0000solution for multi-site clustering.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Wasserstein Distributionally Robust Multiclass Support Vector Machine
Pub Date: 2024-09-12 | DOI: arxiv-2409.08409
Michael Ibrahim, Heraldo Rozas, Nagi Gebraeel
We study the problem of multiclass classification for settings where data features $\mathbf{x}$ and their labels $\mathbf{y}$ are uncertain. We identify that distributionally robust one-vs-all (OVA) classifiers often struggle in settings with imbalanced data. To address this issue, we use Wasserstein distributionally robust optimization to develop a robust version of the multiclass support vector machine (SVM) characterized by the Crammer-Singer (CS) loss. First, we prove that the CS loss is bounded from above by a Lipschitz continuous function for all $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$; then we exploit strong duality results to express the dual of the worst-case risk problem, and we show that the worst-case risk minimization problem admits a tractable convex reformulation due to the regularity of the CS loss. Moreover, we develop a kernel version of our proposed model to account for nonlinear class separation, and we show that it admits a tractable convex upper bound. We also propose a projected subgradient method algorithm for a special case of our proposed linear model to improve scalability. Our numerical experiments demonstrate that our model outperforms state-of-the-art OVA models in settings where the training data is highly imbalanced. We also show through experiments on popular real-world datasets that our proposed model often outperforms its regularized counterpart, as the former accounts for uncertain labels, unlike the latter.
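For reference, the Crammer-Singer loss at the heart of the robust model is $\ell(W; \mathbf{x}, y) = \max_j \big( \mathbb{1}[j \neq y] + \mathbf{w}_j^\top \mathbf{x} \big) - \mathbf{w}_y^\top \mathbf{x}$; here is a small NumPy sketch of it (variable names are ours).

```python
import numpy as np

def crammer_singer_loss(W: np.ndarray, x: np.ndarray, y: int) -> float:
    """Multiclass hinge loss of Crammer and Singer; W holds one weight
    vector per class as rows."""
    scores = W @ x
    margins = scores + 1.0
    margins[y] = scores[y]          # the true class gets no +1 margin
    return float(margins.max() - scores[y])
```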
{"title":"Wasserstein Distributionally Robust Multiclass Support Vector Machine","authors":"Michael Ibrahim, Heraldo Rozas, Nagi Gebraeel","doi":"arxiv-2409.08409","DOIUrl":"https://doi.org/arxiv-2409.08409","url":null,"abstract":"We study the problem of multiclass classification for settings where data\u0000features $mathbf{x}$ and their labels $mathbf{y}$ are uncertain. We identify\u0000that distributionally robust one-vs-all (OVA) classifiers often struggle in\u0000settings with imbalanced data. To address this issue, we use Wasserstein\u0000distributionally robust optimization to develop a robust version of the\u0000multiclass support vector machine (SVM) characterized by the Crammer-Singer\u0000(CS) loss. First, we prove that the CS loss is bounded from above by a\u0000Lipschitz continuous function for all $mathbf{x} in mathcal{X}$ and\u0000$mathbf{y} in mathcal{Y}$, then we exploit strong duality results to express\u0000the dual of the worst-case risk problem, and we show that the worst-case risk\u0000minimization problem admits a tractable convex reformulation due to the\u0000regularity of the CS loss. Moreover, we develop a kernel version of our\u0000proposed model to account for nonlinear class separation, and we show that it\u0000admits a tractable convex upper bound. We also propose a projected subgradient\u0000method algorithm for a special case of our proposed linear model to improve\u0000scalability. Our numerical experiments demonstrate that our model outperforms\u0000state-of-the art OVA models in settings where the training data is highly\u0000imbalanced. We also show through experiments on popular real-world datasets\u0000that our proposed model often outperforms its regularized counterpart as the\u0000first accounts for uncertain labels unlike the latter.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Synthetic continued pretraining
Pub Date: 2024-09-11 | DOI: arxiv-2409.07431
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto
Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient -- to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If, instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
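A schematic of the EntiGraph recipe, with the two LLM-backed steps left as stand-in callables; the paper's actual prompts, relation types, and sampling scheme are not reproduced here.

```python
import itertools
import random

def entigraph_sketch(documents, extract_entities, describe_relation,
                     n_samples: int = 1000, seed: int = 0):
    """Pool salient entities from the source corpus, then synthesize text
    about sampled entity pairs. `extract_entities(doc)` should return an
    iterable of entity strings; `describe_relation(a, b, documents)`
    should return synthetic text connecting entities a and b."""
    rng = random.Random(seed)
    entities = sorted({e for doc in documents for e in extract_entities(doc)})
    pairs = list(itertools.combinations(entities, 2))
    chosen = rng.sample(pairs, min(n_samples, len(pairs)))
    return [describe_relation(a, b, documents) for a, b in chosen]
```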
{"title":"Synthetic continued pretraining","authors":"Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto","doi":"arxiv-2409.07431","DOIUrl":"https://doi.org/arxiv-2409.07431","url":null,"abstract":"Pretraining on large-scale, unstructured internet text has enabled language\u0000models to acquire a significant amount of world knowledge. However, this\u0000knowledge acquisition is data-inefficient -- to learn a given fact, models must\u0000be trained on hundreds to thousands of diverse representations of it. This\u0000poses a challenge when adapting a pretrained model to a small corpus of\u0000domain-specific documents, where each fact may appear rarely or only once. We\u0000propose to bridge this gap with synthetic continued pretraining: using the\u0000small domain-specific corpus to synthesize a large corpus more amenable to\u0000learning, and then performing continued pretraining on the synthesized corpus.\u0000We instantiate this proposal with EntiGraph, a synthetic data augmentation\u0000algorithm that extracts salient entities from the source documents and then\u0000generates diverse text by drawing connections between the sampled entities.\u0000Synthetic continued pretraining using EntiGraph enables a language model to\u0000answer questions and follow generic instructions related to the source\u0000documents without access to them. If instead, the source documents are\u0000available at inference time, we show that the knowledge acquired through our\u0000approach compounds with retrieval-augmented generation. To better understand\u0000these results, we build a simple mathematical model of EntiGraph, and show how\u0000synthetic data augmentation can \"rearrange\" knowledge to enable more\u0000data-efficient learning.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions
Pub Date: 2024-09-11 | DOI: arxiv-2409.06960
Soheun Yi, John Alison, Mikael Kuusela
In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, it may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, in which they are concentrated in a certain area of the feature space. Considering the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter. We define the SR as an area which is most affected when the observed events are smeared with additive random noise. We overcome challenges in density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to the complementary observation of events that closely resemble the target events but are free of any signals. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that the method can efficiently identify a data-driven SR in a high-dimensional feature space in which a high portion of signal events concentrate.
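A toy sketch of the smearing idea: score each event by how much the local density drops once the sample is perturbed with additive Gaussian noise (a low-pass filter), so that localized high-frequency structure stands out. A kernel density estimate stands in for the paper's density-ratio estimator, which is what actually scales to high dimensions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def signal_region_score(events: np.ndarray, noise_scale: float = 0.1,
                        bandwidth: float = 0.2, seed: int = 0) -> np.ndarray:
    """Log-density ratio between the observed sample and its noise-smeared
    version; large values mark regions most attenuated by smearing,
    i.e. candidate Signal Regions."""
    rng = np.random.default_rng(seed)
    smeared = events + noise_scale * rng.standard_normal(events.shape)
    kde_raw = KernelDensity(bandwidth=bandwidth).fit(events)
    kde_smooth = KernelDensity(bandwidth=bandwidth).fit(smeared)
    return kde_raw.score_samples(events) - kde_smooth.score_samples(events)
```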
{"title":"Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions","authors":"Soheun Yi, John Alison, Mikael Kuusela","doi":"arxiv-2409.06960","DOIUrl":"https://doi.org/arxiv-2409.06960","url":null,"abstract":"In the search for new particles in high-energy physics, it is crucial to\u0000select the Signal Region (SR) in such a way that it is enriched with signal\u0000events if they are present. While most existing search methods set the region\u0000relying on prior domain knowledge, it may be unavailable for a completely novel\u0000particle that falls outside the current scope of understanding. We address this\u0000issue by proposing a method built upon a model-agnostic but often realistic\u0000assumption about the localized topology of the signal events, in which they are\u0000concentrated in a certain area of the feature space. Considering the signal\u0000component as a localized high-frequency feature, our approach employs the\u0000notion of a low-pass filter. We define the SR as an area which is most affected\u0000when the observed events are smeared with additive random noise. We overcome\u0000challenges in density estimation in the high-dimensional feature space by\u0000learning the density ratio of events that potentially include a signal to the\u0000complementary observation of events that closely resemble the target events but\u0000are free of any signals. By applying our method to simulated $mathrm{HH}\u0000rightarrow 4b$ events, we demonstrate that the method can efficiently identify\u0000a data-driven SR in a high-dimensional feature space in which a high portion of\u0000signal events concentrate.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Practical Theory of Generalization in Selectivity Learning
Pub Date: 2024-09-11 | DOI: arxiv-2409.07014
Peizhi Wu, Haoshu Xu, Ryan Marcus, Zachary G. Ives
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
{"title":"A Practical Theory of Generalization in Selectivity Learning","authors":"Peizhi Wu, Haoshu Xu, Ryan Marcus, Zachary G. Ives","doi":"arxiv-2409.07014","DOIUrl":"https://doi.org/arxiv-2409.07014","url":null,"abstract":"Query-driven machine learning models have emerged as a promising estimation\u0000technique for query selectivities. Yet, surprisingly little is known about the\u0000efficacy of these techniques from a theoretical perspective, as there exist\u0000substantial gaps between practical solutions and state-of-the-art (SOTA) theory\u0000based on the Probably Approximately Correct (PAC) learning framework. In this\u0000paper, we aim to bridge the gaps between theory and practice. First, we\u0000demonstrate that selectivity predictors induced by signed measures are\u0000learnable, which relaxes the reliance on probability measures in SOTA theory.\u0000More importantly, beyond the PAC learning framework (which only allows us to\u0000characterize how the model behaves when both training and test workloads are\u0000drawn from the same distribution), we establish, under mild assumptions, that\u0000selectivity predictors from this class exhibit favorable out-of-distribution\u0000(OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the\u0000in-distribution and OOD generalization capabilities of query-driven selectivity\u0000learning, and facilitate the design of two general strategies to improve OOD\u0000generalization for existing query-driven selectivity models. We empirically\u0000verify that our techniques help query-driven selectivity models generalize\u0000significantly better to OOD queries both in terms of prediction accuracy and\u0000query latency performance, while maintaining their superior in-distribution\u0000generalization performance.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0