Andrea Cavallo, Madeline Navarro, Santiago Segarra, Elvin Isufi
Covariance-based data processing is widespread across signal processing and machine learning applications due to its ability to model data interconnectivities and dependencies. However, harmful biases in the data may become encoded in the sample covariance matrix and cause data-driven methods to treat different subpopulations unfairly. Existing works such as fair principal component analysis (PCA) mitigate these effects, but remain unstable in low sample regimes, which in turn may jeopardize the fairness goal. To address both biases and instability, we propose Fair coVariance Neural Networks (FVNNs), which perform graph convolutions on the covariance matrix for both fair and accurate predictions. Our FVNNs provide a flexible model compatible with several existing bias mitigation techniques. In particular, FVNNs allow for mitigating the bias in two ways: first, they operate on fair covariance estimates that remove biases from their principal components; second, they are trained in an end-to-end fashion via a fairness regularizer in the loss function so that the model parameters are tailored to solve the task directly in a fair manner. We prove that FVNNs are intrinsically fairer than analogous PCA approaches thanks to their stability in low sample regimes. We validate the robustness and fairness of our model on synthetic and real-world data, showcasing the flexibility of FVNNs along with the tradeoff between fair and accurate performance.
{"title":"Fair CoVariance Neural Networks","authors":"Andrea Cavallo, Madeline Navarro, Santiago Segarra, Elvin Isufi","doi":"arxiv-2409.08558","DOIUrl":"https://doi.org/arxiv-2409.08558","url":null,"abstract":"Covariance-based data processing is widespread across signal processing and\u0000machine learning applications due to its ability to model data\u0000interconnectivities and dependencies. However, harmful biases in the data may\u0000become encoded in the sample covariance matrix and cause data-driven methods to\u0000treat different subpopulations unfairly. Existing works such as fair principal\u0000component analysis (PCA) mitigate these effects, but remain unstable in low\u0000sample regimes, which in turn may jeopardize the fairness goal. To address both\u0000biases and instability, we propose Fair coVariance Neural Networks (FVNNs),\u0000which perform graph convolutions on the covariance matrix for both fair and\u0000accurate predictions. Our FVNNs provide a flexible model compatible with\u0000several existing bias mitigation techniques. In particular, FVNNs allow for\u0000mitigating the bias in two ways: first, they operate on fair covariance\u0000estimates that remove biases from their principal components; second, they are\u0000trained in an end-to-end fashion via a fairness regularizer in the loss\u0000function so that the model parameters are tailored to solve the task directly\u0000in a fair manner. We prove that FVNNs are intrinsically fairer than analogous\u0000PCA approaches thanks to their stability in low sample regimes. We validate the\u0000robustness and fairness of our model on synthetic and real-world data,\u0000showcasing the flexibility of FVNNs along with the tradeoff between fair and\u0000accurate performance.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ahmet Kapkiç, Pratanu Mandal, Shu Wan, Paras Sheth, Abhinav Gorantla, Yoonhyuk Choi, Huan Liu, K. Selçuk Candan
While witnessing the exceptional success of machine learning (ML) technologies in many applications, users are starting to notice a critical shortcoming of ML: correlation is a poor substitute for causation. The conventional way to discover causal relationships is to use randomized controlled experiments (RCTs); in many situations, however, these are impractical or sometimes unethical. Causal learning from observational data offers a promising alternative. Although relatively recent, causal learning aims to go far beyond conventional machine learning, yet several major challenges remain. Unfortunately, advances are hampered by the lack of unified benchmark datasets, algorithms, metrics, and evaluation service interfaces for causal learning. In this paper, we introduce CausalBench, a transparent, fair, and easy-to-use evaluation platform that aims to (a) enable the advancement of research in causal learning by facilitating scientific collaboration on novel algorithms, datasets, and metrics and (b) promote scientific objectivity, reproducibility, fairness, and awareness of bias in causal learning research. CausalBench provides services for benchmarking data, algorithms, models, and metrics, addressing the needs of a broad range of scientific and engineering disciplines.
{"title":"Introducing CausalBench: A Flexible Benchmark Framework for Causal Analysis and Machine Learning","authors":"Ahmet Kapkiç, Pratanu Mandal, Shu Wan, Paras Sheth, Abhinav Gorantla, Yoonhyuk Choi, Huan Liu, K. Selçuk Candan","doi":"arxiv-2409.08419","DOIUrl":"https://doi.org/arxiv-2409.08419","url":null,"abstract":"While witnessing the exceptional success of machine learning (ML)\u0000technologies in many applications, users are starting to notice a critical\u0000shortcoming of ML: correlation is a poor substitute for causation. The\u0000conventional way to discover causal relationships is to use randomized\u0000controlled experiments (RCT); in many situations, however, these are\u0000impractical or sometimes unethical. Causal learning from observational data\u0000offers a promising alternative. While being relatively recent, causal learning\u0000aims to go far beyond conventional machine learning, yet several major\u0000challenges remain. Unfortunately, advances are hampered due to the lack of\u0000unified benchmark datasets, algorithms, metrics, and evaluation service\u0000interfaces for causal learning. In this paper, we introduce {em CausalBench},\u0000a transparent, fair, and easy-to-use evaluation platform, aiming to (a) enable\u0000the advancement of research in causal learning by facilitating scientific\u0000collaboration in novel algorithms, datasets, and metrics and (b) promote\u0000scientific objectivity, reproducibility, fairness, and awareness of bias in\u0000causal learning research. CausalBench provides services for benchmarking data,\u0000algorithms, models, and metrics, impacting the needs of a broad of scientific\u0000and engineering disciplines.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marta Gentiloni Silveri, Giovanni Conforti, Alain Durmus
Flow Matching (FM) (also referred to as stochastic interpolants or rectified flows) stands out as a class of generative models that aims to bridge in finite time the target distribution $\nu^\star$ with an auxiliary distribution $\mu$, leveraging a fixed coupling $\pi$ and a bridge which can either be deterministic or stochastic. These two ingredients define a path measure which can then be approximated by learning the drift of its Markovian projection. The main contribution of this paper is to provide relatively mild assumptions on $\nu^\star$, $\mu$ and $\pi$ to obtain non-asymptotic guarantees for Diffusion Flow Matching (DFM) models using as bridge the conditional distribution associated with the Brownian motion. More precisely, we establish bounds on the Kullback-Leibler divergence between the target distribution and the one generated by such DFM models under moment conditions on the score of $\nu^\star$, $\mu$ and $\pi$, and a standard $L^2$-drift-approximation error assumption.
{"title":"Theoretical guarantees in KL for Diffusion Flow Matching","authors":"Marta Gentiloni Silveri, Giovanni Conforti, Alain Durmus","doi":"arxiv-2409.08311","DOIUrl":"https://doi.org/arxiv-2409.08311","url":null,"abstract":"Flow Matching (FM) (also referred to as stochastic interpolants or rectified\u0000flows) stands out as a class of generative models that aims to bridge in finite\u0000time the target distribution $nu^star$ with an auxiliary distribution $mu$,\u0000leveraging a fixed coupling $pi$ and a bridge which can either be\u0000deterministic or stochastic. These two ingredients define a path measure which\u0000can then be approximated by learning the drift of its Markovian projection. The\u0000main contribution of this paper is to provide relatively mild assumptions on\u0000$nu^star$, $mu$ and $pi$ to obtain non-asymptotics guarantees for Diffusion\u0000Flow Matching (DFM) models using as bridge the conditional distribution\u0000associated with the Brownian motion. More precisely, we establish bounds on the\u0000Kullback-Leibler divergence between the target distribution and the one\u0000generated by such DFM models under moment conditions on the score of\u0000$nu^star$, $mu$ and $pi$, and a standard $L^2$-drift-approximation error\u0000assumption.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the generative problem of sampling from an unknown distribution for which only a sufficiently large number of training samples are available. In this paper, we build on previous work combining Schrödinger bridges and Langevin dynamics. A key bottleneck of this approach is the exponential dependence of the required training samples on the dimension, $d$, of the ambient state space. We propose a localization strategy which exploits conditional independence of conditional expectation values. Localization thus replaces a single high-dimensional Schrödinger bridge problem by $d$ low-dimensional Schrödinger bridge problems over the available training samples. As with the original approach, the localized sampler is stable and geometrically ergodic. The sampler also naturally extends to conditional sampling and to Bayesian inference. We demonstrate the performance of our proposed scheme through experiments on a Gaussian problem with increasing dimensions and on a stochastic subgrid-scale parametrization conditional sampling problem.
{"title":"Localized Schrödinger Bridge Sampler","authors":"Georg A. Gottwald, Sebastian Reich","doi":"arxiv-2409.07968","DOIUrl":"https://doi.org/arxiv-2409.07968","url":null,"abstract":"We consider the generative problem of sampling from an unknown distribution\u0000for which only a sufficiently large number of training samples are available.\u0000In this paper, we build on previous work combining Schr\"odinger bridges and\u0000Langevin dynamics. A key bottleneck of this approach is the exponential\u0000dependence of the required training samples on the dimension, $d$, of the\u0000ambient state space. We propose a localization strategy which exploits\u0000conditional independence of conditional expectation values. Localization thus\u0000replaces a single high-dimensional Schr\"odinger bridge problem by $d$\u0000low-dimensional Schr\"odinger bridge problems over the available training\u0000samples. As for the original approach, the localized sampler is stable and\u0000geometric ergodic. The sampler also naturally extends to conditional sampling\u0000and to Bayesian inference. We demonstrate the performance of our proposed\u0000scheme through experiments on a Gaussian problem with increasing dimensions and\u0000on a stochastic subgrid-scale parametrization conditional sampling problem.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In feed-forward neural networks, dataset-free weight-initialization methods such as LeCun, Xavier (or Glorot), and He initializations have been developed. These methods randomly determine the initial values of weight parameters based on specific distributions (e.g., Gaussian or uniform distributions) without using training datasets. To the best of the authors' knowledge, such a dataset-free weight-initialization method is yet to be developed for restricted Boltzmann machines (RBMs), which are probabilistic neural networks consisting of two layers. In this study, we derive a dataset-free weight-initialization method for Bernoulli--Bernoulli RBMs based on a statistical mechanical analysis. In the proposed method, the weight parameters are drawn from a Gaussian distribution with zero mean. The standard deviation of the Gaussian distribution is optimized based on our hypothesis that a standard deviation providing a larger layer correlation (LC) between the two layers improves the learning efficiency. The expression of the LC is derived from a statistical mechanical analysis, and the optimal standard deviation corresponds to the maximum point of the LC. The proposed weight-initialization method is identical to Xavier initialization in a specific case, i.e., when the two layers have the same size, the random variables of the layers are $\{-1,1\}$-binary, and all bias parameters are zero.
{"title":"Dataset-Free Weight-Initialization on Restricted Boltzmann Machine","authors":"Muneki Yasuda, Ryosuke Maeno, Chako Takahashi","doi":"arxiv-2409.07708","DOIUrl":"https://doi.org/arxiv-2409.07708","url":null,"abstract":"In feed-forward neural networks, dataset-free weight-initialization method\u0000such as LeCun, Xavier (or Glorot), and He initializations have been developed.\u0000These methods randomly determine the initial values of weight parameters based\u0000on specific distributions (e.g., Gaussian or uniform distributions) without\u0000using training datasets. To the best of the authors' knowledge, such a\u0000dataset-free weight-initialization method is yet to be developed for restricted\u0000Boltzmann machines (RBMs), which are probabilistic neural networks consisting\u0000of two layers, In this study, we derive a dataset-free weight-initialization\u0000method for Bernoulli--Bernoulli RBMs based on a statistical mechanical\u0000analysis. In the proposed weight-initialization method, the weight parameters\u0000are drawn from a Gaussian distribution with zero mean. The standard deviation\u0000of the Gaussian distribution is optimized based on our hypothesis which is that\u0000a standard deviation providing a larger layer correlation (LC) between the two\u0000layers improves the learning efficiency. The expression of the LC is derived\u0000based on a statistical mechanical analysis. The optimal value of the standard\u0000deviation corresponds to the maximum point of the LC. The proposed\u0000weight-initialization method is identical to Xavier initialization in a\u0000specific case (i.e., in the case the sizes of the two layers are the same, the\u0000random variables of the layers are ${-1,1}$-binary, and all bias parameters\u0000are zero).","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rui Duan, Xin Xiong, Jueyi Liu, Katherine P. Liao, Tianxi Cai
Cluster analysis across multiple institutions poses significant challenges due to data-sharing restrictions. To overcome these limitations, we introduce the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution tailored for multi-site analyses under such constraints. FONT requires only a single round of communication between sites and ensures privacy by exchanging only fitted model parameters and class labels. The algorithm combines locally fitted clustering models into a data-adaptive ensemble, making it broadly applicable to various clustering techniques and robust to differences in cluster proportions across sites. Our theoretical analysis validates the effectiveness of the data-adaptive weights learned by FONT, and simulation studies demonstrate its superior performance compared to existing benchmark methods. We applied FONT to identify subgroups of patients with rheumatoid arthritis across two health systems, revealing improved consistency of patient clusters across sites, while locally fitted clusters proved less transferable. FONT is particularly well-suited for real-world applications with stringent communication and privacy constraints, offering a scalable and practical solution for multi-site clustering.
{"title":"Federated One-Shot Ensemble Clustering","authors":"Rui Duan, Xin Xiong, Jueyi Liu, Katherine P. Liao, Tianxi Cai","doi":"arxiv-2409.08396","DOIUrl":"https://doi.org/arxiv-2409.08396","url":null,"abstract":"Cluster analysis across multiple institutions poses significant challenges\u0000due to data-sharing restrictions. To overcome these limitations, we introduce\u0000the Federated One-shot Ensemble Clustering (FONT) algorithm, a novel solution\u0000tailored for multi-site analyses under such constraints. FONT requires only a\u0000single round of communication between sites and ensures privacy by exchanging\u0000only fitted model parameters and class labels. The algorithm combines locally\u0000fitted clustering models into a data-adaptive ensemble, making it broadly\u0000applicable to various clustering techniques and robust to differences in\u0000cluster proportions across sites. Our theoretical analysis validates the\u0000effectiveness of the data-adaptive weights learned by FONT, and simulation\u0000studies demonstrate its superior performance compared to existing benchmark\u0000methods. We applied FONT to identify subgroups of patients with rheumatoid\u0000arthritis across two health systems, revealing improved consistency of patient\u0000clusters across sites, while locally fitted clusters proved less transferable.\u0000FONT is particularly well-suited for real-world applications with stringent\u0000communication and privacy constraints, offering a scalable and practical\u0000solution for multi-site clustering.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the problem of multiclass classification for settings where data features $\mathbf{x}$ and their labels $\mathbf{y}$ are uncertain. We identify that distributionally robust one-vs-all (OVA) classifiers often struggle in settings with imbalanced data. To address this issue, we use Wasserstein distributionally robust optimization to develop a robust version of the multiclass support vector machine (SVM) characterized by the Crammer-Singer (CS) loss. First, we prove that the CS loss is bounded from above by a Lipschitz continuous function for all $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{Y}$; then we exploit strong duality results to express the dual of the worst-case risk problem, and we show that the worst-case risk minimization problem admits a tractable convex reformulation due to the regularity of the CS loss. Moreover, we develop a kernel version of our proposed model to account for nonlinear class separation, and we show that it admits a tractable convex upper bound. We also propose a projected subgradient method algorithm for a special case of our proposed linear model to improve scalability. Our numerical experiments demonstrate that our model outperforms state-of-the-art OVA models in settings where the training data is highly imbalanced. We also show through experiments on popular real-world datasets that our proposed model often outperforms its regularized counterpart, as the former accounts for uncertain labels, unlike the latter.
{"title":"Wasserstein Distributionally Robust Multiclass Support Vector Machine","authors":"Michael Ibrahim, Heraldo Rozas, Nagi Gebraeel","doi":"arxiv-2409.08409","DOIUrl":"https://doi.org/arxiv-2409.08409","url":null,"abstract":"We study the problem of multiclass classification for settings where data\u0000features $mathbf{x}$ and their labels $mathbf{y}$ are uncertain. We identify\u0000that distributionally robust one-vs-all (OVA) classifiers often struggle in\u0000settings with imbalanced data. To address this issue, we use Wasserstein\u0000distributionally robust optimization to develop a robust version of the\u0000multiclass support vector machine (SVM) characterized by the Crammer-Singer\u0000(CS) loss. First, we prove that the CS loss is bounded from above by a\u0000Lipschitz continuous function for all $mathbf{x} in mathcal{X}$ and\u0000$mathbf{y} in mathcal{Y}$, then we exploit strong duality results to express\u0000the dual of the worst-case risk problem, and we show that the worst-case risk\u0000minimization problem admits a tractable convex reformulation due to the\u0000regularity of the CS loss. Moreover, we develop a kernel version of our\u0000proposed model to account for nonlinear class separation, and we show that it\u0000admits a tractable convex upper bound. We also propose a projected subgradient\u0000method algorithm for a special case of our proposed linear model to improve\u0000scalability. Our numerical experiments demonstrate that our model outperforms\u0000state-of-the art OVA models in settings where the training data is highly\u0000imbalanced. We also show through experiments on popular real-world datasets\u0000that our proposed model often outperforms its regularized counterpart as the\u0000first accounts for uncertain labels unlike the latter.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto
Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient -- to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
{"title":"Synthetic continued pretraining","authors":"Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto","doi":"arxiv-2409.07431","DOIUrl":"https://doi.org/arxiv-2409.07431","url":null,"abstract":"Pretraining on large-scale, unstructured internet text has enabled language\u0000models to acquire a significant amount of world knowledge. However, this\u0000knowledge acquisition is data-inefficient -- to learn a given fact, models must\u0000be trained on hundreds to thousands of diverse representations of it. This\u0000poses a challenge when adapting a pretrained model to a small corpus of\u0000domain-specific documents, where each fact may appear rarely or only once. We\u0000propose to bridge this gap with synthetic continued pretraining: using the\u0000small domain-specific corpus to synthesize a large corpus more amenable to\u0000learning, and then performing continued pretraining on the synthesized corpus.\u0000We instantiate this proposal with EntiGraph, a synthetic data augmentation\u0000algorithm that extracts salient entities from the source documents and then\u0000generates diverse text by drawing connections between the sampled entities.\u0000Synthetic continued pretraining using EntiGraph enables a language model to\u0000answer questions and follow generic instructions related to the source\u0000documents without access to them. If instead, the source documents are\u0000available at inference time, we show that the knowledge acquired through our\u0000approach compounds with retrieval-augmented generation. To better understand\u0000these results, we build a simple mathematical model of EntiGraph, and show how\u0000synthetic data augmentation can \"rearrange\" knowledge to enable more\u0000data-efficient learning.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, it may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, in which they are concentrated in a certain area of the feature space. Considering the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter. We define the SR as an area which is most affected when the observed events are smeared with additive random noise. We overcome challenges in density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to the complementary observation of events that closely resemble the target events but are free of any signals. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that the method can efficiently identify a data-driven SR in a high-dimensional feature space in which a high portion of signal events concentrate.
{"title":"Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions","authors":"Soheun Yi, John Alison, Mikael Kuusela","doi":"arxiv-2409.06960","DOIUrl":"https://doi.org/arxiv-2409.06960","url":null,"abstract":"In the search for new particles in high-energy physics, it is crucial to\u0000select the Signal Region (SR) in such a way that it is enriched with signal\u0000events if they are present. While most existing search methods set the region\u0000relying on prior domain knowledge, it may be unavailable for a completely novel\u0000particle that falls outside the current scope of understanding. We address this\u0000issue by proposing a method built upon a model-agnostic but often realistic\u0000assumption about the localized topology of the signal events, in which they are\u0000concentrated in a certain area of the feature space. Considering the signal\u0000component as a localized high-frequency feature, our approach employs the\u0000notion of a low-pass filter. We define the SR as an area which is most affected\u0000when the observed events are smeared with additive random noise. We overcome\u0000challenges in density estimation in the high-dimensional feature space by\u0000learning the density ratio of events that potentially include a signal to the\u0000complementary observation of events that closely resemble the target events but\u0000are free of any signals. By applying our method to simulated $mathrm{HH}\u0000rightarrow 4b$ events, we demonstrate that the method can efficiently identify\u0000a data-driven SR in a high-dimensional feature space in which a high portion of\u0000signal events concentrate.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Peizhi Wu, Haoshu Xu, Ryan Marcus, Zachary G. Ives
Query-driven machine learning models have emerged as a promising estimation technique for query selectivities. Yet, surprisingly little is known about the efficacy of these techniques from a theoretical perspective, as there exist substantial gaps between practical solutions and state-of-the-art (SOTA) theory based on the Probably Approximately Correct (PAC) learning framework. In this paper, we aim to bridge the gaps between theory and practice. First, we demonstrate that selectivity predictors induced by signed measures are learnable, which relaxes the reliance on probability measures in SOTA theory. More importantly, beyond the PAC learning framework (which only allows us to characterize how the model behaves when both training and test workloads are drawn from the same distribution), we establish, under mild assumptions, that selectivity predictors from this class exhibit favorable out-of-distribution (OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the in-distribution and OOD generalization capabilities of query-driven selectivity learning, and facilitate the design of two general strategies to improve OOD generalization for existing query-driven selectivity models. We empirically verify that our techniques help query-driven selectivity models generalize significantly better to OOD queries both in terms of prediction accuracy and query latency performance, while maintaining their superior in-distribution generalization performance.
{"title":"A Practical Theory of Generalization in Selectivity Learning","authors":"Peizhi Wu, Haoshu Xu, Ryan Marcus, Zachary G. Ives","doi":"arxiv-2409.07014","DOIUrl":"https://doi.org/arxiv-2409.07014","url":null,"abstract":"Query-driven machine learning models have emerged as a promising estimation\u0000technique for query selectivities. Yet, surprisingly little is known about the\u0000efficacy of these techniques from a theoretical perspective, as there exist\u0000substantial gaps between practical solutions and state-of-the-art (SOTA) theory\u0000based on the Probably Approximately Correct (PAC) learning framework. In this\u0000paper, we aim to bridge the gaps between theory and practice. First, we\u0000demonstrate that selectivity predictors induced by signed measures are\u0000learnable, which relaxes the reliance on probability measures in SOTA theory.\u0000More importantly, beyond the PAC learning framework (which only allows us to\u0000characterize how the model behaves when both training and test workloads are\u0000drawn from the same distribution), we establish, under mild assumptions, that\u0000selectivity predictors from this class exhibit favorable out-of-distribution\u0000(OOD) generalization error bounds. These theoretical advances provide us with a better understanding of both the\u0000in-distribution and OOD generalization capabilities of query-driven selectivity\u0000learning, and facilitate the design of two general strategies to improve OOD\u0000generalization for existing query-driven selectivity models. We empirically\u0000verify that our techniques help query-driven selectivity models generalize\u0000significantly better to OOD queries both in terms of prediction accuracy and\u0000query latency performance, while maintaining their superior in-distribution\u0000generalization performance.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142206636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}