Partitioning a data set by one or more of its attributes and computing an aggregate for each part is one of the most common operations in data analyses. There are use cases where the partitioning is determined dynamically by collapsing smaller subsets into larger ones, to ensure sufficient support for the computed aggregate. These use cases are not supported by software implementing split-apply-combine types of operations. This paper presents the R package accumulate that offers convenient interfaces for defining grouped aggregation where the grouping itself is dynamically determined, based on user-defined conditions on subsets, and a user-defined subset collapsing scheme. The formal underlying algorithm is described and analyzed as well.
{"title":"Split-Apply-Combine with Dynamic Grouping","authors":"Mark P. J. van der Loo","doi":"arxiv-2406.09887","DOIUrl":"https://doi.org/arxiv-2406.09887","url":null,"abstract":"Partitioning a data set by one or more of its attributes and computing an\u0000aggregate for each part is one of the most common operations in data analyses.\u0000There are use cases where the partitioning is determined dynamically by\u0000collapsing smaller subsets into larger ones, to ensure sufficient support for\u0000the computed aggregate. These use cases are not supported by software\u0000implementing split-apply-combine types of operations. This paper presents the\u0000texttt{R} package texttt{accumulate} that offers convenient interfaces for\u0000defining grouped aggregation where the grouping itself is dynamically\u0000determined, based on user-defined conditions on subsets, and a user-defined\u0000subset collapsing scheme. The formal underlying algorithm is described and\u0000analyzed as well.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent variable models are widely used in social and behavioural sciences, such as education, psychology, and political science. In recent years, high-dimensional latent variable models have become increasingly common for analysing large and complex data. Estimating high-dimensional latent variable models using marginal maximum likelihood is computationally demanding due to the complexity of integrals involved. To address this challenge, stochastic optimisation, which combines stochastic approximation and sampling techniques, has been shown to be effective. This method iterates between two steps -- (1) sampling the latent variables from their posterior distribution based on the current parameter estimate, and (2) updating the fixed parameters using an approximate stochastic gradient constructed from the latent variable samples. In this paper, we propose a computationally more efficient stochastic optimisation algorithm. This improvement is achieved through the use of a minibatch of observations when sampling latent variables and constructing stochastic gradients, and an unadjusted Langevin sampler that utilises the gradient of the negative complete-data log-likelihood to sample latent variables. Theoretical results are established for the proposed algorithm, showing that the iterative parameter update converges to the marginal maximum likelihood estimate as the number of iterations goes to infinity. Furthermore, the proposed algorithm is shown to scale well to high-dimensional settings through simulation studies and a personality test application with 30,000 respondents, 300 items, and 30 latent dimensions.
{"title":"Learning High-dimensional Latent Variable Models via Doubly Stochastic Optimisation by Unadjusted Langevin","authors":"Motonori Oka, Yunxiao Chen, Irini Mounstaki","doi":"arxiv-2406.09311","DOIUrl":"https://doi.org/arxiv-2406.09311","url":null,"abstract":"Latent variable models are widely used in social and behavioural sciences,\u0000such as education, psychology, and political science. In recent years,\u0000high-dimensional latent variable models have become increasingly common for\u0000analysing large and complex data. Estimating high-dimensional latent variable\u0000models using marginal maximum likelihood is computationally demanding due to\u0000the complexity of integrals involved. To address this challenge, stochastic\u0000optimisation, which combines stochastic approximation and sampling techniques,\u0000has been shown to be effective. This method iterates between two steps -- (1)\u0000sampling the latent variables from their posterior distribution based on the\u0000current parameter estimate, and (2) updating the fixed parameters using an\u0000approximate stochastic gradient constructed from the latent variable samples.\u0000In this paper, we propose a computationally more efficient stochastic\u0000optimisation algorithm. This improvement is achieved through the use of a\u0000minibatch of observations when sampling latent variables and constructing\u0000stochastic gradients, and an unadjusted Langevin sampler that utilises the\u0000gradient of the negative complete-data log-likelihood to sample latent\u0000variables. Theoretical results are established for the proposed algorithm,\u0000showing that the iterative parameter update converges to the marginal maximum\u0000likelihood estimate as the number of iterations goes to infinity. Furthermore,\u0000the proposed algorithm is shown to scale well to high-dimensional settings\u0000through simulation studies and a personality test application with 30,000\u0000respondents, 300 items, and 30 latent dimensions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In two-sided marketplaces such as online flea markets, recommender systems for providing consumers with personalized item rankings play a key role in promoting transactions between providers and consumers. Meanwhile, two-sided marketplaces face the problem of balancing consumer satisfaction and fairness among items to stimulate activity of item providers. Saito and Joachims (2022) devised an impact-based fair ranking method for maximizing the Nash social welfare based on fair division; however, this method, which requires solving a large-scale constrained nonlinear optimization problem, is very difficult to apply to practical-scale recommender systems. We thus propose a fast solution to the impact-based fair ranking problem. We first transform the fair ranking problem into an unconstrained optimization problem and then design a gradient ascent method that repeatedly executes the Sinkhorn algorithm. Experimental results demonstrate that our algorithm provides fair rankings of high quality and is about 1000 times faster than applying commercial optimization software.
{"title":"Fast solution to the fair ranking problem using the Sinkhorn algorithm","authors":"Yuki Uehara, Shunnosuke Ikeda, Naoki Nishimura, Koya Ohashi, Yilin Li, Jie Yang, Deddy Jobson, Xingxia Zha, Takeshi Matsumoto, Noriyoshi Sukegawa, Yuichi Takano","doi":"arxiv-2406.10262","DOIUrl":"https://doi.org/arxiv-2406.10262","url":null,"abstract":"In two-sided marketplaces such as online flea markets, recommender systems\u0000for providing consumers with personalized item rankings play a key role in\u0000promoting transactions between providers and consumers. Meanwhile, two-sided\u0000marketplaces face the problem of balancing consumer satisfaction and fairness\u0000among items to stimulate activity of item providers. Saito and Joachims (2022)\u0000devised an impact-based fair ranking method for maximizing the Nash social\u0000welfare based on fair division; however, this method, which requires solving a\u0000large-scale constrained nonlinear optimization problem, is very difficult to\u0000apply to practical-scale recommender systems. We thus propose a fast solution\u0000to the impact-based fair ranking problem. We first transform the fair ranking\u0000problem into an unconstrained optimization problem and then design a gradient\u0000ascent method that repeatedly executes the Sinkhorn algorithm. Experimental\u0000results demonstrate that our algorithm provides fair rankings of high quality\u0000and is about 1000 times faster than application of commercial optimization\u0000software.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used statistical tests for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their test statistics can be expensive to compute, as they require three distinct Euclidean distance (or kernel) matrices between samples, where the time complexity of each of these computations (namely, $O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically with the number of samples ($n_x$, $n_y$) and linearly with the number of variables ($p$). Since the standard permutation test requires repeated re-computation of these expensive statistics, its application to large datasets can become infeasible. While several statistical approaches have been proposed to mitigate this issue, they all sacrifice desirable statistical properties to decrease the computational cost (e.g., they trade statistical power for computation speed). A better computational strategy is to first pre-compute the Euclidean distance (kernel) matrix of the concatenated data, and then permute indexes and retrieve the corresponding elements to compute the re-sampled statistics. While this strategy can reduce the computation cost relative to the standard permutation test, it relies on the computation of a larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$. In this paper, we present a novel computationally efficient permutation algorithm which only requires the pre-computation of the three smaller matrices and achieves large computational speedups without sacrificing finite-sample validity or statistical power. We illustrate its computational gains in a series of experiments and compare its statistical power to the current state-of-the-art approach for balancing computational cost and statistical performance.
{"title":"Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics","authors":"Elias Chaibub Neto","doi":"arxiv-2406.06488","DOIUrl":"https://doi.org/arxiv-2406.06488","url":null,"abstract":"Non-parametric two-sample tests based on energy distance or maximum mean\u0000discrepancy are widely used statistical tests for comparing multivariate data\u0000from two populations. While these tests enjoy desirable statistical properties,\u0000their test statistics can be expensive to compute as they require the\u0000computation of 3 distinct Euclidean distance (or kernel) matrices between\u0000samples, where the time complexity of each of these computations (namely,\u0000$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\u0000with the number of samples ($n_x$, $n_y$) and linearly with the number of\u0000variables ($p$). Since the standard permutation test requires repeated\u0000re-computations of these expensive statistics it's application to large\u0000datasets can become unfeasible. While several statistical approaches have been\u0000proposed to mitigate this issue, they all sacrifice desirable statistical\u0000properties to decrease the computational cost (e.g., trade computation speed by\u0000a decrease in statistical power). A better computational strategy is to first\u0000pre-compute the Euclidean distance (kernel) matrix of the concatenated data,\u0000and then permute indexes and retrieve the corresponding elements to compute the\u0000re-sampled statistics. While this strategy can reduce the computation cost\u0000relative to the standard permutation test, it relies on the computation of a\u0000larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\u0000In this paper, we present a novel computationally efficient permutation\u0000algorithm which only requires the pre-computation of the 3 smaller matrices and\u0000achieves large computational speedups without sacrificing finite-sample\u0000validity or statistical power. We illustrate its computational gains in a\u0000series of experiments and compare its statistical power to the current\u0000state-of-the-art approach for balancing computational cost and statistical\u0000performance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.
{"title":"Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking","authors":"Thomas Le Menestrel, Manuel Rivas","doi":"arxiv-2406.05738","DOIUrl":"https://doi.org/arxiv-2406.05738","url":null,"abstract":"Docking is a crucial component in drug discovery aimed at predicting the\u0000binding conformation and affinity between small molecules and target proteins.\u0000ML-based docking has recently emerged as a prominent approach, outpacing\u0000traditional methods like DOCK and AutoDock Vina in handling the growing scale\u0000and complexity of molecular libraries. However, the availability of\u0000comprehensive and user-friendly datasets for training and benchmarking ML-based\u0000docking algorithms remains limited. We introduce Smiles2Dock, an open\u0000large-scale multi-task dataset for molecular docking. We created a framework\u0000combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL\u0000database against 15 AlphaFold proteins, giving us more than 25 million\u0000protein-ligand binding scores. The dataset leverages a wide range of\u0000high-accuracy AlphaFold protein models, encompasses a diverse set of\u0000biologically relevant compounds and enables researchers to benchmark all major\u0000approaches for ML-based docking such as Graph, Transformer and CNN-based\u0000methods. We also introduce a novel Transformer-based architecture for docking\u0000scores prediction and set it as an initial benchmark for our dataset. Our\u0000dataset and code are publicly available to support the development of novel\u0000ML-based methods for molecular docking to advance scientific research in this\u0000field.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To obtain high-resolution images of subsurface structures from seismic data, seismic imaging techniques such as Full Waveform Inversion (FWI) serve as crucial tools. However, FWI involves solving a nonlinear and often non-unique inverse problem, presenting challenges such as local minima trapping and inadequate handling of inherent uncertainties. In addressing these challenges, we propose leveraging deep generative models as the prior distribution of geophysical parameters for stochastic Bayesian inversion. This approach integrates the adjoint state gradient for efficient back-propagation from the numerical solution of partial differential equations. Additionally, we introduce explicit and implicit variational Bayesian inference methods. The explicit method computes variational distribution density using a normalizing flow-based neural network, enabling computation of the Bayesian posterior of parameters. Conversely, the implicit method employs an inference network attached to a pretrained generative model to estimate density, incorporating an entropy estimator. We also experimented with the Stein Variational Gradient Descent (SVGD) method, a particle-based variational inference technique. We compare these variational Bayesian inference methods with conventional Markov chain Monte Carlo (MCMC) sampling. Each method is able to quantify uncertainties and to generate seismic data-conditioned realizations of subsurface geophysical parameters. This framework provides insights into subsurface structures while accounting for inherent uncertainties.
{"title":"Stochastic full waveform inversion with deep generative prior for uncertainty quantification","authors":"Yuke Xie, Hervé Chauris, Nicolas Desassis","doi":"arxiv-2406.04859","DOIUrl":"https://doi.org/arxiv-2406.04859","url":null,"abstract":"To obtain high-resolution images of subsurface structures from seismic data,\u0000seismic imaging techniques such as Full Waveform Inversion (FWI) serve as\u0000crucial tools. However, FWI involves solving a nonlinear and often non-unique\u0000inverse problem, presenting challenges such as local minima trapping and\u0000inadequate handling of inherent uncertainties. In addressing these challenges,\u0000we propose leveraging deep generative models as the prior distribution of\u0000geophysical parameters for stochastic Bayesian inversion. This approach\u0000integrates the adjoint state gradient for efficient back-propagation from the\u0000numerical solution of partial differential equations. Additionally, we\u0000introduce explicit and implicit variational Bayesian inference methods. The\u0000explicit method computes variational distribution density using a normalizing\u0000flow-based neural network, enabling computation of the Bayesian posterior of\u0000parameters. Conversely, the implicit method employs an inference network\u0000attached to a pretrained generative model to estimate density, incorporating an\u0000entropy estimator. Furthermore, we also experimented with the Stein Variational\u0000Gradient Descent (SVGD) method as another variational inference technique,\u0000using particles. We compare these variational Bayesian inference methods with\u0000conventional Markov chain Monte Carlo (McMC) sampling. Each method is able to\u0000quantify uncertainties and to generate seismic data-conditioned realizations of\u0000subsurface geophysical parameters. This framework provides insights into\u0000subsurface structures while accounting for inherent uncertainties.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we provide a multiscale perspective on the problem of maximum marginal likelihood estimation. We consider and analyse a diffusion-based maximum marginal likelihood estimation scheme using ideas from multiscale dynamics. Our perspective is based on stochastic averaging; we make an explicit connection between ideas in applied probability and parameter inference in computational statistics. In particular, we consider a general class of coupled Langevin diffusions for joint inference of latent variables and parameters in statistical models, where the latent variables are sampled from a fast Langevin process (which acts as a sampler), and the parameters are updated using a slow Langevin process (which acts as an optimiser). We show that the resulting system of stochastic differential equations (SDEs) can be viewed as a two-time scale system. To demonstrate the utility of such a perspective, we show that the averaged parameter dynamics obtained in the limit of scale separation can be used to estimate the optimal parameter, within the strongly convex setting. We do this by using recent uniform-in-time non-asymptotic averaging bounds. Finally, we conclude by showing that the slow-fast algorithm we consider here, termed Slow-Fast Langevin Algorithm, performs on par with state-of-the-art methods on a variety of examples. We believe that the stochastic averaging approach we provide in this paper enables us to look at these algorithms from a fresh angle, as well as unlocking the path to develop and analyse new methods using well-established averaging principles.
{"title":"A Multiscale Perspective on Maximum Marginal Likelihood Estimation","authors":"O. Deniz Akyildiz, Iain Souttar, Michela Ottobre","doi":"arxiv-2406.04187","DOIUrl":"https://doi.org/arxiv-2406.04187","url":null,"abstract":"In this paper, we provide a multiscale perspective on the problem of maximum\u0000marginal likelihood estimation. We consider and analyse a diffusion-based\u0000maximum marginal likelihood estimation scheme using ideas from multiscale\u0000dynamics. Our perspective is based on stochastic averaging; we make an explicit\u0000connection between ideas in applied probability and parameter inference in\u0000computational statistics. In particular, we consider a general class of coupled\u0000Langevin diffusions for joint inference of latent variables and parameters in\u0000statistical models, where the latent variables are sampled from a fast Langevin\u0000process (which acts as a sampler), and the parameters are updated using a slow\u0000Langevin process (which acts as an optimiser). We show that the resulting\u0000system of stochastic differential equations (SDEs) can be viewed as a two-time\u0000scale system. To demonstrate the utility of such a perspective, we show that\u0000the averaged parameter dynamics obtained in the limit of scale separation can\u0000be used to estimate the optimal parameter, within the strongly convex setting.\u0000We do this by using recent uniform-in-time non-asymptotic averaging bounds.\u0000Finally, we conclude by showing that the slow-fast algorithm we consider here,\u0000termed Slow-Fast Langevin Algorithm, performs on par with state-of-the-art\u0000methods on a variety of examples. We believe that the stochastic averaging\u0000approach we provide in this paper enables us to look at these algorithms from a\u0000fresh angle, as well as unlocking the path to develop and analyse new methods\u0000using well-established averaging principles.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a central problem in data analysis in collider physics. This intricate task involves estimating the latent structure of a jet (binary tree) and parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.
{"title":"Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics","authors":"Hanming Yang, Antonio Khalil Moretti, Sebastian Macaluso, Philippe Chlenski, Christian A. Naesseth, Itsik Pe'er","doi":"arxiv-2406.03242","DOIUrl":"https://doi.org/arxiv-2406.03242","url":null,"abstract":"Reconstructing jets, which provide vital insights into the properties and\u0000histories of subatomic particles produced in high-energy collisions, is a main\u0000problem in data analyses in collider physics. This intricate task deals with\u0000estimating the latent structure of a jet (binary tree) and involves parameters\u0000such as particle energy, momentum, and types. While Bayesian methods offer a\u0000natural approach for handling uncertainty and leveraging prior knowledge, they\u0000face significant challenges due to the super-exponential growth of potential\u0000jet topologies as the number of observed particles increases. To address this,\u0000we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet\u0000latent structures. As a second contribution, we leverage the resulting\u0000estimator to develop a variational inference algorithm for parameter learning.\u0000Building on this, we introduce a variational family using a pseudo-marginal\u0000framework for a fully Bayesian treatment of all variables, unifying the\u0000generative model with the inference process. We illustrate our method's\u0000effectiveness through experiments using data generated with a collider physics\u0000generative model, highlighting superior speed and accuracy across a range of\u0000tasks.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The sampling importance resampling method is widely utilized in various fields, such as numerical integration and statistical simulation. In this paper, two modified methods are presented by incorporating two variance reduction techniques commonly used in Monte Carlo simulation, namely antithetic sampling and Latin hypercube sampling, into the sampling importance resampling procedure. Theoretical evidence is provided to demonstrate that the proposed methods significantly reduce estimation errors compared to the original approach. Furthermore, the effectiveness and advantages of the proposed methods are validated through both numerical studies and real data analysis.
{"title":"Variance-reduced sampling importance resampling","authors":"Yao Xiao, Kang Fu, Kun Li","doi":"arxiv-2406.01864","DOIUrl":"https://doi.org/arxiv-2406.01864","url":null,"abstract":"The sampling importance resampling method is widely utilized in various\u0000fields, such as numerical integration and statistical simulation. In this\u0000paper, two modified methods are presented by incorporating two variance\u0000reduction techniques commonly used in Monte Carlo simulation, namely antithetic\u0000sampling and Latin hypercube sampling, into the process of sampling importance\u0000resampling method respectively. Theoretical evidence is provided to demonstrate\u0000that the proposed methods significantly reduce estimation errors compared to\u0000the original approach. Furthermore, the effectiveness and advantages of the\u0000proposed methods are validated through both numerical studies and real data\u0000analysis.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141255858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computational statistics has traditionally utilized double-precision (64-bit) data structures and full-precision operations, resulting in higher-than-necessary accuracy for certain applications. Recently, there has been a growing interest in exploring low-precision options that could reduce computational complexity while still achieving the required level of accuracy. This trend has been amplified by new hardware such as NVIDIA's Tensor Cores in their V100, A100, and H100 GPUs, which are optimized for mixed-precision computations, Intel CPUs with Deep Learning (DL) boost, Google Tensor Processing Units (TPUs), Field Programmable Gate Arrays (FPGAs), ARM CPUs, and others. However, using lower precision may introduce numerical instabilities and accuracy issues. Nevertheless, some applications have shown robustness to low-precision computations, leading to new multi- and mixed-precision algorithms that balance accuracy and computational cost. To address this need, we introduce MPCR, a novel R package that supports three different precision types (16-, 32-, and 64-bit) and their combinations, along with its usage in commonly-used Frequentist/Bayesian statistical examples. The MPCR package is written in C++ and integrated into R through the Rcpp package, enabling highly optimized operations in various precisions.
{"title":"MPCR: Multi- and Mixed-Precision Computations Package in R","authors":"Mary Lai O. Salvana, Sameh Abdulah, Minwoo Kim, David Helmy, Ying Sun, Marc G. Genton","doi":"arxiv-2406.02701","DOIUrl":"https://doi.org/arxiv-2406.02701","url":null,"abstract":"Computational statistics has traditionally utilized double-precision (64-bit)\u0000data structures and full-precision operations, resulting in\u0000higher-than-necessary accuracy for certain applications. Recently, there has\u0000been a growing interest in exploring low-precision options that could reduce\u0000computational complexity while still achieving the required level of accuracy.\u0000This trend has been amplified by new hardware such as NVIDIA's Tensor Cores in\u0000their V100, A100, and H100 GPUs, which are optimized for mixed-precision\u0000computations, Intel CPUs with Deep Learning (DL) boost, Google Tensor\u0000Processing Units (TPUs), Field Programmable Gate Arrays (FPGAs), ARM CPUs, and\u0000others. However, using lower precision may introduce numerical instabilities\u0000and accuracy issues. Nevertheless, some applications have shown robustness to\u0000low-precision computations, leading to new multi- and mixed-precision\u0000algorithms that balance accuracy and computational cost. To address this need,\u0000we introduce MPCR, a novel R package that supports three different precision\u0000types (16-, 32-, and 64-bit) and their combinations, along with its usage in\u0000commonly-used Frequentist/Bayesian statistical examples. The MPCR package is\u0000written in C++ and integrated into R through the pkg{Rcpp} package, enabling\u0000highly optimized operations in various precisions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}