Partitioning a data set by one or more of its attributes and computing an aggregate for each part is one of the most common operations in data analyses. There are use cases where the partitioning is determined dynamically by collapsing smaller subsets into larger ones, to ensure sufficient support for the computed aggregate. These use cases are not supported by software implementing split-apply-combine types of operations. This paper presents the R package accumulate that offers convenient interfaces for defining grouped aggregation where the grouping itself is dynamically determined, based on user-defined conditions on subsets, and a user-defined subset collapsing scheme. The formal underlying algorithm is described and analyzed as well.
{"title":"Split-Apply-Combine with Dynamic Grouping","authors":"Mark P. J. van der Loo","doi":"arxiv-2406.09887","DOIUrl":"https://doi.org/arxiv-2406.09887","url":null,"abstract":"Partitioning a data set by one or more of its attributes and computing an\u0000aggregate for each part is one of the most common operations in data analyses.\u0000There are use cases where the partitioning is determined dynamically by\u0000collapsing smaller subsets into larger ones, to ensure sufficient support for\u0000the computed aggregate. These use cases are not supported by software\u0000implementing split-apply-combine types of operations. This paper presents the\u0000texttt{R} package texttt{accumulate} that offers convenient interfaces for\u0000defining grouped aggregation where the grouping itself is dynamically\u0000determined, based on user-defined conditions on subsets, and a user-defined\u0000subset collapsing scheme. The formal underlying algorithm is described and\u0000analyzed as well.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent variable models are widely used in social and behavioural sciences, such as education, psychology, and political science. In recent years, high-dimensional latent variable models have become increasingly common for analysing large and complex data. Estimating high-dimensional latent variable models using marginal maximum likelihood is computationally demanding due to the complexity of integrals involved. To address this challenge, stochastic optimisation, which combines stochastic approximation and sampling techniques, has been shown to be effective. This method iterates between two steps -- (1) sampling the latent variables from their posterior distribution based on the current parameter estimate, and (2) updating the fixed parameters using an approximate stochastic gradient constructed from the latent variable samples. In this paper, we propose a computationally more efficient stochastic optimisation algorithm. This improvement is achieved through the use of a minibatch of observations when sampling latent variables and constructing stochastic gradients, and an unadjusted Langevin sampler that utilises the gradient of the negative complete-data log-likelihood to sample latent variables. Theoretical results are established for the proposed algorithm, showing that the iterative parameter update converges to the marginal maximum likelihood estimate as the number of iterations goes to infinity. Furthermore, the proposed algorithm is shown to scale well to high-dimensional settings through simulation studies and a personality test application with 30,000 respondents, 300 items, and 30 latent dimensions.
{"title":"Learning High-dimensional Latent Variable Models via Doubly Stochastic Optimisation by Unadjusted Langevin","authors":"Motonori Oka, Yunxiao Chen, Irini Mounstaki","doi":"arxiv-2406.09311","DOIUrl":"https://doi.org/arxiv-2406.09311","url":null,"abstract":"Latent variable models are widely used in social and behavioural sciences,\u0000such as education, psychology, and political science. In recent years,\u0000high-dimensional latent variable models have become increasingly common for\u0000analysing large and complex data. Estimating high-dimensional latent variable\u0000models using marginal maximum likelihood is computationally demanding due to\u0000the complexity of integrals involved. To address this challenge, stochastic\u0000optimisation, which combines stochastic approximation and sampling techniques,\u0000has been shown to be effective. This method iterates between two steps -- (1)\u0000sampling the latent variables from their posterior distribution based on the\u0000current parameter estimate, and (2) updating the fixed parameters using an\u0000approximate stochastic gradient constructed from the latent variable samples.\u0000In this paper, we propose a computationally more efficient stochastic\u0000optimisation algorithm. This improvement is achieved through the use of a\u0000minibatch of observations when sampling latent variables and constructing\u0000stochastic gradients, and an unadjusted Langevin sampler that utilises the\u0000gradient of the negative complete-data log-likelihood to sample latent\u0000variables. Theoretical results are established for the proposed algorithm,\u0000showing that the iterative parameter update converges to the marginal maximum\u0000likelihood estimate as the number of iterations goes to infinity. Furthermore,\u0000the proposed algorithm is shown to scale well to high-dimensional settings\u0000through simulation studies and a personality test application with 30,000\u0000respondents, 300 items, and 30 latent dimensions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In two-sided marketplaces such as online flea markets, recommender systems for providing consumers with personalized item rankings play a key role in promoting transactions between providers and consumers. Meanwhile, two-sided marketplaces face the problem of balancing consumer satisfaction and fairness among items to stimulate activity of item providers. Saito and Joachims (2022) devised an impact-based fair ranking method for maximizing the Nash social welfare based on fair division; however, this method, which requires solving a large-scale constrained nonlinear optimization problem, is very difficult to apply to practical-scale recommender systems. We thus propose a fast solution to the impact-based fair ranking problem. We first transform the fair ranking problem into an unconstrained optimization problem and then design a gradient ascent method that repeatedly executes the Sinkhorn algorithm. Experimental results demonstrate that our algorithm provides fair rankings of high quality and is about 1000 times faster than applying commercial optimization software.
{"title":"Fast solution to the fair ranking problem using the Sinkhorn algorithm","authors":"Yuki Uehara, Shunnosuke Ikeda, Naoki Nishimura, Koya Ohashi, Yilin Li, Jie Yang, Deddy Jobson, Xingxia Zha, Takeshi Matsumoto, Noriyoshi Sukegawa, Yuichi Takano","doi":"arxiv-2406.10262","DOIUrl":"https://doi.org/arxiv-2406.10262","url":null,"abstract":"In two-sided marketplaces such as online flea markets, recommender systems\u0000for providing consumers with personalized item rankings play a key role in\u0000promoting transactions between providers and consumers. Meanwhile, two-sided\u0000marketplaces face the problem of balancing consumer satisfaction and fairness\u0000among items to stimulate activity of item providers. Saito and Joachims (2022)\u0000devised an impact-based fair ranking method for maximizing the Nash social\u0000welfare based on fair division; however, this method, which requires solving a\u0000large-scale constrained nonlinear optimization problem, is very difficult to\u0000apply to practical-scale recommender systems. We thus propose a fast solution\u0000to the impact-based fair ranking problem. We first transform the fair ranking\u0000problem into an unconstrained optimization problem and then design a gradient\u0000ascent method that repeatedly executes the Sinkhorn algorithm. Experimental\u0000results demonstrate that our algorithm provides fair rankings of high quality\u0000and is about 1000 times faster than application of commercial optimization\u0000software.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-parametric two-sample tests based on energy distance or maximum mean discrepancy are widely used statistical tests for comparing multivariate data from two populations. While these tests enjoy desirable statistical properties, their test statistics can be expensive to compute, as they require three distinct Euclidean distance (or kernel) matrices between samples, where the time complexity of each of these computations (namely, $O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically with the number of samples ($n_x$, $n_y$) and linearly with the number of variables ($p$). Since the standard permutation test requires repeated re-computation of these expensive statistics, its application to large datasets can become infeasible. While several statistical approaches have been proposed to mitigate this issue, they all sacrifice desirable statistical properties to decrease the computational cost (e.g., they trade statistical power for computation speed). A better computational strategy is to first pre-compute the Euclidean distance (kernel) matrix of the concatenated data, and then permute indexes and retrieve the corresponding elements to compute the re-sampled statistics. While this strategy can reduce the computation cost relative to the standard permutation test, it relies on the computation of a larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$. In this paper, we present a novel computationally efficient permutation algorithm which only requires the pre-computation of the three smaller matrices and achieves large computational speedups without sacrificing finite-sample validity or statistical power. We illustrate its computational gains in a series of experiments and compare its statistical power to the current state-of-the-art approach for balancing computational cost and statistical performance.
{"title":"Computationally efficient permutation tests for the multivariate two-sample problem based on energy distance or maximum mean discrepancy statistics","authors":"Elias Chaibub Neto","doi":"arxiv-2406.06488","DOIUrl":"https://doi.org/arxiv-2406.06488","url":null,"abstract":"Non-parametric two-sample tests based on energy distance or maximum mean\u0000discrepancy are widely used statistical tests for comparing multivariate data\u0000from two populations. While these tests enjoy desirable statistical properties,\u0000their test statistics can be expensive to compute as they require the\u0000computation of 3 distinct Euclidean distance (or kernel) matrices between\u0000samples, where the time complexity of each of these computations (namely,\u0000$O(n_{x}^2 p)$, $O(n_{y}^2 p)$, and $O(n_{x} n_{y} p)$) scales quadratically\u0000with the number of samples ($n_x$, $n_y$) and linearly with the number of\u0000variables ($p$). Since the standard permutation test requires repeated\u0000re-computations of these expensive statistics it's application to large\u0000datasets can become unfeasible. While several statistical approaches have been\u0000proposed to mitigate this issue, they all sacrifice desirable statistical\u0000properties to decrease the computational cost (e.g., trade computation speed by\u0000a decrease in statistical power). A better computational strategy is to first\u0000pre-compute the Euclidean distance (kernel) matrix of the concatenated data,\u0000and then permute indexes and retrieve the corresponding elements to compute the\u0000re-sampled statistics. While this strategy can reduce the computation cost\u0000relative to the standard permutation test, it relies on the computation of a\u0000larger Euclidean distance (kernel) matrix with complexity $O((n_x + n_y)^2 p)$.\u0000In this paper, we present a novel computationally efficient permutation\u0000algorithm which only requires the pre-computation of the 3 smaller matrices and\u0000achieves large computational speedups without sacrificing finite-sample\u0000validity or statistical power. We illustrate its computational gains in a\u0000series of experiments and compare its statistical power to the current\u0000state-of-the-art approach for balancing computational cost and statistical\u0000performance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking score prediction and set it as an initial benchmark for our dataset. Our dataset and code are publicly available to support the development of novel ML-based methods for molecular docking to advance scientific research in this field.
{"title":"Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking","authors":"Thomas Le Menestrel, Manuel Rivas","doi":"arxiv-2406.05738","DOIUrl":"https://doi.org/arxiv-2406.05738","url":null,"abstract":"Docking is a crucial component in drug discovery aimed at predicting the\u0000binding conformation and affinity between small molecules and target proteins.\u0000ML-based docking has recently emerged as a prominent approach, outpacing\u0000traditional methods like DOCK and AutoDock Vina in handling the growing scale\u0000and complexity of molecular libraries. However, the availability of\u0000comprehensive and user-friendly datasets for training and benchmarking ML-based\u0000docking algorithms remains limited. We introduce Smiles2Dock, an open\u0000large-scale multi-task dataset for molecular docking. We created a framework\u0000combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL\u0000database against 15 AlphaFold proteins, giving us more than 25 million\u0000protein-ligand binding scores. The dataset leverages a wide range of\u0000high-accuracy AlphaFold protein models, encompasses a diverse set of\u0000biologically relevant compounds and enables researchers to benchmark all major\u0000approaches for ML-based docking such as Graph, Transformer and CNN-based\u0000methods. We also introduce a novel Transformer-based architecture for docking\u0000scores prediction and set it as an initial benchmark for our dataset. Our\u0000dataset and code are publicly available to support the development of novel\u0000ML-based methods for molecular docking to advance scientific research in this\u0000field.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To obtain high-resolution images of subsurface structures from seismic data, seismic imaging techniques such as Full Waveform Inversion (FWI) serve as crucial tools. However, FWI involves solving a nonlinear and often non-unique inverse problem, presenting challenges such as local minima trapping and inadequate handling of inherent uncertainties. In addressing these challenges, we propose leveraging deep generative models as the prior distribution of geophysical parameters for stochastic Bayesian inversion. This approach integrates the adjoint state gradient for efficient back-propagation from the numerical solution of partial differential equations. Additionally, we introduce explicit and implicit variational Bayesian inference methods. The explicit method computes variational distribution density using a normalizing flow-based neural network, enabling computation of the Bayesian posterior of parameters. Conversely, the implicit method employs an inference network attached to a pretrained generative model to estimate density, incorporating an entropy estimator. We also experimented with the Stein Variational Gradient Descent (SVGD) method, a particle-based variational inference technique. We compare these variational Bayesian inference methods with conventional Markov chain Monte Carlo (MCMC) sampling. Each method is able to quantify uncertainties and to generate seismic data-conditioned realizations of subsurface geophysical parameters. This framework provides insights into subsurface structures while accounting for inherent uncertainties.
{"title":"Stochastic full waveform inversion with deep generative prior for uncertainty quantification","authors":"Yuke Xie, Hervé Chauris, Nicolas Desassis","doi":"arxiv-2406.04859","DOIUrl":"https://doi.org/arxiv-2406.04859","url":null,"abstract":"To obtain high-resolution images of subsurface structures from seismic data,\u0000seismic imaging techniques such as Full Waveform Inversion (FWI) serve as\u0000crucial tools. However, FWI involves solving a nonlinear and often non-unique\u0000inverse problem, presenting challenges such as local minima trapping and\u0000inadequate handling of inherent uncertainties. In addressing these challenges,\u0000we propose leveraging deep generative models as the prior distribution of\u0000geophysical parameters for stochastic Bayesian inversion. This approach\u0000integrates the adjoint state gradient for efficient back-propagation from the\u0000numerical solution of partial differential equations. Additionally, we\u0000introduce explicit and implicit variational Bayesian inference methods. The\u0000explicit method computes variational distribution density using a normalizing\u0000flow-based neural network, enabling computation of the Bayesian posterior of\u0000parameters. Conversely, the implicit method employs an inference network\u0000attached to a pretrained generative model to estimate density, incorporating an\u0000entropy estimator. Furthermore, we also experimented with the Stein Variational\u0000Gradient Descent (SVGD) method as another variational inference technique,\u0000using particles. We compare these variational Bayesian inference methods with\u0000conventional Markov chain Monte Carlo (McMC) sampling. Each method is able to\u0000quantify uncertainties and to generate seismic data-conditioned realizations of\u0000subsurface geophysical parameters. This framework provides insights into\u0000subsurface structures while accounting for inherent uncertainties.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we provide a multiscale perspective on the problem of maximum marginal likelihood estimation. We consider and analyse a diffusion-based maximum marginal likelihood estimation scheme using ideas from multiscale dynamics. Our perspective is based on stochastic averaging; we make an explicit connection between ideas in applied probability and parameter inference in computational statistics. In particular, we consider a general class of coupled Langevin diffusions for joint inference of latent variables and parameters in statistical models, where the latent variables are sampled from a fast Langevin process (which acts as a sampler), and the parameters are updated using a slow Langevin process (which acts as an optimiser). We show that the resulting system of stochastic differential equations (SDEs) can be viewed as a two-time scale system. To demonstrate the utility of such a perspective, we show that the averaged parameter dynamics obtained in the limit of scale separation can be used to estimate the optimal parameter, within the strongly convex setting. We do this by using recent uniform-in-time non-asymptotic averaging bounds. Finally, we conclude by showing that the slow-fast algorithm we consider here, termed Slow-Fast Langevin Algorithm, performs on par with state-of-the-art methods on a variety of examples. We believe that the stochastic averaging approach we provide in this paper enables us to look at these algorithms from a fresh angle, as well as unlocking the path to develop and analyse new methods using well-established averaging principles.
{"title":"A Multiscale Perspective on Maximum Marginal Likelihood Estimation","authors":"O. Deniz Akyildiz, Iain Souttar, Michela Ottobre","doi":"arxiv-2406.04187","DOIUrl":"https://doi.org/arxiv-2406.04187","url":null,"abstract":"In this paper, we provide a multiscale perspective on the problem of maximum\u0000marginal likelihood estimation. We consider and analyse a diffusion-based\u0000maximum marginal likelihood estimation scheme using ideas from multiscale\u0000dynamics. Our perspective is based on stochastic averaging; we make an explicit\u0000connection between ideas in applied probability and parameter inference in\u0000computational statistics. In particular, we consider a general class of coupled\u0000Langevin diffusions for joint inference of latent variables and parameters in\u0000statistical models, where the latent variables are sampled from a fast Langevin\u0000process (which acts as a sampler), and the parameters are updated using a slow\u0000Langevin process (which acts as an optimiser). We show that the resulting\u0000system of stochastic differential equations (SDEs) can be viewed as a two-time\u0000scale system. To demonstrate the utility of such a perspective, we show that\u0000the averaged parameter dynamics obtained in the limit of scale separation can\u0000be used to estimate the optimal parameter, within the strongly convex setting.\u0000We do this by using recent uniform-in-time non-asymptotic averaging bounds.\u0000Finally, we conclude by showing that the slow-fast algorithm we consider here,\u0000termed Slow-Fast Langevin Algorithm, performs on par with state-of-the-art\u0000methods on a variety of examples. We believe that the stochastic averaging\u0000approach we provide in this paper enables us to look at these algorithms from a\u0000fresh angle, as well as unlocking the path to develop and analyse new methods\u0000using well-established averaging principles.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141531892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a central problem in data analysis in collider physics. This intricate task involves estimating the latent structure of a jet (binary tree) and parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.
{"title":"Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics","authors":"Hanming Yang, Antonio Khalil Moretti, Sebastian Macaluso, Philippe Chlenski, Christian A. Naesseth, Itsik Pe'er","doi":"arxiv-2406.03242","DOIUrl":"https://doi.org/arxiv-2406.03242","url":null,"abstract":"Reconstructing jets, which provide vital insights into the properties and\u0000histories of subatomic particles produced in high-energy collisions, is a main\u0000problem in data analyses in collider physics. This intricate task deals with\u0000estimating the latent structure of a jet (binary tree) and involves parameters\u0000such as particle energy, momentum, and types. While Bayesian methods offer a\u0000natural approach for handling uncertainty and leveraging prior knowledge, they\u0000face significant challenges due to the super-exponential growth of potential\u0000jet topologies as the number of observed particles increases. To address this,\u0000we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet\u0000latent structures. As a second contribution, we leverage the resulting\u0000estimator to develop a variational inference algorithm for parameter learning.\u0000Building on this, we introduce a variational family using a pseudo-marginal\u0000framework for a fully Bayesian treatment of all variables, unifying the\u0000generative model with the inference process. We illustrate our method's\u0000effectiveness through experiments using data generated with a collider physics\u0000generative model, highlighting superior speed and accuracy across a range of\u0000tasks.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The sampling importance resampling method is widely utilized in various fields, such as numerical integration and statistical simulation. In this paper, two modified methods are presented by incorporating two variance reduction techniques commonly used in Monte Carlo simulation, namely antithetic sampling and Latin hypercube sampling, into the sampling importance resampling procedure. Theoretical evidence is provided to demonstrate that the proposed methods significantly reduce estimation errors compared to the original approach. Furthermore, the effectiveness and advantages of the proposed methods are validated through both numerical studies and real data analysis.
{"title":"Variance-reduced sampling importance resampling","authors":"Yao Xiao, Kang Fu, Kun Li","doi":"arxiv-2406.01864","DOIUrl":"https://doi.org/arxiv-2406.01864","url":null,"abstract":"The sampling importance resampling method is widely utilized in various\u0000fields, such as numerical integration and statistical simulation. In this\u0000paper, two modified methods are presented by incorporating two variance\u0000reduction techniques commonly used in Monte Carlo simulation, namely antithetic\u0000sampling and Latin hypercube sampling, into the process of sampling importance\u0000resampling method respectively. Theoretical evidence is provided to demonstrate\u0000that the proposed methods significantly reduce estimation errors compared to\u0000the original approach. Furthermore, the effectiveness and advantages of the\u0000proposed methods are validated through both numerical studies and real data\u0000analysis.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141255858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computational statistics has traditionally utilized double-precision (64-bit) data structures and full-precision operations, resulting in higher-than-necessary accuracy for certain applications. Recently, there has been a growing interest in exploring low-precision options that could reduce computational complexity while still achieving the required level of accuracy. This trend has been amplified by new hardware such as NVIDIA's Tensor Cores in their V100, A100, and H100 GPUs, which are optimized for mixed-precision computations, Intel CPUs with Deep Learning (DL) boost, Google Tensor Processing Units (TPUs), Field Programmable Gate Arrays (FPGAs), ARM CPUs, and others. However, using lower precision may introduce numerical instabilities and accuracy issues. Nevertheless, some applications have shown robustness to low-precision computations, leading to new multi- and mixed-precision algorithms that balance accuracy and computational cost. To address this need, we introduce MPCR, a novel R package that supports three different precision types (16-, 32-, and 64-bit) and their combinations, along with its usage in commonly-used Frequentist/Bayesian statistical examples. The MPCR package is written in C++ and integrated into R through the Rcpp package, enabling highly optimized operations in various precisions.
{"title":"MPCR: Multi- and Mixed-Precision Computations Package in R","authors":"Mary Lai O. Salvana, Sameh Abdulah, Minwoo Kim, David Helmy, Ying Sun, Marc G. Genton","doi":"arxiv-2406.02701","DOIUrl":"https://doi.org/arxiv-2406.02701","url":null,"abstract":"Computational statistics has traditionally utilized double-precision (64-bit)\u0000data structures and full-precision operations, resulting in\u0000higher-than-necessary accuracy for certain applications. Recently, there has\u0000been a growing interest in exploring low-precision options that could reduce\u0000computational complexity while still achieving the required level of accuracy.\u0000This trend has been amplified by new hardware such as NVIDIA's Tensor Cores in\u0000their V100, A100, and H100 GPUs, which are optimized for mixed-precision\u0000computations, Intel CPUs with Deep Learning (DL) boost, Google Tensor\u0000Processing Units (TPUs), Field Programmable Gate Arrays (FPGAs), ARM CPUs, and\u0000others. However, using lower precision may introduce numerical instabilities\u0000and accuracy issues. Nevertheless, some applications have shown robustness to\u0000low-precision computations, leading to new multi- and mixed-precision\u0000algorithms that balance accuracy and computational cost. To address this need,\u0000we introduce MPCR, a novel R package that supports three different precision\u0000types (16-, 32-, and 64-bit) and their combinations, along with its usage in\u0000commonly-used Frequentist/Bayesian statistical examples. The MPCR package is\u0000written in C++ and integrated into R through the pkg{Rcpp} package, enabling\u0000highly optimized operations in various precisions.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141523196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}