
arXiv - STAT - Machine Learning: Latest Publications

A clustering adaptive Gaussian process regression method: response patterns based real-time prediction for nonlinear solid mechanics problems
Pub Date: 2024-09-15 DOI: arxiv-2409.10572
Ming-Jian Li, Yanping Lian, Zhanshan Cheng, Lehui Li, Zhidong Wang, Ruxin Gao, Daining Fang
Numerical simulation is a powerful tool for studying nonlinear solid mechanics problems. However, mesh-based or particle-based numerical methods suffer from the common shortcoming of being time-consuming, particularly for complex problems with real-time analysis requirements. This study presents a clustering adaptive Gaussian process regression (CAG) method aiming at real-time prediction of nonlinear structural responses in solid mechanics. It is a data-driven machine learning method featuring a small sample size, high accuracy, and high efficiency, leveraging nonlinear structural response patterns. Like the traditional Gaussian process regression (GPR) method, it operates in offline and online stages. In the offline stage, an adaptive sample generation technique is introduced to cluster datasets into distinct patterns for demand-driven sample allocation. This ensures comprehensive coverage of the critical samples for the solution space of interest. In the online stage, following the divide-and-conquer strategy, a pre-prediction classification categorizes problems into predefined patterns sequentially predicted by the trained multi-pattern Gaussian process regressor. In addition, dimension reduction and restoration techniques are employed in the proposed method to enhance its efficiency. A set of problems involving material, geometric, and boundary condition nonlinearities is presented to demonstrate the CAG method's abilities. The proposed method can offer predictions within a second and attain high precision with only about 20 samples within the context of this study, outperforming the traditional GPR using uniformly distributed samples with error reductions ranging from 1 to 3 orders of magnitude. The CAG method is expected to offer a powerful tool for real-time prediction of nonlinear solid mechanics problems and shed light on complex nonlinear structural response patterns.
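A minimal sketch of the offline/online workflow described above, using scikit-learn stand-ins; the toy response function, clustering choice, and pattern classifier are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neighbors import KNeighborsClassifier

# Offline stage: cluster responses into patterns and train one GPR per pattern.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 2))                           # toy load parameters
y = np.where(X[:, 0] > 0, np.sin(3 * X[:, 0]), X[:, 0] ** 2)   # toy nonlinear response

patterns = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(y.reshape(-1, 1))
classifier = KNeighborsClassifier(n_neighbors=3).fit(X, patterns)
regressors = {int(k): GaussianProcessRegressor().fit(X[patterns == k], y[patterns == k])
              for k in np.unique(patterns)}

# Online stage: a pre-prediction classification routes each query to the
# Gaussian process regressor trained on its pattern.
x_new = np.array([[0.5, -0.2]])
p = int(classifier.predict(x_new)[0])
y_pred = regressors[p].predict(x_new)
```

The divide-and-conquer structure is what keeps each regressor's sample requirement small: every GPR only has to model one response pattern.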
Citations: 0
Consistent Spectral Clustering in Hyperbolic Spaces
Pub Date: 2024-09-14 DOI: arxiv-2409.09304
Sagar Ghosh, Swagatam Das
Clustering, as an unsupervised technique, plays a pivotal role in various data analysis applications. Among clustering algorithms, Spectral Clustering on Euclidean Spaces has been extensively studied. However, with the rapid evolution of data complexity, Euclidean Space is proving to be inefficient for representing and learning algorithms. Although Deep Neural Networks on hyperbolic spaces have gained recent traction, clustering algorithms or non-deep machine learning models on non-Euclidean Spaces remain underexplored. In this paper, we propose a spectral clustering algorithm on Hyperbolic Spaces to address this gap. Hyperbolic Spaces offer advantages in representing complex data structures like hierarchical and tree-like structures, which cannot be embedded efficiently in Euclidean Spaces. Our proposed algorithm replaces the Euclidean Similarity Matrix with an appropriate Hyperbolic Similarity Matrix, demonstrating improved efficiency compared to clustering in Euclidean Spaces. Our contributions include the development of the spectral clustering algorithm on Hyperbolic Spaces and the proof of its weak consistency. We show that our algorithm converges at least as fast as Spectral Clustering on Euclidean Spaces. To illustrate the efficacy of our approach, we present experimental results on the Wisconsin Breast Cancer Dataset, highlighting the superior performance of Hyperbolic Spectral Clustering over its Euclidean counterpart. This work opens up avenues for utilizing non-Euclidean Spaces in clustering algorithms, offering new perspectives for handling complex data structures and improving clustering efficiency.
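A hedged sketch of the core idea: compute pairwise geodesic distances in the Poincaré ball model, turn them into a hyperbolic similarity matrix, and feed it to standard spectral clustering. The Gaussian affinity and toy data are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_dist(u, v):
    # Geodesic distance in the Poincaré ball model of hyperbolic space.
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / denom)

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.4, 0.05, (20, 2)),    # two toy groups of points
                 rng.normal(-0.4, 0.05, (20, 2))])  # lying inside the unit ball

n = len(pts)
D = np.array([[poincare_dist(pts[i], pts[j]) for j in range(n)] for i in range(n)])
W = np.exp(-D)                                      # hyperbolic similarity matrix
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
```

Only the similarity matrix changes relative to Euclidean spectral clustering; the downstream eigendecomposition and k-means steps are untouched.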
Citations: 0
Active Learning to Guide Labeling Efforts for Question Difficulty Estimation
Pub Date: 2024-09-14 DOI: arxiv-2409.09258
Arthur Thuy, Ekaterina Loginova, Dries F. Benoit
In recent years, there has been a surge in research on Question Difficulty Estimation (QDE) using natural language processing techniques. Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods but with an isolated study in unsupervised learning. While supervised methods focus on predictive performance, they require abundant labeled data. On the other hand, unsupervised methods do not require labeled data but rely on a different evaluation metric that is also computationally expensive in practice. This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach striving to minimize the labeling efforts while matching the performance of state-of-the-art models. The active learning process iteratively trains on a labeled subset, acquiring labels from human experts only for the most informative unlabeled data points. Furthermore, we propose a novel acquisition function PowerVariance to add the most informative samples to the labeled set, a regression extension to the PowerBALD function popular in classification. We employ DistilBERT for QDE and identify informative samples by applying Monte Carlo dropout to capture epistemic uncertainty in unlabeled samples. The experiments demonstrate that active learning with PowerVariance acquisition achieves a performance close to fully supervised models after labeling only 10% of the training data. The proposed methodology promotes the responsible use of educational resources, makes QDE tools more accessible to course instructors, and is promising for other applications such as personalized support systems and question-answering tools.
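The acquisition step can be sketched as follows. PowerVariance itself is the paper's contribution, so the power/Gumbel sampling below is only a plausible reading of a PowerBALD-style acquisition applied to MC-dropout variance; the simulated predictions are placeholders for DistilBERT forward passes with dropout kept active:

```python
import numpy as np

rng = np.random.default_rng(2)

# T stochastic forward passes (MC dropout) over N unlabeled questions,
# simulated here: each column's spread stands in for epistemic uncertainty.
N, T, k = 100, 20, 10
spread = rng.uniform(0.01, 1.0, N)
preds = rng.normal(0.0, spread, size=(T, N))        # shape (passes, questions)

variance = preds.var(axis=0)                        # epistemic-uncertainty score

# Power acquisition: rather than a deterministic top-k, sample k questions with
# probability proportional to score**beta, via the Gumbel-top-k trick.
beta = 2.0
gumbel = rng.gumbel(size=N)
acquired = np.argsort(beta * np.log(variance) + gumbel)[-k:]
```

The stochastic top-k is what distinguishes the "power" family of acquisitions from plain greedy variance ranking, which tends to pick redundant near-duplicates in a batch.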
Citations: 0
BM$^2$: Coupled Schrödinger Bridge Matching
Pub Date: 2024-09-14 DOI: arxiv-2409.09376
Stefano Peluchetti
A Schrödinger bridge establishes a dynamic transport map between two target distributions via a reference process, simultaneously solving an associated entropic optimal transport problem. We consider the setting where samples from the target distributions are available, and the reference diffusion process admits tractable dynamics. We thus introduce Coupled Bridge Matching (BM$^2$), a simple non-iterative approach for learning Schrödinger bridges with neural networks. A preliminary theoretical analysis of the convergence properties of BM$^2$ is carried out, supported by numerical experiments that demonstrate the effectiveness of our proposal.
Citations: 0
Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder
Pub Date: 2024-09-14 DOI: arxiv-2409.09361
Seunghwan Kim, Seungkyu Lee
Variational autoencoder (VAE) is an established generative model but is notorious for its blurriness. In this work, we investigate the blurry output problem of VAE and resolve it, exploiting the variance of the Gaussian decoder and $\beta$ of beta-VAE. Specifically, we reveal that the indistinguishability of decoder variance and $\beta$ hinders appropriate analysis of the model by random likelihood value, and limits performance improvement by omitting the gain from $\beta$. To address the problem, we propose Beta-Sigma VAE (BS-VAE) that explicitly separates $\beta$ and decoder variance $\sigma^2_x$ in the model. Our method demonstrates not only superior performance in natural image synthesis but also controllable parameters and predictable analysis compared to conventional VAE. In our experimental evaluation, we employ the analysis of rate-distortion curves and proxy metrics on computer vision datasets. The code is available at https://github.com/overnap/BS-VAE
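The separation the abstract argues for can be written down directly for a Gaussian decoder. Below is a minimal numpy sketch of a beta-VAE-style objective with $\beta$ and $\sigma^2_x$ kept as two independent knobs; the toy tensors stand in for encoder/decoder outputs and the hyperparameter values are arbitrary:

```python
import numpy as np

def gaussian_nll(x, x_hat, sigma2):
    # Reconstruction term: negative log-likelihood of a Gaussian decoder
    # with an explicit, fixed variance sigma^2_x.
    return 0.5 * np.sum((x - x_hat) ** 2 / sigma2 + np.log(2 * np.pi * sigma2))

def kl_std_normal(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, I) ) in closed form.
    return -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))

rng = np.random.default_rng(3)
x, x_hat = rng.random(64), rng.random(64)                   # data, reconstruction
mu, logvar = rng.normal(0, 0.1, 8), rng.normal(0, 0.1, 8)   # encoder outputs

beta, sigma2_x = 0.5, 0.1                                   # two separate knobs
loss = gaussian_nll(x, x_hat, sigma2_x) + beta * kl_std_normal(mu, logvar)
```

When $\sigma^2_x$ is left implicit, rescaling it is indistinguishable from rescaling $\beta$, which is exactly the confounding the paper identifies.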
Citations: 0
Topological Tensor Eigenvalue Theorems in Data Fusion
Pub Date: 2024-09-14 DOI: arxiv-2409.09392
Ronald Katende
This paper introduces a novel framework for tensor eigenvalue analysis in the context of multi-modal data fusion, leveraging topological invariants such as Betti numbers. While traditional approaches to tensor eigenvalues rely on algebraic extensions of matrix theory, this work provides a topological perspective that enriches the understanding of tensor structures. By establishing new theorems linking eigenvalues to topological features, the proposed framework offers deeper insights into the latent structure of data, enhancing both interpretability and robustness. Applications to data fusion illustrate the theoretical and practical significance of the approach, demonstrating its potential for broad impact across machine learning and data science domains.
Citations: 0
Schrödinger Bridge Flow for Unpaired Data Translation
Pub Date: 2024-09-14 DOI: arxiv-2409.09347
Valentin De Bortoli, Iryna Korshunova, Andriy Mnih, Arnaud Doucet
Mass transport problems arise in many areas of machine learning whereby one wants to compute a map transporting one distribution to another. Generative modeling techniques like Generative Adversarial Networks (GANs) and Denoising Diffusion Models (DDMs) have been successfully adapted to solve such transport problems, resulting in CycleGAN and Bridge Matching respectively. However, these methods do not approximate Optimal Transport (OT) maps, which are known to have desirable properties. Existing techniques approximating OT maps for high-dimensional data-rich problems, such as DDM-based Rectified Flow and Schrödinger Bridge procedures, require fully training a DDM-type model at each iteration, or use mini-batch techniques which can introduce significant errors. We propose a novel algorithm to compute the Schrödinger Bridge, a dynamic entropy-regularised version of OT, that eliminates the need to train multiple DDM-like models. This algorithm corresponds to a discretisation of a flow of path measures, which we call the Schrödinger Bridge Flow, whose only stationary point is the Schrödinger Bridge. We demonstrate the performance of our algorithm on a variety of unpaired data translation tasks.
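For context, the static entropy-regularised OT problem that the Schrödinger bridge dynamically extends can be solved with Sinkhorn iterations. This is a minimal sketch on two toy histograms (the regularisation strength and iteration count are arbitrary choices, and this is not the paper's algorithm):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, iters=500):
    # Entropic-regularised optimal transport between histograms a and b with
    # cost matrix C: alternately rescale rows and columns of exp(-C/eps).
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]              # transport plan

n = 5
a = np.full(n, 1.0 / n)                             # source histogram
b = np.full(n, 1.0 / n)                             # target histogram
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :]) ** 2                  # squared-distance cost
P = sinkhorn(a, b, C)
```

The returned plan's rows and columns match the prescribed marginals, which is the static analogue of the bridge's endpoint constraints.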
Citations: 0
Distributed Clustering based on Distributional Kernel
Pub Date: 2024-09-14 DOI: arxiv-2409.09418
Hang Zhang, Yang Xu, Lei Gong, Ye Zhu, Kai Ming Ting
This paper introduces a new framework for clustering in a distributed network called Distributed Clustering based on Distributional Kernel (K), or KDC, that produces the final clusters based on the similarity with respect to the distributions of initial clusters, as measured by K. It is the only framework that satisfies all three of the following properties. First, KDC guarantees that the combined clustering outcome from all sites is equivalent to the clustering outcome of its centralized counterpart on the combined dataset from all sites. Second, the maximum runtime cost of any site in distributed mode is smaller than the runtime cost in centralized mode. Third, it is designed to discover clusters of arbitrary shapes, sizes and densities. To the best of our knowledge, this is the first distributed clustering framework that employs a distributional kernel. The distribution-based clustering leads directly to significantly better clustering outcomes than existing methods of distributed clustering. In addition, we introduce a new clustering algorithm called Kernel Bounded Cluster Cores, which is the best clustering algorithm applied to KDC among existing clustering algorithms. We also show that KDC is a generic framework that enables a quadratic time clustering algorithm to deal with large datasets that would otherwise be impossible.
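The generic idea of comparing initial clusters by their distributions can be sketched with a kernel mean embedding; the RBF kernel and toy data below are illustrative substitutes, not the paper's specific distributional kernel K:

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def distributional_similarity(A, B, gamma=1.0):
    # Similarity between two sample sets via their kernel mean embeddings:
    # K(P, Q) is approximated by the mean of k(a, b) over all pairs (a, b).
    return float(np.mean([[rbf(a, b, gamma) for b in B] for a in A]))

rng = np.random.default_rng(4)
P = rng.normal(0.0, 0.3, (30, 2))                   # initial cluster from site 1
Q = rng.normal(0.1, 0.3, (30, 2))                   # nearby cluster from site 2
R = rng.normal(3.0, 0.3, (30, 2))                   # distant cluster

close = distributional_similarity(P, Q)
far = distributional_similarity(P, R)
# Final clusters would merge initial clusters whose distributions are similar.
```

Because each site only needs to ship a compact summary of its initial clusters rather than raw points, this style of comparison is what makes a distributed equivalence to centralized clustering plausible.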
Citations: 0
Model-independent variable selection via the rule-based variable priority
Pub Date : 2024-09-13 DOI: arxiv-2409.09003
Min Lu, Hemant Ishwaran
While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show the method achieves a balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.
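The permutation-importance baseline that the abstract contrasts VarPro with can be sketched in a few lines: permute one column, remeasure the prediction error, and score the feature by the increase. This is a generic illustration on synthetic data with ordinary least squares as the predictor, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)  # only feature 0 matters

# Fit ordinary least squares as the base predictor.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda M: M @ beta

def permutation_importance(X, y, predict, n_repeats=10, seed=1):
    """Importance of feature j = mean increase in MSE after permuting column j."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            imp[j] += np.mean((y - predict(Xp)) ** 2) - base_mse
    return imp / n_repeats

imp = permutation_importance(X, y, predict)
```

Note how the permuted column is exactly the "artificial data" the abstract objects to: the shuffled rows form joint feature combinations that never occur in the real distribution.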
{"title":"Model-independent variable selection via the rule-based variable priorit","authors":"Min Lu, Hemant Ishwaran","doi":"arxiv-2409.09003","DOIUrl":"https://doi.org/arxiv-2409.09003","url":null,"abstract":"While achieving high prediction accuracy is a fundamental goal in machine\u0000learning, an equally important task is finding a small number of features with\u0000high explanatory power. One popular selection technique is permutation\u0000importance, which assesses a variable's impact by measuring the change in\u0000prediction error after permuting the variable. However, this can be problematic\u0000due to the need to create artificial data, a problem shared by other methods as\u0000well. Another problem is that variable selection methods can be limited by\u0000being model-specific. We introduce a new model-independent approach, Variable\u0000Priority (VarPro), which works by utilizing rules without the need to generate\u0000artificial data or evaluate prediction error. The method is relatively easy to\u0000use, requiring only the calculation of sample averages of simple statistics,\u0000and can be applied to many data settings, including regression, classification,\u0000and survival. We investigate the asymptotic properties of VarPro and show,\u0000among other things, that VarPro has a consistent filtering property for noise\u0000variables. 
Empirical studies using synthetic and real-world data show the\u0000method achieves a balanced performance and compares favorably to many\u0000state-of-the-art procedures currently used for variable selection.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features
Pub Date : 2024-09-13 DOI: arxiv-2409.09199
Rowan Swiers, Subash Prabanantham, Andrew Maher
Multi-armed Bandits (MABs) are increasingly employed in online platforms and e-commerce to optimize decision making for personalized user experiences. In this work, we focus on the Contextual Bandit problem with linear rewards, under conditions of sparsity and batched data. We address the challenge of fairness by excluding irrelevant features from decision-making processes using a novel algorithm, Online Batched Sequential Inclusion (OBSI), which sequentially includes features as confidence in their impact on the reward increases. Our experiments on synthetic data show the superior performance of OBSI compared to other algorithms in terms of regret, relevance of features used, and compute.
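The sequential-inclusion idea can be illustrated with a toy batched loop: after each batch, refit a regression of reward on context and admit a feature once confidence in its effect is high. This is not the paper's OBSI algorithm — the ridge fit, the z-score rule, and every threshold below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                      # total features; only the first two carry signal
theta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])

active = []                # features admitted so far
lam, z = 1.0, 3.0          # ridge penalty and inclusion threshold (z-score)
X_seen = np.empty((0, d))
r_seen = np.empty(0)

for batch in range(20):
    X = rng.normal(size=(50, d))                     # contexts for this batch
    r = X @ theta + rng.normal(scale=0.5, size=50)   # observed rewards
    X_seen = np.vstack([X_seen, X])
    r_seen = np.concatenate([r_seen, r])

    # Refit ridge on all data; admit a feature once its coefficient is
    # clearly nonzero (a crude stand-in for "confidence in its impact").
    A = X_seen.T @ X_seen + lam * np.eye(d)
    beta = np.linalg.solve(A, X_seen.T @ r_seen)
    resid = r_seen - X_seen @ beta
    sigma2 = resid @ resid / max(len(r_seen) - d, 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A)))
    for j in range(d):
        if j not in active and abs(beta[j]) / se[j] > z:
            active.append(j)
```

In this setup the two signal features clear the threshold within the first batch, while noise features are kept out of decision-making unless sampling noise happens to push them over the bar.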
{"title":"Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features","authors":"Rowan Swiers, Subash Prabanantham, Andrew Maher","doi":"arxiv-2409.09199","DOIUrl":"https://doi.org/arxiv-2409.09199","url":null,"abstract":"Multi-armed Bandits (MABs) are increasingly employed in online platforms and\u0000e-commerce to optimize decision making for personalized user experiences. In\u0000this work, we focus on the Contextual Bandit problem with linear rewards, under\u0000conditions of sparsity and batched data. We address the challenge of fairness\u0000by excluding irrelevant features from decision-making processes using a novel\u0000algorithm, Online Batched Sequential Inclusion (OBSI), which sequentially\u0000includes features as confidence in their impact on the reward increases. Our\u0000experiments on synthetic data show the superior performance of OBSI compared to\u0000other algorithms in terms of regret, relevance of features used, and compute.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"118 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0