A clustering adaptive Gaussian process regression method: response patterns based real-time prediction for nonlinear solid mechanics problems
Ming-Jian Li, Yanping Lian, Zhanshan Cheng, Lehui Li, Zhidong Wang, Ruxin Gao, Daining Fang
arXiv:2409.10572 (2024-09-15)

Numerical simulation is a powerful tool for studying nonlinear solid mechanics problems. However, mesh-based and particle-based numerical methods share the shortcoming of being time-consuming, particularly for complex problems with real-time analysis requirements. This study presents a clustering adaptive Gaussian process regression (CAG) method aimed at real-time prediction of nonlinear structural responses in solid mechanics. It is a data-driven machine learning method featuring a small sample size, high accuracy, and high efficiency, leveraging nonlinear structural response patterns. Like the traditional Gaussian process regression (GPR) method, it operates in offline and online stages. In the offline stage, an adaptive sample generation technique clusters datasets into distinct patterns for demand-driven sample allocation, ensuring comprehensive coverage of the critical samples in the solution space of interest. In the online stage, following a divide-and-conquer strategy, a pre-prediction classification step assigns problems to the predefined patterns, which are then predicted by the trained multi-pattern Gaussian process regressor. In addition, dimension reduction and restoration techniques are employed to enhance efficiency. A set of problems involving material, geometric, and boundary condition nonlinearities demonstrates the CAG method's abilities. The proposed method delivers predictions within a second and attains high precision with only about 20 samples in the context of this study, outperforming traditional GPR with uniformly distributed samples by error reductions of one to three orders of magnitude. The CAG method is expected to offer a powerful tool for real-time prediction of nonlinear solid mechanics problems and to shed light on complex nonlinear structural response patterns.
Consistent Spectral Clustering in Hyperbolic Spaces
Sagar Ghosh, Swagatam Das
arXiv:2409.09304 (2024-09-14)

Clustering, as an unsupervised technique, plays a pivotal role in various data analysis applications. Among clustering algorithms, spectral clustering on Euclidean spaces has been extensively studied. However, with the rapid growth of data complexity, Euclidean space is proving inefficient both for representation and for learning algorithms. Although deep neural networks on hyperbolic spaces have gained recent traction, clustering algorithms and non-deep machine learning models on non-Euclidean spaces remain underexplored. In this paper, we propose a spectral clustering algorithm on hyperbolic spaces to address this gap. Hyperbolic spaces offer advantages in representing complex data structures, such as hierarchical and tree-like structures, that cannot be embedded efficiently in Euclidean spaces. Our proposed algorithm replaces the Euclidean similarity matrix with an appropriate hyperbolic similarity matrix, demonstrating improved efficiency compared to clustering in Euclidean spaces. Our contributions include the development of the spectral clustering algorithm on hyperbolic spaces and a proof of its weak consistency. We show that our algorithm converges at least as fast as spectral clustering on Euclidean spaces. To illustrate the efficacy of our approach, we present experimental results on the Wisconsin Breast Cancer Dataset, highlighting the superior performance of hyperbolic spectral clustering over its Euclidean counterpart. This work opens up avenues for utilizing non-Euclidean spaces in clustering algorithms, offering new perspectives for handling complex data structures and improving clustering efficiency.
Active Learning to Guide Labeling Efforts for Question Difficulty Estimation
Arthur Thuy, Ekaterina Loginova, Dries F. Benoit
arXiv:2409.09258 (2024-09-14)

In recent years, there has been a surge in research on Question Difficulty Estimation (QDE) using natural language processing techniques. Transformer-based neural networks achieve state-of-the-art performance, primarily through supervised methods, with only an isolated study in unsupervised learning. While supervised methods focus on predictive performance, they require abundant labeled data. Unsupervised methods, on the other hand, do not require labeled data but rely on a different evaluation metric that is also computationally expensive in practice. This work bridges the research gap by exploring active learning for QDE, a supervised human-in-the-loop approach that strives to minimize labeling effort while matching the performance of state-of-the-art models. The active learning process iteratively trains on a labeled subset, acquiring labels from human experts only for the most informative unlabeled data points. Furthermore, we propose a novel acquisition function, PowerVariance, to add the most informative samples to the labeled set, a regression extension of the PowerBALD function popular in classification. We employ DistilBERT for QDE and identify informative samples by applying Monte Carlo dropout to capture epistemic uncertainty in unlabeled samples. The experiments demonstrate that active learning with PowerVariance acquisition achieves performance close to fully supervised models after labeling only 10% of the training data. The proposed methodology promotes the responsible use of educational resources, makes QDE tools more accessible to course instructors, and is promising for other applications such as personalized support systems and question-answering tools.
A Schr"{o}dinger bridge establishes a dynamic transport map between two target distributions via a reference process, simultaneously solving an associated entropic optimal transport problem. We consider the setting where samples from the target distributions are available, and the reference diffusion process admits tractable dynamics. We thus introduce Coupled Bridge Matching (BM$^2$), a simple emph{non-iterative} approach for learning Schr"{o}dinger bridges with neural networks. A preliminary theoretical analysis of the convergence properties of BM$^2$ is carried out, supported by numerical experiments that demonstrate the effectiveness of our proposal.
{"title":"BM$^2$: Coupled Schrödinger Bridge Matching","authors":"Stefano Peluchetti","doi":"arxiv-2409.09376","DOIUrl":"https://doi.org/arxiv-2409.09376","url":null,"abstract":"A Schr\"{o}dinger bridge establishes a dynamic transport map between two\u0000target distributions via a reference process, simultaneously solving an\u0000associated entropic optimal transport problem. We consider the setting where\u0000samples from the target distributions are available, and the reference\u0000diffusion process admits tractable dynamics. We thus introduce Coupled Bridge\u0000Matching (BM$^2$), a simple emph{non-iterative} approach for learning\u0000Schr\"{o}dinger bridges with neural networks. A preliminary theoretical\u0000analysis of the convergence properties of BM$^2$ is carried out, supported by\u0000numerical experiments that demonstrate the effectiveness of our proposal.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142261746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beta-Sigma VAE: Separating beta and decoder variance in Gaussian variational autoencoder
Seunghwan Kim, Seungkyu Lee
arXiv:2409.09361 (2024-09-14)

The variational autoencoder (VAE) is an established generative model but is notorious for its blurriness. In this work, we investigate the blurry-output problem of the VAE and resolve it by exploiting the variance of the Gaussian decoder and the $\beta$ of the beta-VAE. Specifically, we reveal that the indistinguishability of the decoder variance and $\beta$ hinders appropriate analysis of the model via its likelihood values and limits performance improvement by forgoing the gain from $\beta$. To address the problem, we propose the Beta-Sigma VAE (BS-VAE), which explicitly separates $\beta$ and the decoder variance $\sigma^2_x$ in the model. Compared to the conventional VAE, our method demonstrates not only superior performance in natural image synthesis but also controllable parameters and predictable analysis. In our experimental evaluation, we employ rate-distortion curve analysis and proxy metrics on computer vision datasets. The code is available at https://github.com/overnap/BS-VAE
Topological Tensor Eigenvalue Theorems in Data Fusion
Ronald Katende
arXiv:2409.09392 (2024-09-14)

This paper introduces a novel framework for tensor eigenvalue analysis in the context of multi-modal data fusion, leveraging topological invariants such as Betti numbers. While traditional approaches to tensor eigenvalues rely on algebraic extensions of matrix theory, this work provides a topological perspective that enriches the understanding of tensor structures. By establishing new theorems linking eigenvalues to topological features, the proposed framework offers deeper insights into the latent structure of data, enhancing both interpretability and robustness. Applications to data fusion illustrate the theoretical and practical significance of the approach, demonstrating its potential for broad impact across machine learning and data science domains.
Schrödinger Bridge Flow for Unpaired Data Translation
Valentin De Bortoli, Iryna Korshunova, Andriy Mnih, Arnaud Doucet
arXiv:2409.09347 (2024-09-14)

Mass transport problems arise in many areas of machine learning, where one wants to compute a map transporting one distribution to another. Generative modeling techniques like Generative Adversarial Networks (GANs) and Denoising Diffusion Models (DDMs) have been successfully adapted to solve such transport problems, resulting in CycleGAN and Bridge Matching, respectively. However, these methods do not approximate Optimal Transport (OT) maps, which are known to have desirable properties. Existing techniques approximating OT maps for high-dimensional data-rich problems, such as DDM-based Rectified Flow and Schrödinger Bridge procedures, require fully training a DDM-type model at each iteration or use mini-batch techniques that can introduce significant errors. We propose a novel algorithm to compute the Schrödinger Bridge, a dynamic entropy-regularised version of OT, that eliminates the need to train multiple DDM-like models. This algorithm corresponds to a discretisation of a flow of path measures, which we call the Schrödinger Bridge Flow, whose only stationary point is the Schrödinger Bridge. We demonstrate the performance of our algorithm on a variety of unpaired data translation tasks.
Distributed Clustering based on Distributional Kernel
Hang Zhang, Yang Xu, Lei Gong, Ye Zhu, Kai Ming Ting
arXiv:2409.09418 (2024-09-14)

This paper introduces a new framework for clustering in a distributed network, called Distributed Clustering based on Distributional Kernel (K), or KDC, which produces the final clusters based on similarity with respect to the distributions of the initial clusters, as measured by K. It is the only framework that satisfies all three of the following properties. First, KDC guarantees that the combined clustering outcome from all sites is equivalent to the clustering outcome of its centralized counterpart on the combined dataset from all sites. Second, the maximum runtime cost of any site in distributed mode is smaller than the runtime cost in centralized mode. Third, it is designed to discover clusters of arbitrary shapes, sizes, and densities. To the best of our knowledge, this is the first distributed clustering framework that employs a distributional kernel. The distribution-based clustering leads directly to significantly better clustering outcomes than existing methods of distributed clustering. In addition, we introduce a new clustering algorithm called Kernel Bounded Cluster Cores, which performs best among existing clustering algorithms when applied within KDC. We also show that KDC is a generic framework that enables a quadratic-time clustering algorithm to deal with large datasets that would otherwise be infeasible.
Model-independent variable selection via the rule-based variable priority
Min Lu, Hemant Ishwaran
arXiv:2409.09003 (2024-09-13)

While achieving high prediction accuracy is a fundamental goal in machine learning, an equally important task is finding a small number of features with high explanatory power. One popular selection technique is permutation importance, which assesses a variable's impact by measuring the change in prediction error after permuting the variable. However, this can be problematic due to the need to create artificial data, a problem shared by other methods as well. Another problem is that variable selection methods can be limited by being model-specific. We introduce a new model-independent approach, Variable Priority (VarPro), which works by utilizing rules without the need to generate artificial data or evaluate prediction error. The method is relatively easy to use, requiring only the calculation of sample averages of simple statistics, and can be applied to many data settings, including regression, classification, and survival analysis. We investigate the asymptotic properties of VarPro and show, among other things, that VarPro has a consistent filtering property for noise variables. Empirical studies using synthetic and real-world data show that the method achieves balanced performance and compares favorably to many state-of-the-art procedures currently used for variable selection.
Batched Online Contextual Sparse Bandits with Sequential Inclusion of Features
Rowan Swiers, Subash Prabanantham, Andrew Maher
arXiv:2409.09199 (2024-09-13)

Multi-armed bandits (MABs) are increasingly employed on online platforms and in e-commerce to optimize decision making for personalized user experiences. In this work, we focus on the contextual bandit problem with linear rewards under conditions of sparsity and batched data. We address the challenge of fairness by excluding irrelevant features from decision-making processes using a novel algorithm, Online Batched Sequential Inclusion (OBSI), which sequentially includes features as confidence in their impact on the reward increases. Our experiments on synthetic data show the superior performance of OBSI compared to other algorithms in terms of regret, relevance of the features used, and compute.