Asymptotics for Random Quadratic Transportation Costs. Martin Huesmann, Michael Goldman, Dario Trevisan. arXiv:2409.08612 (2024-09-13).

We establish the validity of asymptotic limits for the general transportation problem between random i.i.d. points and their common distribution, with respect to the squared Euclidean distance cost, in any dimension larger than three. Previous results were essentially limited to the one- and two-dimensional cases, or to distributions whose absolutely continuous part is uniform. The proof relies upon recent advances in the stability theory of optimal transportation, combined with functional-analytic techniques and ideas from quantitative stochastic homogenization. The key tool we develop is a quantitative upper bound on the usual quadratic optimal transportation problem in terms of its boundary variant, in which points can be freely transported along the boundary. The methods we use apply to more general random measures, including occupation measures of Brownian paths, and may open the door to further progress on challenging problems at the interface of analysis, probability, and discrete mathematics.
{"title":"Asymptotics for Random Quadratic Transportation Costs","authors":"Martin Huesmann, Michael Goldman, Dario Trevisan","doi":"arxiv-2409.08612","DOIUrl":"https://doi.org/arxiv-2409.08612","url":null,"abstract":"We establish the validity of asymptotic limits for the general transportation\u0000problem between random i.i.d. points and their common distribution, with\u0000respect to the squared Euclidean distance cost, in any dimension larger than\u0000three. Previous results were essentially limited to the two (or one)\u0000dimensional case, or to distributions whose absolutely continuous part is\u0000uniform. The proof relies upon recent advances in the stability theory of optimal\u0000transportation, combined with functional analytic techniques and some ideas\u0000from quantitative stochastic homogenization. The key tool we develop is a\u0000quantitative upper bound for the usual quadratic optimal transportation problem\u0000in terms of its boundary variant, where points can be freely transported along\u0000the boundary. The methods we use are applicable to more general random\u0000measures, including occupation measure of Brownian paths, and may open the door\u0000to further progress on challenging problems at the interface of analysis,\u0000probability, and discrete mathematics.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On spiked eigenvalues of a renormalized sample covariance matrix from multi-population. Weiming Li, Zeng Li, Junpeng Zhu. arXiv:2409.08715 (2024-09-13).

Sample covariance matrices from multiple populations typically exhibit several large spiked eigenvalues, which stem from differences between population means and are crucial for inference on the underlying data structure. This paper investigates the asymptotic properties of the spiked eigenvalues of a renormalized sample covariance matrix from multiple populations in the ultrahigh-dimensional setting, where the dimension-to-sample-size ratio $p/n$ goes to infinity. The first- and second-order convergence of these spikes is established based on asymptotic properties of three types of sesquilinear forms from multiple populations. These findings are further applied to two scenarios: determining the total number of subgroups, and a new criterion for evaluating clustering results in the absence of true labels. Additionally, we provide a unified framework with $p/n \to c \in (0,\infty]$ that integrates the asymptotic results in both the high- and ultrahigh-dimensional settings.
{"title":"On spiked eigenvalues of a renormalized sample covariance matrix from multi-population","authors":"Weiming Li, Zeng Li, Junpeng Zhu","doi":"arxiv-2409.08715","DOIUrl":"https://doi.org/arxiv-2409.08715","url":null,"abstract":"Sample covariance matrices from multi-population typically exhibit several\u0000large spiked eigenvalues, which stem from differences between population means\u0000and are crucial for inference on the underlying data structure. This paper\u0000investigates the asymptotic properties of spiked eigenvalues of a renormalized\u0000sample covariance matrices from multi-population in the ultrahigh dimensional\u0000context where the dimension-to-sample size ratio p/n go to infinity. The first-\u0000and second-order convergence of these spikes are established based on\u0000asymptotic properties of three types of sesquilinear forms from\u0000multi-population. These findings are further applied to two scenarios,including\u0000determination of total number of subgroups and a new criterion for evaluating\u0000clustering results in the absence of true labels. Additionally, we provide a\u0000unified framework with p/n->cin (0,infty] that integrates the asymptotic\u0000results in both high and ultrahigh dimensional settings.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Finite-Particle Convergence Rates for Stein Variational Gradient Descent. Krishnakumar Balasubramanian, Sayan Banerjee, Promit Ghosal. arXiv:2409.08469 (2024-09-13).

We provide finite-particle convergence rates for the Stein Variational Gradient Descent (SVGD) algorithm in the Kernel Stein Discrepancy ($\mathsf{KSD}$) and Wasserstein-2 metrics. Our key insight is the observation that the time derivative of the relative entropy between the joint density of the $N$ particle locations and the $N$-fold product target measure, starting from a regular initial distribution, splits into a dominant "negative part" proportional to $N$ times the expected $\mathsf{KSD}^2$ and a smaller "positive part". This observation leads to $\mathsf{KSD}$ rates of order $1/\sqrt{N}$, providing a near-optimal, double-exponential improvement over the recent result of \cite{shi2024finite}. Under mild assumptions on the kernel and potential, these bounds also grow linearly in the dimension $d$. By adding a bilinear component to the kernel, the same approach further yields Wasserstein-2 convergence. For "bilinear + Matérn" kernels, we derive Wasserstein-2 rates that exhibit a curse of dimensionality similar to the i.i.d. setting. We also obtain marginal convergence and long-time propagation-of-chaos results for the time-averaged particle laws.
{"title":"Improved Finite-Particle Convergence Rates for Stein Variational Gradient Descent","authors":"Krishnakumar Balasubramanian, Sayan Banerjee, Promit Ghosal","doi":"arxiv-2409.08469","DOIUrl":"https://doi.org/arxiv-2409.08469","url":null,"abstract":"We provide finite-particle convergence rates for the Stein Variational\u0000Gradient Descent (SVGD) algorithm in the Kernel Stein Discrepancy\u0000($mathsf{KSD}$) and Wasserstein-2 metrics. Our key insight is the observation\u0000that the time derivative of the relative entropy between the joint density of\u0000$N$ particle locations and the $N$-fold product target measure, starting from a\u0000regular initial distribution, splits into a dominant `negative part'\u0000proportional to $N$ times the expected $mathsf{KSD}^2$ and a smaller `positive\u0000part'. This observation leads to $mathsf{KSD}$ rates of order $1/sqrt{N}$,\u0000providing a near optimal double exponential improvement over the recent result\u0000by~cite{shi2024finite}. Under mild assumptions on the kernel and potential,\u0000these bounds also grow linearly in the dimension $d$. By adding a bilinear\u0000component to the kernel, the above approach is used to further obtain\u0000Wasserstein-2 convergence. For the case of `bilinear + Mat'ern' kernels, we\u0000derive Wasserstein-2 rates that exhibit a curse-of-dimensionality similar to\u0000the i.i.d. setting. We also obtain marginal convergence and long-time\u0000propagation of chaos results for the time-averaged particle laws.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Classification-based Anomaly Detection with Neural Networks: Theory and Practice. Tian-Yi Zhou, Matthew Lau, Jizhou Chen, Wenke Lee, Xiaoming Huo. arXiv:2409.08521 (2024-09-13).
Anomaly detection is an important problem in many application areas, such as network security. Many deep learning methods for unsupervised anomaly detection achieve good empirical performance but lack theoretical guarantees. By casting anomaly detection as a binary classification problem, we establish non-asymptotic upper bounds and a convergence rate on the excess risk of rectified linear unit (ReLU) neural networks trained on synthetic anomalies. Our convergence rate on the excess risk matches the minimax-optimal rate in the literature. Furthermore, we provide lower and upper bounds on the number of synthetic anomalies needed to attain this optimality. For practical implementation, we relax some conditions to improve the search for the empirical risk minimizer, which yields performance competitive with other classification-based methods for anomaly detection. Overall, our work provides the first theoretical guarantees for unsupervised neural-network-based anomaly detectors, together with empirical insights on how to design them well.
{"title":"Optimal Classification-based Anomaly Detection with Neural Networks: Theory and Practice","authors":"Tian-Yi Zhou, Matthew Lau, Jizhou Chen, Wenke Lee, Xiaoming Huo","doi":"arxiv-2409.08521","DOIUrl":"https://doi.org/arxiv-2409.08521","url":null,"abstract":"Anomaly detection is an important problem in many application areas, such as\u0000network security. Many deep learning methods for unsupervised anomaly detection\u0000produce good empirical performance but lack theoretical guarantees. By casting\u0000anomaly detection into a binary classification problem, we establish\u0000non-asymptotic upper bounds and a convergence rate on the excess risk on\u0000rectified linear unit (ReLU) neural networks trained on synthetic anomalies.\u0000Our convergence rate on the excess risk matches the minimax optimal rate in the\u0000literature. Furthermore, we provide lower and upper bounds on the number of\u0000synthetic anomalies that can attain this optimality. For practical\u0000implementation, we relax some conditions to improve the search for the\u0000empirical risk minimizer, which leads to competitive performance to other\u0000classification-based methods for anomaly detection. Overall, our work provides\u0000the first theoretical guarantees of unsupervised neural network-based anomaly\u0000detectors and empirical insights on how to design them well.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the maximal correlation coefficient for the bivariate Marshall Olkin distribution. Axel Bücher, Torben Staud. arXiv:2409.08661 (2024-09-13).

We prove a formula for the maximal correlation coefficient of the bivariate Marshall Olkin distribution that was conjectured in Lin, Lai, and Govindaraju (2016, Stat. Methodol., 29:1-9). The formula is applied to obtain a new proof for a variance inequality in extreme value statistics that links the disjoint and the sliding block maxima method.
{"title":"On the maximal correlation coefficient for the bivariate Marshall Olkin distribution","authors":"Axel Bücher, Torben Staud","doi":"arxiv-2409.08661","DOIUrl":"https://doi.org/arxiv-2409.08661","url":null,"abstract":"We prove a formula for the maximal correlation coefficient of the bivariate\u0000Marshall Olkin distribution that was conjectured in Lin, Lai, and Govindaraju\u0000(2016, Stat. Methodol., 29:1-9). The formula is applied to obtain a new proof\u0000for a variance inequality in extreme value statistics that links the disjoint\u0000and the sliding block maxima method.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Admissibility in Bipartite Incidence Graph Sampling. Pedro García-Segador, Li-Chung Zhang. arXiv:2409.07970 (2024-09-12).

In bipartite incidence graph sampling, the target study units may be formed as connected population elements, which are distinct from the units of sampling, and there may in general exist more than one way by which a given study unit can be observed via sampling units. This generalizes finite-population element or multistage sampling, where each element can only be sampled directly or via a single primary sampling unit. We study the admissibility of estimators in bipartite incidence graph sampling and identify admissible estimators other than the classic Horvitz-Thompson estimator. Our admissibility results encompass those for finite-population sampling.
{"title":"On Admissibility in Bipartite Incidence Graph Sampling","authors":"Pedro García-Segador, Li-Chung Zhang","doi":"arxiv-2409.07970","DOIUrl":"https://doi.org/arxiv-2409.07970","url":null,"abstract":"In bipartite incidence graph sampling, the target study units may be formed\u0000as connected population elements, which are distinct to the units of sampling\u0000and there may exist generally more than one way by which a given study unit can\u0000be observed via sampling units. This generalizes ?nite-population element or\u0000multistage sampling, where each element can only be sampled directly or via a\u0000single primary sampling unit. We study the admissibility of estimators in\u0000bipartite incidence graph sampling and identify other admissible estimators\u0000than the classic Horvitz-Thompson estimator. Our admissibility results\u0000encompass those for ?nite-population sampling.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized Independence Test for Modern Data. Mingshuo Liu, Doudou Zhou, Hao Chen. arXiv:2409.07745 (2024-09-12).

The test of independence is a crucial component of modern data analysis. However, traditional methods often struggle with the complex dependency structures found in high-dimensional data. To overcome this challenge, we introduce a novel test statistic that captures intricate relationships using similarity and dissimilarity information derived from the data. The statistic exhibits strong power across a broad range of alternatives for high-dimensional data, as demonstrated in extensive simulation studies. Under mild conditions, we show that the new test statistic converges to the $\chi^2_4$ distribution under the permutation null distribution, ensuring straightforward type I error control. Furthermore, our research advances the moment method in proving the joint asymptotic normality of multiple double-indexed permutation statistics. We showcase the practical utility of this new test with an application to the Genotype-Tissue Expression dataset, where it effectively measures associations between human tissues.
Quickest Change Detection Using Mismatched CUSUM. Austin Cooper, Sean Meyn. arXiv:2409.07948 (2024-09-12).

The field of quickest change detection (QCD) concerns the design and analysis of algorithms that estimate, in real time, the time at which an important event takes place, and that identify properties of the post-change behavior. The goal is to devise a stopping time adapted to the observations that minimizes an $L_1$ loss. Approximately optimal solutions are well known under a variety of assumptions. In the work surveyed here we consider the CUSUM statistic, which is defined as a one-dimensional reflected random walk driven by a functional of the observations. It is known that the optimal functional is a log-likelihood ratio, subject to special statistical assumptions. The paper concerns model-free approaches to detection design, considering the following questions:
1. What is the performance for a given functional of the observations?
2. How do the conclusions change when there is dependency between pre- and post-change behavior?
3. How can techniques from statistics and machine learning be adapted to approximate the best functional in a given class?
{"title":"Quickest Change Detection Using Mismatched CUSUM","authors":"Austin Cooper, Sean Meyn","doi":"arxiv-2409.07948","DOIUrl":"https://doi.org/arxiv-2409.07948","url":null,"abstract":"The field of quickest change detection (QCD) concerns design and analysis of\u0000algorithms to estimate in real time the time at which an important event takes\u0000place and identify properties of the post-change behavior. The goal is to\u0000devise a stopping time adapted to the observations that minimizes an $L_1$\u0000loss. Approximately optimal solutions are well known under a variety of\u0000assumptions. In the work surveyed here we consider the CUSUM statistic, which\u0000is defined as a one-dimensional reflected random walk driven by a functional of\u0000the observations. It is known that the optimal functional is a log likelihood\u0000ratio subject to special statical assumptions. The paper concerns model free approaches to detection design, considering the\u0000following questions: 1. What is the performance for a given functional of the observations? 2. How do the conclusions change when there is dependency between pre- and\u0000post-change behavior? 3. How can techniques from statistics and machine learning be adapted to\u0000approximate the best functional in a given class?","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models. Atsushi Suzuki, Kota Fukuzawa, Kenji Yamanishi. arXiv:2409.08387 (2024-09-13).

The normalized maximum likelihood (NML) code length is widely used as a model selection criterion based on the minimum description length principle, where the model with the shortest NML code length is selected. A common method to calculate the NML code length is to use the sum (for a discrete model) or integral (for a continuous model) of a function defined by the distribution of the maximum likelihood estimator. While this method has been proven to correctly calculate the NML code length of discrete models, no proof has been provided for continuous cases. Consequently, it has remained unclear whether the method can accurately calculate the NML code length of continuous models. In this paper, we solve this problem affirmatively, proving that the method is also correct for continuous cases. Remarkably, completing the proof for continuous cases is non-trivial in that it cannot be achieved by merely replacing the sums in discrete cases with integrals, as the decomposition trick applied to sums in the discrete-model proof is not applicable to integrals in the continuous-model proof. To overcome this, we introduce a novel decomposition approach based on the coarea formula from geometric measure theory, which is essential to establishing our proof for continuous cases.
{"title":"Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models","authors":"Atsushi Suzuki, Kota Fukuzawa, Kenji Yamanishi","doi":"arxiv-2409.08387","DOIUrl":"https://doi.org/arxiv-2409.08387","url":null,"abstract":"The normalized maximum likelihood (NML) code length is widely used as a model\u0000selection criterion based on the minimum description length principle, where\u0000the model with the shortest NML code length is selected. A common method to\u0000calculate the NML code length is to use the sum (for a discrete model) or\u0000integral (for a continuous model) of a function defined by the distribution of\u0000the maximum likelihood estimator. While this method has been proven to\u0000correctly calculate the NML code length of discrete models, no proof has been\u0000provided for continuous cases. Consequently, it has remained unclear whether\u0000the method can accurately calculate the NML code length of continuous models.\u0000In this paper, we solve this problem affirmatively, proving that the method is\u0000also correct for continuous cases. Remarkably, completing the proof for\u0000continuous cases is non-trivial in that it cannot be achieved by merely\u0000replacing the sums in discrete cases with integrals, as the decomposition trick\u0000applied to sums in the discrete model case proof is not applicable to integrals\u0000in the continuous model case proof. To overcome this, we introduce a novel\u0000decomposition approach based on the coarea formula from geometric measure\u0000theory, which is essential to establishing our proof for continuous cases.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142268000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback-Leibler Divergence Learning. Yuichi Ishida, Yuma Ichikawa, Aki Dote, Toshiyuki Miyazawa, Koji Hukushima. arXiv:2409.07679 (2024-09-12).
We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strengths of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the notorious issues of underfitting with the forward KLD and mode collapse with the reverse KLD. Since the sum of the forward and reverse KLDs might seem sufficient to combine the strengths of both approaches, we include this learning method as a direct baseline in our numerical experiments to evaluate its effectiveness. The experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy-function fitting, mode covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other methods become more pronounced as the dimension of the target models increases.
{"title":"Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning","authors":"Yuichi Ishida, Yuma Ichikawa, Aki Dote, Toshiyuki Miyazawa, Koji Hukushima","doi":"arxiv-2409.07679","DOIUrl":"https://doi.org/arxiv-2409.07679","url":null,"abstract":"We propose ratio divergence (RD) learning for discrete energy-based models, a\u0000method that utilizes both training data and a tractable target energy function.\u0000We apply RD learning to restricted Boltzmann machines (RBMs), which are a\u0000minimal model that satisfies the universal approximation theorem for discrete\u0000distributions. RD learning combines the strength of both forward and reverse\u0000Kullback-Leibler divergence (KLD) learning, effectively addressing the\u0000\"notorious\" issues of underfitting with the forward KLD and mode-collapse with\u0000the reverse KLD. Since the summation of forward and reverse KLD seems to be\u0000sufficient to combine the strength of both approaches, we include this learning\u0000method as a direct baseline in numerical experiments to evaluate its\u0000effectiveness. Numerical experiments demonstrate that RD learning significantly\u0000outperforms other learning methods in terms of energy function fitting,\u0000mode-covering, and learning stability across various discrete energy-based\u0000models. Moreover, the performance gaps between RD learning and the other\u0000learning methods become more pronounced as the dimensions of target models\u0000increase.","PeriodicalId":501379,"journal":{"name":"arXiv - STAT - Statistics Theory","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}