Many recent advances in the sequential assimilation of data into nonlinear, high-dimensional models are modifications to particle filters that employ efficient searches of a high-dimensional state space. In this work, we present a complementary strategy that combines statistical emulators and particle filters. The emulators learn the forward dynamic mapping and offer a computationally cheap approximation to it. This emulator-particle filter (Emu-PF) approach requires a modest number of forward-model runs, but yields well-resolved posterior distributions even in non-Gaussian cases. We explore several modifications to the Emu-PF that use dimension-reduction mechanisms to fit the statistical emulator efficiently, and present a series of simulation experiments on an atypical Lorenz-96 system to demonstrate their performance. We conclude with a discussion of how the Emu-PF can be paired with modern particle filtering algorithms.
Maclean, J. and Spiller, E., "A surrogate-based approach to nonlinear, non-Gaussian joint state-parameter data assimilation," Foundations of Data Science, published 2020-12-08. doi:10.3934/fods.2021019
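The emulator-in-the-loop idea can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' Emu-PF: it replaces their statistical emulator with a simple polynomial fit to a handful of forward-model runs (the toy `forward_model`, the observation noise level, and the prior are all invented for the example) and uses it inside one bootstrap particle filter analysis step.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_model(x):
    # Stand-in for an expensive nonlinear forward map.
    return np.sin(x) + 0.5 * x

# Fit a cheap polynomial emulator from a modest number of model runs.
design = np.linspace(-3, 3, 15)
emulator = np.polynomial.Polynomial.fit(design, forward_model(design), deg=5)

# One bootstrap particle filter analysis step, propagating with the emulator.
n = 500
particles = rng.normal(0.0, 1.0, n)            # prior ensemble
obs = forward_model(0.7) + 0.1 * rng.normal()  # synthetic observation

pred = emulator(particles)                # cheap forecasts, no extra model runs
logw = -0.5 * ((obs - pred) / 0.1) ** 2   # Gaussian log-likelihood
w = np.exp(logw - logw.max())
w /= w.sum()

# Multinomial resampling gives an equally weighted posterior ensemble.
posterior = particles[rng.choice(n, size=n, p=w)]
```

Only the 15 design runs touch the (notionally expensive) forward model; the 500 particle forecasts all go through the fitted surrogate.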
We study the problem of estimating linear response statistics under external perturbations using time series of unperturbed dynamics. Based on the fluctuation-dissipation theory, this problem is reformulated as an unsupervised learning task of estimating a density function. We consider a nonparametric density estimator formulated by the kernel embedding of distributions with "Mercer-type" kernels, constructed from the classical orthogonal polynomials defined on non-compact domains. While the resulting representation is analogous to Polynomial Chaos Expansion (PCE), the connection to reproducing kernel Hilbert space (RKHS) theory allows one to establish the uniform convergence of the estimator and to systematically address the practical question of identifying a PCE basis for consistent estimation. We also provide practical conditions for the well-posedness not only of the estimator but also of the underlying response statistics. Finally, we provide a statistical error bound for the density estimation that accounts for the Monte-Carlo averaging over non-i.i.d. time series and the biases due to a finite basis truncation. This error bound provides a means to understand the feasibility, as well as the limitations, of the kernel embedding with Mercer-type kernels. Numerically, we verify the effectiveness of the estimator on two stochastic dynamics with known yet non-trivial equilibrium densities.
Zhang, H., Harlim, J. and Li, X., "Estimating linear response statistics using orthogonal polynomials: An RKHS formulation," Foundations of Data Science, published 2020-12-08. doi:10.3934/fods.2020021
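The core density estimator, an expansion of the sampled density in orthogonal polynomials against their natural weight, can be illustrated in the simplest Hermite case. This is a hedged sketch only (i.i.d. Gaussian draws stand in for the non-i.i.d. time series, and the truncation order `K` is arbitrary), not the paper's Mercer-kernel construction:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial

rng = np.random.default_rng(1)
samples = rng.normal(size=5000)   # i.i.d. stand-in for the time series

K = 6  # truncation order of the polynomial basis
# c_k = E[He_k(X)] / k!, estimated by a Monte-Carlo average over the samples.
coeffs = np.array([He.hermeval(samples, np.eye(K)[k]).mean() / factorial(k)
                   for k in range(K)])

def density(x):
    # Expansion against the Gaussian weight: p(x) ~ phi(x) * sum_k c_k He_k(x).
    phi = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    return phi * He.hermeval(x, coeffs)

est = density(np.linspace(-3, 3, 61))
```

For standard Gaussian data all coefficients beyond c_0 vanish in expectation, so the estimate recovers the standard normal density up to Monte-Carlo error.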
Audun D. Myers, Firas A. Khasawneh, Brittany Terese Fasy
We introduce Additive Noise Analysis for Persistence Thresholding (ANAPT), a novel method that separates significant features in the sublevel set persistence diagram of a time series based on a statistical analysis of the persistence of a noise distribution. Specifically, we consider an additive noise model and leverage the statistical analysis to provide a noise cutoff or confidence interval in the persistence diagram for the observed time series. This analysis is carried out for several common noise models, including the Gaussian, uniform, exponential, and Rayleigh distributions. ANAPT is computationally efficient, does not require any signal pre-filtering, is widely applicable, and has open-source software available. We demonstrate the functionality of ANAPT with both numerically simulated examples and an experimental data set. Additionally, we provide an efficient $\Theta(n\log n)$ algorithm for computing zero-dimensional sublevel set persistent homology.
Myers, A. D., Khasawneh, F. A. and Fasy, B. T., "ANAPT: Additive noise analysis for persistence thresholding," Foundations of Data Science, published 2020-12-07. doi:10.3934/fods.2022005
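The $\Theta(n\log n)$ zero-dimensional sublevel-set computation mentioned above is, in its standard formulation, a sort followed by union-find with the elder rule. The sketch below shows that generic algorithm for a 1-D time series; it is not the authors' released implementation:

```python
def sublevel_persistence(ts):
    """Zero-dimensional sublevel-set persistence pairs (birth, death) of a
    1-D time series via union-find and the elder rule; the sort makes the
    whole computation O(n log n)."""
    n = len(ts)
    order = sorted(range(n), key=lambda i: ts[i])
    parent = [-1] * n                  # -1 marks a not-yet-added vertex
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in order:                    # add vertices in increasing order
        parent[i] = i
        for j in (i - 1, i + 1):       # connect to already-added neighbours
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component whose minimum is higher
                # (younger) dies when the two components merge.
                young, old = (ri, rj) if ts[ri] > ts[rj] else (rj, ri)
                if ts[young] < ts[i]:  # skip degenerate zero-length pairs
                    pairs.append((ts[young], ts[i]))
                parent[young] = old
    pairs.append((min(ts), float('inf')))  # the global minimum never dies
    return pairs
```

For example, the series `[0, 2, 1, 3]` has local minima at values 0 and 1; the shallower one dies at the saddle value 2, and the global minimum persists forever.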
We consider the class of Ensemble Square Root filtering algorithms for the numerical approximation of the posterior distribution of nonlinear Markovian signals, partially observed through linear observations corrupted by independent measurement noise. We analyze the asymptotic behavior of these algorithms in the large-ensemble limit, in both discrete and continuous time. We identify limiting mean-field processes at the level of the ensemble members, prove corresponding propagation-of-chaos results, and derive associated convergence rates in terms of the ensemble size. In continuous time, we also identify the stochastic partial differential equation driving the distribution of the mean-field process and compare it with the Kushner-Stratonovich equation.
Lange, T. and Stannat, W., "Mean field limit of Ensemble Square Root filters - discrete and continuous time," Foundations of Data Science, published 2020-11-20. doi:10.3934/FODS.2021003
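For readers unfamiliar with the filter class being analyzed, a scalar deterministic square root update looks as follows. This is a generic textbook-style sketch (the toy ensemble and noise levels are invented here), not the particular variant treated in the paper:

```python
import numpy as np

def ensrf_update(ens, y, H, R):
    # Forecast mean and sample variance from the ensemble.
    xbar = ens.mean()
    a = ens - xbar                      # anomalies
    P = (a @ a) / (len(ens) - 1)
    K = P * H / (H * P * H + R)         # Kalman gain
    # Deterministic square-root update: shift the mean by the Kalman
    # correction and shrink the anomalies so the analysis variance is
    # exactly (1 - K H) P, with no perturbed observations.
    return xbar + K * (y - H * xbar) + np.sqrt(1.0 - K * H) * a

ens = np.random.default_rng(5).normal(1.0, 2.0, 50)
post = ensrf_update(ens, y=0.0, H=1.0, R=0.5)
```

Unlike the stochastic (perturbed-observation) EnKF, the anomaly rescaling reproduces the Kalman analysis variance exactly for the sample statistics, which is the structure whose large-ensemble limit the paper studies.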
The purpose of this paper is to describe the feedback particle filter algorithm for problems with a large number ($M$) of non-interacting agents (targets) and a large number ($M$) of non-agent-specific observations (measurements) originating from these agents. In its basic form, the problem is characterized by data association uncertainty, whereby the association between the observations and the agents must be deduced in addition to the agent states. In this paper, the large-$M$ limit is interpreted as a problem of collective inference. This viewpoint is used to derive the equation for the empirical distribution of the hidden agent states. A feedback particle filter (FPF) algorithm for this problem is presented and illustrated via numerical simulations. Results are presented for both the Euclidean and the finite state-space cases, in continuous-time settings. The classical FPF algorithm is shown to be the special case ($M = 1$) of these more general results. The simulations show that the algorithm approximates the empirical distribution of the hidden states well for large $M$.
Kim, J. W. and Mehta, P., "Feedback particle filter for collective inference," Foundations of Data Science, published 2020-10-13. doi:10.3934/fods.2021018
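The data-association uncertainty at the heart of the problem can be conveyed with a crude importance-sampling analogue of the collective setting: each agent weights its particles by a likelihood averaged over all unlabeled observations, since the observation-to-agent labels are unknown. All numbers below are invented for illustration, and this is not the FPF itself, which applies a feedback gain to each particle rather than weighting and resampling:

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, sigma = 3, 300, 0.2
truth = np.array([-2.0, 0.0, 2.0])
# One unlabeled measurement per agent, arriving in unknown order.
obs = rng.permutation(truth + sigma * rng.normal(size=M))

# One particle cloud per agent, initialized around its prior position.
clouds = truth[:, None] + 0.5 * rng.normal(size=(M, n))
for m in range(M):
    # Association-free importance weight: average the likelihood over all
    # observations, because observation-to-agent labels are unavailable.
    lik = np.exp(-0.5 * ((obs[None, :] - clouds[m][:, None]) / sigma) ** 2)
    w = lik.mean(axis=1)
    w /= w.sum()
    clouds[m] = clouds[m][rng.choice(n, size=n, p=w)]
```

Because each prior is informative, each cloud still concentrates near its own agent's measurement even though no explicit association is ever computed.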
Multiple hypothesis testing requires a control procedure: the error probabilities in statistical testing compound when several tests are performed toward the same conclusion. A common family of multiple-testing error rates is the Family-Wise Error Rate (FWER), which measures the probability that any one of the performed tests erroneously rejects its null hypothesis. The FWER is often controlled using Bonferroni's method or later, more sophisticated approaches, all of which involve replacing the test level α with α/k, reducing it by a factor of the number k of simultaneous tests. Common paradigms for hypothesis testing in persistent homology are based on permutation testing; however, increasing the number of permutations to meet a Bonferroni-style threshold can be prohibitively expensive. In this paper, we propose a null-model-based approach to testing for acyclicity (i.e., trivial homology), coupled with a Family-Wise Error Rate control method that does not suffer from these computational costs.
Vejdemo-Johansson, M. and Mukherjee, S., "Multiple hypothesis testing with persistent homology," Foundations of Data Science, published 2020-10-10. doi:10.3934/fods.2022018
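The baseline corrections the abstract alludes to are short to state in code. Below is a minimal sketch of Bonferroni and of Holm's step-down refinement (the latter being one of the "more sophisticated approaches"); both control the FWER at level α:

```python
def bonferroni(pvals, alpha=0.05):
    # Reject H0_i iff p_i < alpha/k; FWER <= alpha by the union bound.
    k = len(pvals)
    return [p < alpha / k for p in pvals]

def holm(pvals, alpha=0.05):
    # Step-down refinement: same FWER guarantee, uniformly more power.
    # Compare the j-th smallest p-value against alpha/(k - j), stopping
    # at the first failure.
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for step, i in enumerate(order):
        if pvals[i] >= alpha / (k - step):
            break                      # all remaining hypotheses retained
        reject[i] = True
    return reject
```

On the p-values (0.001, 0.02, 0.04) at α = 0.05, Bonferroni rejects only the first, while Holm rejects all three, which illustrates the power gained by stepping down.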
The oscillations observed in many time series, particularly in biomedicine, exhibit morphological variations over time. These morphological variations are caused by intrinsic or extrinsic changes to the state of the generating system, henceforth referred to as the dynamics. To model such time series (specifically including pathophysiological ones) and estimate the underlying dynamics, we provide a novel wave-shape oscillatory model, in which time-dependent variations in cycle shape occur along a manifold called the wave-shape manifold. To estimate the wave-shape manifold associated with an oscillatory time series, study the dynamics, and visualize the time-dependent changes along the wave-shape manifold, we apply the well-established diffusion maps (DM) algorithm to the set of all observed oscillations. We provide a theoretical guarantee on the dynamical information recovered by the DM algorithm under the proposed model. Applying the proposed model and algorithm to arterial blood pressure (ABP) signals recorded during general anesthesia leads to the extraction of nociception information. Applying them to cardiac cycles in the electrocardiogram (ECG) leads to ectopy detection and a new ECG-derived respiratory signal, even when the subject has atrial fibrillation.
Lin, Y.-T., Malik, J. and Wu, H.-T., "Wave-shape oscillatory model for nonstationary periodic time series analysis," Foundations of Data Science, published 2020-07-13. doi:10.3934/FODS.2021009
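The diffusion maps step used to recover the wave-shape manifold can be sketched generically: build a Gaussian affinity over the observed cycles, normalize it into a random-walk transition matrix, and embed with its leading non-trivial eigenvectors. The bandwidth `eps` and the plain random-walk normalization below are simplifying choices for illustration, not the paper's exact construction:

```python
import numpy as np

def diffusion_maps(X, eps, n_coords=2):
    # Pairwise squared distances -> Gaussian affinities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / eps)
    # Random-walk normalization: row-stochastic transition matrix.
    P = K / K.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-vals.real)
    # Drop the trivial constant eigenvector (eigenvalue 1); scale the
    # remaining coordinates by their eigenvalues.
    sel = idx[1:n_coords + 1]
    return vecs.real[:, sel] * vals.real[sel]

# Toy "cycles": points sampled on a circle, standing in for the set of
# observed oscillations, get a two-dimensional embedding.
theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
emb = diffusion_maps(np.c_[np.cos(theta), np.sin(theta)], eps=0.5)
```

In the paper's setting, each row of `X` would instead be one observed oscillation (a cycle segment), and the embedding coordinates trace the time-dependent motion along the wave-shape manifold.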
We apply persistent homology, the dominant tool from the field of topological data analysis, to study electoral redistricting. Our method combines the geographic information from a political districting plan with election data to produce a persistence diagram. We are then able to visualize and analyze large ensembles of computer-generated districting plans of the type commonly used in modern redistricting research (and court challenges). We set out three applications: zoning a state at each scale of districting, comparing elections, and seeking signals of gerrymandering. Our case studies focus on redistricting in Pennsylvania and North Carolina, two states whose legal challenges to enacted plans have raised considerable public interest in the last few years. To address the question of robustness of the persistence diagrams to perturbations in vote data and in district boundaries, we translate the classical stability theorem of Cohen-Steiner et al. into our setting and find that it can be phrased in a manner that is easy to interpret. We accompany the theoretical bound with an empirical demonstration to illustrate diagram stability in practice.
Duchin, M., Needham, T. and Weighill, T., "The (homological) persistence of gerrymandering," Foundations of Data Science, published 2020-07-05. doi:10.3934/FODS.2021007
We consider a combined state and drift estimation problem for the linear stochastic heat equation. The infinite-dimensional Bayesian inference problem is formulated in terms of the Kalman-Bucy filter over an extended state space, and its long-time asymptotic properties are studied. Asymptotic posterior contraction rates in the unknown drift function are the main contribution of this paper. Such rates have been studied before for stationary non-parametric Bayesian inverse problems, and here we demonstrate the consistency of our time-dependent formulation with these previous results building upon scale separation and a slow manifold approximation.
Reich, S. and Rozdeba, P., "Posterior contraction rates for non-parametric state and drift estimation," Foundations of Data Science, published 2020-03-20. doi:10.3934/fods.2020016
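The extended-state idea, appending the unknown drift to the state and filtering jointly, has a simple discrete-time analogue. Below, a scalar Kalman filter estimates an unknown constant drift `b` from noisy observations of the state; all model numbers are invented, and this toy discretization is only a caricature of the paper's infinite-dimensional Kalman-Bucy formulation:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, b_true = 0.1, 1.5
F = np.array([[1.0, dt], [0.0, 1.0]])   # augmented state z = (x, b)
Q = np.diag([1e-3, 0.0])                # no dynamics noise on the parameter
H = np.array([[1.0, 0.0]])              # observe the state x only
R = np.array([[0.05]])

z, P = np.zeros(2), np.eye(2)           # prior on (x, b)
x = 0.0
for _ in range(400):
    # Truth and observation.
    x = x + b_true * dt + np.sqrt(1e-3) * rng.normal()
    y = x + np.sqrt(0.05) * rng.normal()
    # Kalman predict.
    z = F @ z
    P = F @ P @ F.T + Q
    # Kalman update.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    z = z + K @ (np.array([y]) - H @ z)
    P = (np.eye(2) - K @ H) @ P

b_est = z[1]                            # posterior mean of the drift
```

The shrinking `P[1, 1]` entry is the discrete analogue of the posterior contraction in the drift that the paper quantifies.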
This paper presents mathematical results in support of the probabilistic learning on manifolds (PLoM) methodology recently introduced by the authors, which has been used successfully for analyzing complex engineering systems. PLoM considers a given initial dataset consisting of a small number of points in a Euclidean space, interpreted as independent realizations of a vector-valued random variable whose non-Gaussian probability measure is unknown but is, a priori, concentrated on an unknown subset of the Euclidean space. The objective is to construct a learned dataset of additional realizations that allow the evaluation of converged statistics. The probability measure estimated from the initial dataset is transported through a linear transformation constructed using a reduced-order diffusion-maps basis. In this paper, it is proven that this transported measure is a marginal distribution of the invariant measure of a reduced-order Itô stochastic differential equation corresponding to a dissipative Hamiltonian dynamical system. This construction preserves the concentration of the probability measure. This property is shown by analyzing a distance between the random matrix constructed with PLoM and the matrix representing the initial dataset, as a function of the dimension of the basis. It is further proven that this distance attains its minimum at a dimension of the reduced-order diffusion-maps basis that is strictly smaller than the number of points in the initial dataset. Finally, a brief numerical application illustrates the mathematical results.
Soize, C. and Ghanem, R., "Probabilistic learning on manifolds," Foundations of Data Science, published 2020-02-28. doi:10.3934/fods.2020013
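The sampling mechanism, an Itô diffusion whose invariant measure reproduces the data-estimated measure so that a long trajectory yields additional concentrated realizations, can be caricatured without the diffusion-maps reduction. The sketch below runs overdamped Langevin dynamics on a Gaussian kernel density estimate of a small dataset; the bandwidth, step size, and toy dataset are all invented, and the reduced-order basis that defines PLoM proper is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

# Small initial dataset concentrated near a half-circle in R^2.
t = rng.uniform(0, np.pi, 40)
data = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(40, 2))

s = 0.2  # kernel density estimate bandwidth

def grad_log_kde(x):
    # Gradient of the log of a Gaussian KDE built on the dataset.
    diff = data - x
    w = np.exp(-(diff ** 2).sum(axis=1) / (2 * s * s))
    return (w[:, None] * diff).sum(axis=0) / (s * s * w.sum())

# Overdamped Langevin dynamics: its invariant measure is the KDE, so a
# long trajectory yields additional realizations concentrated near the data.
x, dt = data[0].copy(), 0.01
samples = []
for k in range(2000):
    x = x + dt * grad_log_kde(x) + np.sqrt(2 * dt) * rng.normal(size=2)
    if k % 20 == 0:
        samples.append(x.copy())
samples = np.array(samples)
```

The generated points stay near the half-circle rather than filling the ambient plane, which is the concentration property the paper's reduced-order construction is designed to preserve.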