Pub Date : 2020-01-01DOI: 10.1017/9781108755528.004
Avrim Blum, J. Hopcroft, R. Kannan
{"title":"Random Walks and Markov Chains","authors":"Avrim Blum, J. Hopcroft, R. Kannan","doi":"10.1017/9781108755528.004","DOIUrl":"https://doi.org/10.1017/9781108755528.004","url":null,"abstract":"","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1017/9781108755528.004","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"56925823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This papers shows that nonlinear filter in the case of deterministic dynamics is stable with respect to the initial conditions under the conditions that observations are sufficiently rich, both in the context of continuous and discrete time filters. Earlier works on the stability of the nonlinear filters are in the context of stochastic dynamics and assume conditions like compact state space or time independent observation model, whereas we prove filter stability for deterministic dynamics with more general assumptions on the state space and observation process. We give several examples of systems that satisfy these assumptions. We also show that the asymptotic structure of the filtering distribution is related to the dynamical properties of the signal.
{"title":"Stability of non-linear filter for deterministic dynamics","authors":"A. Reddy, A. Apte","doi":"10.3934/fods.2021025","DOIUrl":"https://doi.org/10.3934/fods.2021025","url":null,"abstract":"This papers shows that nonlinear filter in the case of deterministic dynamics is stable with respect to the initial conditions under the conditions that observations are sufficiently rich, both in the context of continuous and discrete time filters. Earlier works on the stability of the nonlinear filters are in the context of stochastic dynamics and assume conditions like compact state space or time independent observation model, whereas we prove filter stability for deterministic dynamics with more general assumptions on the state space and observation process. We give several examples of systems that satisfy these assumptions. We also show that the asymptotic structure of the filtering distribution is related to the dynamical properties of the signal.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46089690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article introduces a Bayesian nonparametric method for quantifying the relative evidence in a dataset in favour of the dependence or independence of two variables conditional on a third. The approach uses Polya tree priors on spaces of conditional probability densities, accounting for uncertainty in the form of the underlying distributions in a nonparametric way. The Bayesian perspective provides an inherently symmetric probability measure of conditional dependence or independence, a feature particularly advantageous in causal discovery and not employed in existing procedures of this type.
{"title":"A Bayesian nonparametric test for conditional independence","authors":"Onur Teymur, S. Filippi","doi":"10.3934/FODS.2020009","DOIUrl":"https://doi.org/10.3934/FODS.2020009","url":null,"abstract":"This article introduces a Bayesian nonparametric method for quantifying the relative evidence in a dataset in favour of the dependence or independence of two variables conditional on a third. The approach uses Polya tree priors on spaces of conditional probability densities, accounting for uncertainty in the form of the underlying distributions in a nonparametric way. The Bayesian perspective provides an inherently symmetric probability measure of conditional dependence or independence, a feature particularly advantageous in causal discovery and not employed in existing procedures of this type.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43943702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic interaction networks frequently arise in biology, communications technology and the social sciences, representing, for example, neuronal connectivity in the brain, internet connections between computers and human interactions within social networks. The evolution and strengthening of the links in such networks can be observed through sequences of connection events occurring between network nodes over time. In some of these applications, the identity and size of the network may be unknown a priori and may change over time. In this article, a model for the evolution of dynamic networks based on the Pitman-Yor process is proposed. This model explicitly admits power-laws in the number of connections on each edge, often present in real world networks, and, for careful choices of the parameters, power-laws for the degree distribution of the nodes. A novel empirical method for the estimation of the hyperparameters of the Pitman-Yor process is proposed, and some necessary corrections for uniform discrete base distributions are carefully addressed. The methodology is tested on synthetic data and in an anomaly detection study on the enterprise computer network of the Los Alamos National Laboratory, and successfully detects connections from a red-team penetration test.
{"title":"Modelling dynamic network evolution as a Pitman-Yor process","authors":"Francesco Sanna Passino, N. Heard","doi":"10.3934/fods.2019013","DOIUrl":"https://doi.org/10.3934/fods.2019013","url":null,"abstract":"Dynamic interaction networks frequently arise in biology, communications technology and the social sciences, representing, for example, neuronal connectivity in the brain, internet connections between computers and human interactions within social networks. The evolution and strengthening of the links in such networks can be observed through sequences of connection events occurring between network nodes over time. In some of these applications, the identity and size of the network may be unknown a priori and may change over time. In this article, a model for the evolution of dynamic networks based on the Pitman-Yor process is proposed. This model explicitly admits power-laws in the number of connections on each edge, often present in real world networks, and, for careful choices of the parameters, power-laws for the degree distribution of the nodes. A novel empirical method for the estimation of the hyperparameters of the Pitman-Yor process is proposed, and some necessary corrections for uniform discrete base distributions are carefully addressed. The methodology is tested on synthetic data and in an anomaly detection study on the enterprise computer network of the Los Alamos National Laboratory, and successfully detects connections from a red-team penetration test.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48066324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article we consider Bayesian inference for partially observed Andersson-Madigan-Perlman (AMP) Gaussian chain graph (CG) models. Such models are of particular interest in applications such as biological networks and financial time series. The model itself features a variety of constraints which make both prior modeling and computational inference challenging. We develop a framework for the aforementioned challenges, using a sequential Monte Carlo (SMC) method for statistical inference. Our approach is illustrated on both simulated data as well as real case studies from university graduation rates and a pharmacokinetics study.
{"title":"Bayesian inference for latent chain graphs","authors":"Deng Lu, M. Iorio, A. Jasra, G. Rosner","doi":"10.3934/fods.2020003","DOIUrl":"https://doi.org/10.3934/fods.2020003","url":null,"abstract":"In this article we consider Bayesian inference for partially observed Andersson-Madigan-Perlman (AMP) Gaussian chain graph (CG) models. Such models are of particular interest in applications such as biological networks and financial time series. The model itself features a variety of constraints which make both prior modeling and computational inference challenging. We develop a framework for the aforementioned challenges, using a sequential Monte Carlo (SMC) method for statistical inference. Our approach is illustrated on both simulated data as well as real case studies from university graduation rates and a pharmacokinetics study.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46556258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Persistent homology is a tool within topological data analysis to detect different dimensional holes in a dataset. The boundaries of the empty territories (i.e., holes) are not well-defined and each has multiple representations. The proposed method, Empty Territory (EmT), provides representations of different dimensional holes with a specified level of complexity of the territory boundary. EmT is designed for the setting where persistent homology uses a Vietoris-Rips complex filtration, and works as a post-analysis to refine the hole representation of the persistent homology algorithm. In particular, EmT uses alpha shapes to obtain a special class of representations that captures the empty territories with a complexity determined by the size of the alpha balls. With a fixed complexity, EmT returns the representation that contains the most points within the special class of representations. This method is limited to finding 1D holes in 2D data and 2D holes in 3D data, and is illustrated on simulation datasets of a homogeneous Poisson point process in 2D and a uniform sampling in 3D. Furthermore, the method is applied to a 2D cell tower location geography dataset and 3D Sloan Digital Sky Survey (SDSS) galaxy dataset, where it works well in capturing the empty territories.
{"title":"EmT: Locating empty territories of homology group generators in a dataset","authors":"Xin Xu, J. Cisewski-Kehe","doi":"10.3934/FODS.2019010","DOIUrl":"https://doi.org/10.3934/FODS.2019010","url":null,"abstract":"Persistent homology is a tool within topological data analysis to detect different dimensional holes in a dataset. The boundaries of the empty territories (i.e., holes) are not well-defined and each has multiple representations. The proposed method, Empty Territory (EmT), provides representations of different dimensional holes with a specified level of complexity of the territory boundary. EmT is designed for the setting where persistent homology uses a Vietoris-Rips complex filtration, and works as a post-analysis to refine the hole representation of the persistent homology algorithm. In particular, EmT uses alpha shapes to obtain a special class of representations that captures the empty territories with a complexity determined by the size of the alpha balls. With a fixed complexity, EmT returns the representation that contains the most points within the special class of representations. This method is limited to finding 1D holes in 2D data and 2D holes in 3D data, and is illustrated on simulation datasets of a homogeneous Poisson point process in 2D and a uniform sampling in 3D. Furthermore, the method is applied to a 2D cell tower location geography dataset and 3D Sloan Digital Sky Survey (SDSS) galaxy dataset, where it works well in capturing the empty territories.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42374169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The sex ratio at birth (SRB) has risen in India and reaches well beyond the levels under normal circumstances since the 1970s. The lasting imbalanced SRB has resulted in much more males than females in India. A population with severely distorted sex ratio is more likely to have prolonged struggle for stability and sustainability. It is crucial to estimate SRB and its imbalance for India on state level and assess the uncertainty around estimates. We develop a Bayesian model to estimate SRB in India from 1990 to 2016 for 29 states and union territories. Our analyses are based on a comprehensive database on state-level SRB with data from the sample registration system, census and Demographic and Health Surveys. The SRB varies greatly across Indian states and union territories in 2016: ranging from 1.026 (95% uncertainty interval [0.971; 1.087]) in Mizoram to 1.181 [1.143; 1.128] in Haryana. We identify 18 states and union territories with imbalanced SRB during 1990–2016, resulting in 14.9 [13.2; 16.5] million of missing female births in India. Uttar Pradesh has the largest share of the missing female births among all states and union territories, taking up to 32.8% [29.5%; 36.3%] of the total number.
{"title":"Levels and trends in the sex ratio at birth and missing female births for 29 states and union territories in India 1990–2016: A Bayesian modeling study","authors":"Fengqing Chao, A. Yadav","doi":"10.3934/FODS.2019008","DOIUrl":"https://doi.org/10.3934/FODS.2019008","url":null,"abstract":"The sex ratio at birth (SRB) has risen in India and reaches well beyond the levels under normal circumstances since the 1970s. The lasting imbalanced SRB has resulted in much more males than females in India. A population with severely distorted sex ratio is more likely to have prolonged struggle for stability and sustainability. It is crucial to estimate SRB and its imbalance for India on state level and assess the uncertainty around estimates. We develop a Bayesian model to estimate SRB in India from 1990 to 2016 for 29 states and union territories. Our analyses are based on a comprehensive database on state-level SRB with data from the sample registration system, census and Demographic and Health Surveys. The SRB varies greatly across Indian states and union territories in 2016: ranging from 1.026 (95% uncertainty interval [0.971; 1.087]) in Mizoram to 1.181 [1.143; 1.128] in Haryana. We identify 18 states and union territories with imbalanced SRB during 1990–2016, resulting in 14.9 [13.2; 16.5] million of missing female births in India. Uttar Pradesh has the largest share of the missing female births among all states and union territories, taking up to 32.8% [29.5%; 36.3%] of the total number.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47172022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We study the use of power weighted shortest path distance functions for clustering high dimensional Euclidean data, under the assumption that the data is drawn from a collection of disjoint low dimensional manifolds. We argue, theoretically and experimentally, that this leads to higher clustering accuracy. We also present a fast algorithm for computing these distances.
{"title":"Power weighted shortest paths for clustering Euclidean data","authors":"Daniel Mckenzie, S. Damelin","doi":"10.3934/fods.2019014","DOIUrl":"https://doi.org/10.3934/fods.2019014","url":null,"abstract":"We study the use of power weighted shortest path distance functions for clustering high dimensional Euclidean data, under the assumption that the data is drawn from a collection of disjoint low dimensional manifolds. We argue, theoretically and experimentally, that this leads to higher clustering accuracy. We also present a fast algorithm for computing these distances.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"70247788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A wide array of machine learning problems are formulated as the minimization of the expectation of a convex loss function on some parameter space. Since the probability distribution of the data of interest is usually unknown, it is is often estimated from training sets, which may lead to poor out-of-sample performance. In this work, we bring new insights in this problem by using the framework which has been developed in quantitative finance for risk measures. We show that the original min-max problem can be recast as a convex minimization problem under suitable assumptions. We discuss several important examples of robust formulations, in particular by defining ambiguity sets based on $varphi$-divergences and the Wasserstein metric.We also propose an efficient algorithm for solving the corresponding convex optimization problems involving complex convex constraints. Through simulation examples, we demonstrate that this algorithm scales well on real data sets.
{"title":"General risk measures for robust machine learning","authors":"É. Chouzenoux, Henri G'erard, J. Pesquet","doi":"10.3934/fods.2019011","DOIUrl":"https://doi.org/10.3934/fods.2019011","url":null,"abstract":"A wide array of machine learning problems are formulated as the minimization of the expectation of a convex loss function on some parameter space. Since the probability distribution of the data of interest is usually unknown, it is is often estimated from training sets, which may lead to poor out-of-sample performance. In this work, we bring new insights in this problem by using the framework which has been developed in quantitative finance for risk measures. We show that the original min-max problem can be recast as a convex minimization problem under suitable assumptions. We discuss several important examples of robust formulations, in particular by defining ambiguity sets based on $varphi$-divergences and the Wasserstein metric.We also propose an efficient algorithm for solving the corresponding convex optimization problems involving complex convex constraints. Through simulation examples, we demonstrate that this algorithm scales well on real data sets.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43459497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The problem of estimating certain distributions over {0, 1}d is considered here. The distribution represents a quantum system of d qubits, where there are non-trivial dependencies between the qubits. A maximum entropy approach is adopted to reconstruct the distribution from exact moments or observed empirical moments. The Robbins Monro algorithm is used to solve the intractable maximum entropy problem, by constructing an unbiased estimator of the un-normalized target with a sequential Monte Carlo sampler at each iteration. In the case of empirical moments, this coincides with a maximum likelihood estimator. A Bayesian formulation is also considered in order to quantify uncertainty a posteriori. Several approaches are proposed in order to tackle this challenging problem, based on recently developed methodologies. In particular, unbiased estimators of the gradient of the log posterior are constructed and used within a provably convergent Langevin-based Markov chain Monte Carlo method. The methods are illustrated on classically simulated output from quantum simulators.
{"title":"Estimation and uncertainty quantification for the output from quantum simulators","authors":"R. Bennink, A. Jasra, K. Law, P. Lougovski","doi":"10.3934/FODS.2019007","DOIUrl":"https://doi.org/10.3934/FODS.2019007","url":null,"abstract":"The problem of estimating certain distributions over {0, 1}d is considered here. The distribution represents a quantum system of d qubits, where there are non-trivial dependencies between the qubits. A maximum entropy approach is adopted to reconstruct the distribution from exact moments or observed empirical moments. The Robbins Monro algorithm is used to solve the intractable maximum entropy problem, by constructing an unbiased estimator of the un-normalized target with a sequential Monte Carlo sampler at each iteration. In the case of empirical moments, this coincides with a maximum likelihood estimator. A Bayesian formulation is also considered in order to quantify uncertainty a posteriori. Several approaches are proposed in order to tackle this challenging problem, based on recently developed methodologies. In particular, unbiased estimators of the gradient of the log posterior are constructed and used within a provably convergent Langevin-based Markov chain Monte Carlo method. The methods are illustrated on classically simulated output from quantum simulators.","PeriodicalId":73054,"journal":{"name":"Foundations of data science (Springfield, Mo.)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42733584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}