Alex Glyn-Davies, Connor Duffin, Ieva Kazlauskaite, Mark Girolami, Ö. Deniz Akyildiz
In this paper, we develop a class of interacting particle Langevin algorithms to solve inverse problems for partial differential equations (PDEs). In particular, we leverage the statistical finite elements (statFEM) formulation to obtain a finite-dimensional latent variable statistical model, where the parameter is that of the (discretised) forward map and the latent variable is the statFEM solution of the PDE, which is assumed to be partially observed. We then adapt a recently proposed expectation-maximisation-like scheme, the interacting particle Langevin algorithm (IPLA), to this problem and obtain a joint estimation procedure for the parameters and the latent variables. We consider three main examples: (i) estimating the forcing for a linear Poisson PDE, (ii) estimating the forcing for a nonlinear Poisson PDE, and (iii) estimating the diffusivity for a linear Poisson PDE. We provide computational complexity estimates for forcing estimation in the linear case. We also provide comprehensive numerical experiments and preconditioning strategies that significantly improve performance, showing that the proposed class of methods can be the method of choice for parameter inference in PDE models.
{"title":"Statistical Finite Elements via Interacting Particle Langevin Dynamics","authors":"Alex Glyn-Davies, Connor Duffin, Ieva Kazlauskaite, Mark Girolami, Ö. Deniz Akyildiz","doi":"arxiv-2409.07101","DOIUrl":"https://doi.org/arxiv-2409.07101","url":null,"abstract":"In this paper, we develop a class of interacting particle Langevin algorithms\u0000to solve inverse problems for partial differential equations (PDEs). In\u0000particular, we leverage the statistical finite elements (statFEM) formulation\u0000to obtain a finite-dimensional latent variable statistical model where the\u0000parameter is that of the (discretised) forward map and the latent variable is\u0000the statFEM solution of the PDE which is assumed to be partially observed. We\u0000then adapt a recently proposed expectation-maximisation like scheme,\u0000interacting particle Langevin algorithm (IPLA), for this problem and obtain a\u0000joint estimation procedure for the parameters and the latent variables. We\u0000consider three main examples: (i) estimating the forcing for linear Poisson\u0000PDE, (ii) estimating the forcing for nonlinear Poisson PDE, and (iii)\u0000estimating diffusivity for linear Poisson PDE. We provide computational\u0000complexity estimates for forcing estimation in the linear case. 
We also provide\u0000comprehensive numerical experiments and preconditioning strategies that\u0000significantly improve the performance, showing that the proposed class of\u0000methods can be the choice for parameter inference in PDE models.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
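The IPLA scheme alternates Langevin updates for the latent-variable particles with a noisy gradient step for the parameter, where the parameter noise shrinks with the particle count. A minimal NumPy sketch on a toy linear-Gaussian latent model (an illustrative stand-in, not the paper's statFEM/PDE setting):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model: x_j ~ N(theta, 1), y_j ~ N(x_j, sigma2).
# Marginally y_j ~ N(theta, 1 + sigma2), so the maximum marginal
# likelihood estimate of theta is the sample mean of y.
sigma2 = 0.5
theta_true = 2.0
d = 50                        # number of observations / latent variables
y = theta_true + rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=d)

N = 100                       # number of interacting particles
gamma = 0.01                  # step size
X = rng.normal(size=(N, d))   # latent-variable particles
theta = 0.0

for _ in range(2000):
    # Langevin step for the latent particles: grad_x log p(theta, x, y)
    grad_x = (y - X) / sigma2 - (X - theta)
    X = X + gamma * grad_x + np.sqrt(2 * gamma) * rng.normal(size=(N, d))
    # Parameter step: particle-averaged grad_theta, with O(1/sqrt(N)) noise
    grad_theta = (X - theta).sum(axis=1).mean()
    theta = theta + gamma * grad_theta + np.sqrt(2 * gamma / N) * rng.normal()
```

As N grows, the noise scale sqrt(2*gamma/N) on the parameter update vanishes and theta concentrates near the maximum marginal likelihood estimate, here the sample mean of y.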
As networks continue to increase in size, current methods must be capable of handling large numbers of nodes and edges in order to be practically relevant. Instead of working directly with the entire (large) network, analyzing sub-networks has become a popular approach. Due to a network's inherent inter-connectedness, sub-sampling is not a trivial task. While this problem has gained attention in recent years, it has received comparatively little attention from the statistics community. In this work, we provide a thorough comparison of seven graph sub-sampling algorithms by applying them to divide-and-conquer algorithms for community structure and core-periphery (CP) structure. After discussing the various algorithms and sub-sampling routines, we derive theoretical results for the mis-classification rate of the divide-and-conquer algorithm for CP structure under various sub-sampling schemes. We then perform extensive experiments on both simulated and real-world data to compare the various methods. For the community detection task, we found that sampling nodes uniformly at random yields the best performance. For CP structure, on the other hand, there was no single winner, but algorithms which sampled core nodes at a higher rate consistently outperformed other sampling routines, e.g., random edge sampling and random walk sampling. The varying performance of the sampling algorithms on different tasks demonstrates the importance of carefully selecting a sub-sampling routine for the specific application.
{"title":"Graph sub-sampling for divide-and-conquer algorithms in large networks","authors":"Eric Yanchenko","doi":"arxiv-2409.06994","DOIUrl":"https://doi.org/arxiv-2409.06994","url":null,"abstract":"As networks continue to increase in size, current methods must be capable of\u0000handling large numbers of nodes and edges in order to be practically relevant.\u0000Instead of working directly with the entire (large) network, analyzing\u0000sub-networks has become a popular approach. Due to a network's inherent\u0000inter-connectedness, sub-sampling is not a trivial task. While this problem has\u0000gained attention in recent years, it has not received sufficient attention from\u0000the statistics community. In this work, we provide a thorough comparison of\u0000seven graph sub-sampling algorithms by applying them to divide-and-conquer\u0000algorithms for community structure and core-periphery (CP) structure. After\u0000discussing the various algorithms and sub-sampling routines, we derive\u0000theoretical results for the mis-classification rate of the divide-and-conquer\u0000algorithm for CP structure under various sub-sampling schemes. We then perform\u0000extensive experiments on both simulated and real-world data to compare the\u0000various methods. For the community detection task, we found that sampling nodes\u0000uniformly at random yields the best performance. For CP structure on the other\u0000hand, there was no single winner, but algorithms which sampled core nodes at a\u0000higher rate consistently outperformed other sampling routines, e.g., random\u0000edge sampling and random walk sampling. 
The varying performance of the sampling\u0000algorithms on different tasks demonstrates the importance of carefully\u0000selecting a sub-sampling routine for the specific application.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
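The uniform node sampling routine that performed best for community detection is simple to state: draw k nodes uniformly without replacement and keep the induced subgraph. A minimal sketch with adjacency stored as a dict of neighbour sets (the toy path graph is illustrative, not one of the paper's benchmarks):

```python
import random

def uniform_node_sample(adj, k, seed=None):
    """Induced subgraph on k nodes drawn uniformly at random without replacement.
    adj: dict mapping node -> set of neighbours."""
    rng = random.Random(seed)
    kept = set(rng.sample(sorted(adj), k))
    # Keep only edges whose both endpoints were sampled
    return {u: adj[u] & kept for u in kept}

# Tiny example: a 6-node path graph 0-1-2-3-4-5
adj = {i: set() for i in range(6)}
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    adj[u].add(v)
    adj[v].add(u)

sub = uniform_node_sample(adj, 3, seed=1)
```

A divide-and-conquer pipeline would run the community-detection or CP algorithm on several such induced subgraphs and then stitch the labels together.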
Causal discovery is designed to identify causal relationships in data, a task that has become increasingly complex due to the computational demands of traditional methods such as VarLiNGAM, which combines a Vector Autoregressive Model with a Linear Non-Gaussian Acyclic Model for time series data. This study is dedicated to optimising causal discovery specifically for time series data, which is common in practical applications. Time series causal discovery is particularly challenging due to the need to account for temporal dependencies and potential time lag effects. By designing a specialised dataset generator and reducing the computational complexity of the VarLiNGAM model from O(m^3 · n) to O(m^3 + m^2 · n), this study significantly improves the feasibility of processing large datasets. The proposed methods have been validated on advanced computational platforms and tested across simulated, real-world, and large-scale datasets, showcasing enhanced efficiency and performance. The optimised algorithm achieved a 7- to 13-fold speedup over the original algorithm and around a 4.5-fold speedup over the GPU-accelerated version on large-scale datasets with feature sizes between 200 and 400. Our methods aim to push the boundaries of current causal discovery capabilities, making them more robust, scalable, and applicable to real-world scenarios, thus facilitating breakthroughs in various fields such as healthcare and finance.
{"title":"Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal Discovery","authors":"Ziyang Jiao, Ce Guo, Wayne Luk","doi":"arxiv-2409.05500","DOIUrl":"https://doi.org/arxiv-2409.05500","url":null,"abstract":"Causal discovery is designed to identify causal relationships in data, a task\u0000that has become increasingly complex due to the computational demands of\u0000traditional methods such as VarLiNGAM, which combines Vector Autoregressive\u0000Model with Linear Non-Gaussian Acyclic Model for time series data. This study is dedicated to optimising causal discovery specifically for time\u0000series data, which is common in practical applications. Time series causal\u0000discovery is particularly challenging due to the need to account for temporal\u0000dependencies and potential time lag effects. By designing a specialised dataset\u0000generator and reducing the computational complexity of the VarLiNGAM model from\u0000( O(m^3 cdot n) ) to ( O(m^3 + m^2 cdot n) ), this study significantly\u0000improves the feasibility of processing large datasets. The proposed methods\u0000have been validated on advanced computational platforms and tested across\u0000simulated, real-world, and large-scale datasets, showcasing enhanced efficiency\u0000and performance. The optimised algorithm achieved 7 to 13 times speedup\u0000compared with the original algorithm and around 4.5 times speedup compared with\u0000the GPU-accelerated version on large-scale datasets with feature sizes between\u0000200 and 400. 
Our methods aim to push the boundaries of current causal discovery\u0000capabilities, making them more robust, scalable, and applicable to real-world\u0000scenarios, thus facilitating breakthroughs in various fields such as healthcare\u0000and finance.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
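VarLiNGAM's first stage fits a VAR model and passes the residuals to LiNGAM; the complexity reduction above targets this pipeline. A least-squares sketch of the VAR stage only (not the authors' optimised implementation; the demo data are synthetic):

```python
import numpy as np

def var_residuals(X, p=1):
    """Least-squares VAR(p) fit: returns lag-coefficient matrix and residuals.
    X: (n, m) multivariate time series; the residuals feed into LiNGAM."""
    n, m = X.shape
    Y = X[p:]
    # Stack lagged blocks: lag 1, lag 2, ..., lag p as regressors
    Z = np.hstack([X[p - l - 1:n - l - 1] for l in range(p)])
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return B.T, Y - Z @ B

# Demo on a synthetic VAR(1): x_t = A x_{t-1} + noise
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
X = np.zeros((500, 2))
for t in range(1, 500):
    X[t] = A @ X[t - 1] + 0.1 * rng.normal(size=2)

coef, resid = var_residuals(X)   # coef should recover A approximately
```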
Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers
In differential privacy (DP) mechanisms, it can be beneficial to release "redundant" outputs, in the sense that a quantity can be estimated by combining different combinations of privatized values. Indeed, this structure is present in the DP 2020 Decennial Census products published by the U.S. Census Bureau. With this structure, the DP output can be improved by enforcing self-consistency (i.e., estimators obtained by combining different values result in the same estimate), and we show that the minimum variance processing is a linear projection. However, standard projection algorithms are too computationally expensive in terms of both memory and execution time for applications such as the Decennial Census. We propose the Scalable Efficient Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two-step process of aggregation and differencing that 1) enforces self-consistency through a linear and unbiased procedure, 2) is computationally and memory efficient, 3) achieves the minimum variance solution under certain structural assumptions, and 4) is empirically shown to be robust to violations of these structural assumptions. We propose three methods of calculating confidence intervals from our estimates, under various assumptions. We apply SEA BLUE to two 2010 Census demonstration products, illustrating its scalability and validity.
{"title":"Best Linear Unbiased Estimate from Privatized Histograms","authors":"Jordan Awan, Adam Edwards, Paul Bartholomew, Andrew Sillers","doi":"arxiv-2409.04387","DOIUrl":"https://doi.org/arxiv-2409.04387","url":null,"abstract":"In differential privacy (DP) mechanisms, it can be beneficial to release\u0000\"redundant\" outputs, in the sense that a quantity can be estimated by combining\u0000different combinations of privatized values. Indeed, this structure is present\u0000in the DP 2020 Decennial Census products published by the U.S. Census Bureau.\u0000With this structure, the DP output can be improved by enforcing\u0000self-consistency (i.e., estimators obtained by combining different values\u0000result in the same estimate) and we show that the minimum variance processing\u0000is a linear projection. However, standard projection algorithms are too\u0000computationally expensive in terms of both memory and execution time for\u0000applications such as the Decennial Census. We propose the Scalable Efficient\u0000Algorithm for Best Linear Unbiased Estimate (SEA BLUE), based on a two step\u0000process of aggregation and differencing that 1) enforces self-consistency\u0000through a linear and unbiased procedure, 2) is computationally and memory\u0000efficient, 3) achieves the minimum variance solution under certain structural\u0000assumptions, and 4) is empirically shown to be robust to violations of these\u0000structural assumptions. We propose three methods of calculating confidence\u0000intervals from our estimates, under various assumptions. 
We apply SEA BLUE to\u0000two 2010 Census demonstration products, illustrating its scalability and\u0000validity.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
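The self-consistency idea can be seen on a toy histogram with redundant releases: two leaf counts and their total, each privatized with independent noise of equal variance. The minimum-variance self-consistent estimate is then the least-squares projection onto the column span of the aggregation matrix (a small dense stand-in for the paper's scalable aggregation-and-differencing scheme):

```python
import numpy as np

# Rows of A: leaf 1, leaf 2, total = leaf 1 + leaf 2
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def blue(y, A):
    """Project noisy releases y onto consistent estimates (equal noise variances)."""
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_hat, A @ x_hat   # underlying counts and self-consistent releases

y = np.array([10.0, 20.0, 33.0])   # inconsistent: leaves sum to 30, total says 33
x_hat, y_proj = blue(y, A)
```

With unequal noise variances this becomes a weighted least-squares projection; the equal-variance case keeps the example minimal.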
In this paper, we propose an approach to Bayesian optimization based on Sequential Monte Carlo (SMC) and concepts from the statistical physics of classical systems. Our method leverages the power of modern machine learning libraries such as NumPyro and JAX, allowing us to perform Bayesian optimization on multiple platforms, including CPUs, GPUs, and TPUs, and in parallel. Our approach lowers the barrier to entry for exploring these methods while maintaining high performance. We present a promising direction for developing more efficient and effective techniques for a wide range of optimization problems in diverse fields.
{"title":"A Bayesian Optimization through Sequential Monte Carlo and Statistical Physics-Inspired Techniques","authors":"Anton Lebedev, Thomas Warford, M. Emre Şahin","doi":"arxiv-2409.03094","DOIUrl":"https://doi.org/arxiv-2409.03094","url":null,"abstract":"In this paper, we propose an approach for an application of Bayesian\u0000optimization using Sequential Monte Carlo (SMC) and concepts from the\u0000statistical physics of classical systems. Our method leverages the power of\u0000modern machine learning libraries such as NumPyro and JAX, allowing us to\u0000perform Bayesian optimization on multiple platforms, including CPUs, GPUs,\u0000TPUs, and in parallel. Our approach enables a low entry level for exploration\u0000of the methods while maintaining high performance. We present a promising\u0000direction for developing more efficient and effective techniques for a wide\u0000range of optimization problems in diverse fields.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Here, we introduce a novel framework for modelling the spatiotemporal dynamics of disease spread, known as conditional logistic individual-level models (CL-ILMs). This framework alleviates much of the computational burden associated with traditional spatiotemporal individual-level models for epidemics, and facilitates the use of standard software for fitting logistic models when analysing spatiotemporal disease patterns. The models can be fitted in either a frequentist or Bayesian framework. We apply the new spatial CL-ILM to both simulated and semi-real data from the UK 2001 foot-and-mouth disease epidemic.
{"title":"Conditional logistic individual-level models of spatial infectious disease dynamics","authors":"Tahmina Akter, Rob Deardon","doi":"arxiv-2409.02353","DOIUrl":"https://doi.org/arxiv-2409.02353","url":null,"abstract":"Here, we introduce a novel framework for modelling the spatiotemporal\u0000dynamics of disease spread known as conditional logistic individual-level\u0000models (CL-ILM's). This framework alleviates much of the computational burden\u0000associated with traditional spatiotemporal individual-level models for\u0000epidemics, and facilitates the use of standard software for fitting logistic\u0000models when analysing spatiotemporal disease patterns. The models can be fitted\u0000in either a frequentist or Bayesian framework. Here, we apply the new spatial\u0000CL-ILM to both simulated and semi-real data from the UK 2001 foot-and-mouth\u0000disease epidemic.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The particle filter (PF), also known as sequential Monte Carlo (SMC), is designed to approximate high-dimensional probability distributions and their normalizing constants in the discrete-time setting. To reduce the variance of the Monte Carlo approximation, several twisted particle filters (TPFs) have been proposed, in which one chooses or learns a twisting function that modifies the Markov transition kernel. In this paper, we study the TPF from a continuous-time perspective. Under suitable settings, we show that the discrete-time model converges to a continuous-time limit, which can be solved through a series of well-studied control-based importance sampling algorithms. This discrete-continuous connection allows the design of new TPF algorithms inspired by established continuous-time algorithms. As a concrete example, guided by existing importance sampling algorithms in the continuous-time setting, we propose a novel algorithm called the "Twisted-Path Particle Filter" (TPPF), where the twist function, parameterized by neural networks, minimizes a specific KL divergence between path measures. Some numerical experiments are given to illustrate the capability of the proposed algorithm.
{"title":"Guidance for twisted particle filter: a continuous-time perspective","authors":"Jianfeng Lu, Yuliang Wang","doi":"arxiv-2409.02399","DOIUrl":"https://doi.org/arxiv-2409.02399","url":null,"abstract":"The particle filter (PF), also known as the sequential Monte Carlo (SMC), is\u0000designed to approximate high-dimensional probability distributions and their\u0000normalizing constants in the discrete-time setting. To reduce the variance of\u0000the Monte Carlo approximation, several twisted particle filters (TPF) have been\u0000proposed by researchers, where one chooses or learns a twisting function that\u0000modifies the Markov transition kernel. In this paper, we study the TPF from a\u0000continuous-time perspective. Under suitable settings, we show that the\u0000discrete-time model converges to a continuous-time limit, which can be solved\u0000through a series of well-studied control-based importance sampling algorithms.\u0000This discrete-continuous connection allows the design of new TPF algorithms\u0000inspired by established continuous-time algorithms. As a concrete example,\u0000guided by existing importance sampling algorithms in the continuous-time\u0000setting, we propose a novel algorithm called ``Twisted-Path Particle Filter\"\u0000(TPPF), where the twist function, parameterized by neural networks, minimizes\u0000specific KL-divergence between path measures. 
Some numerical experiments are\u0000given to illustrate the capability of the proposed algorithm.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
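For reference, the discrete-time baseline being twisted is the bootstrap particle filter, which also yields an estimate of the normalizing constant. A NumPy sketch on a toy linear-Gaussian state-space model (the twisting function itself, learned with neural networks in the paper, is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: x_t = a x_{t-1} + state noise, y_t = x_t + observation noise
a, sx, sy = 0.9, 0.5, 0.5
T, N = 200, 1000
x_true = np.zeros(T)
for t in range(1, T):
    x_true[t] = a * x_true[t - 1] + sx * rng.normal()
y = x_true + sy * rng.normal(size=T)

particles = rng.normal(size=N)   # draws from a diffuse initial law
log_Z = 0.0                      # running log normalizing-constant estimate
filt_mean = np.zeros(T)
for t in range(T):
    # Weight by the Gaussian observation likelihood (log-space for stability)
    logw = -0.5 * ((y[t] - particles) / sy) ** 2 - np.log(sy * np.sqrt(2 * np.pi))
    m = logw.max()
    log_Z += np.log(np.mean(np.exp(logw - m))) + m
    w = np.exp(logw - m)
    w /= w.sum()
    filt_mean[t] = np.sum(w * particles)
    particles = particles[rng.choice(N, size=N, p=w)]    # multinomial resampling
    particles = a * particles + sx * rng.normal(size=N)  # propagate dynamics
```

Twisting modifies the transition kernel and weights so that the same log_Z estimator has lower variance; the bootstrap filter above is the untwisted special case.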
Hidden Markov models (HMMs) describe a sequence of observations that depend on a hidden (or latent) state following a Markov chain. These models are widely used in diverse fields including ecology, speech recognition, and genetics. Parameter estimation in HMMs is typically performed using the Baum-Welch algorithm, a special case of the Expectation-Maximisation (EM) algorithm. While this method guarantees convergence to a local maximum, its convergence rate is usually slow. Alternative methods, such as direct maximisation of the likelihood using quasi-Newton methods (such as L-BFGS-B), can offer faster convergence but can be more complicated to implement because of bounds on the parameter space. We propose a novel hybrid algorithm, QNEM, that combines the Baum-Welch and quasi-Newton algorithms. QNEM aims to leverage the strengths of both algorithms by switching from one method to the other based on the convexity of the likelihood function. We conducted a comparative analysis of QNEM, the Baum-Welch algorithm, an EM acceleration algorithm called SQUAREM (Varadhan, 2008, Scand J Statist), and the L-BFGS-B quasi-Newton method, applying these algorithms to four examples built on different models. We estimated the parameters of each model using the different algorithms and evaluated their performance. Our results show that the best-performing algorithm depends on the model considered. QNEM performs well overall, always being faster than or equivalent to L-BFGS-B. The Baum-Welch and SQUAREM algorithms are faster than the quasi-Newton and QNEM algorithms in certain scenarios with multiple optima. In conclusion, QNEM offers a promising alternative to existing algorithms.
{"title":"Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm","authors":"Sidonie FoulonCESP, NeuroDiderot, Thérèse TruongCESP, Anne-Louise LeuteneggerNeuroDiderot, Hervé PerdryCESP","doi":"arxiv-2409.02477","DOIUrl":"https://doi.org/arxiv-2409.02477","url":null,"abstract":"Hidden Markov Models (HMM) model a sequence of observations that are\u0000dependent on a hidden (or latent) state that follow a Markov chain. These\u0000models are widely used in diverse fields including ecology, speech recognition,\u0000and genetics.Parameter estimation in HMM is typically performed using the\u0000Baum-Welch algorithm, a special case of the Expectation-Maximisation (EM)\u0000algorithm. While this method guarantee the convergence to a local maximum, its\u0000convergence rates is usually slow.Alternative methods, such as the direct\u0000maximisation of the likelihood using quasi-Newton methods (such as L-BFGS-B)\u0000can offer faster convergence but can be more complicated to implement due to\u0000challenges to deal with the presence of bounds on the space of parameters.We\u0000propose a novel hybrid algorithm, QNEM, that combines the Baum-Welch and the\u0000quasi-Newton algorithms. QNEM aims to leverage the strength of both algorithms\u0000by switching from one method to the other based on the convexity of the\u0000likelihood function.We conducted a comparative analysis between QNEM, the\u0000Baum-Welch algorithm, an EM acceleration algorithm called SQUAREM (Varadhan,\u00002008, Scand J Statist), and the L-BFGS-B quasi-Newton method by applying these\u0000algorithms to four examples built on different models. We estimated the\u0000parameters of each model using the different algorithms and evaluated their\u0000performances.Our results show that the best-performing algorithm depends on the\u0000model considered. QNEM performs well overall, always being faster or equivalent\u0000to L-BFGS-B. 
The Baum-Welch and SQUAREM algorithms are faster than the\u0000quasi-Newton and QNEM algorithms in certain scenarios with multiple optimum. In\u0000conclusion, QNEM offers a promising alternative to existing algorithms.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
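The switching idea can be illustrated on a model far simpler than an HMM: a two-component Gaussian mixture with one unknown mean, where the EM step is closed-form and the quasi-Newton stage is SciPy's L-BFGS-B. The stalling threshold and the model are illustrative; QNEM's actual switch is driven by the convexity of the HMM likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Data from 0.5 * N(0, 1) + 0.5 * N(3, 1); only the second mean is unknown.
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(3, 1, 500)])

def nll(mu):
    """Negative log-likelihood of the mixture as a function of the unknown mean."""
    return -np.sum(np.log(0.5 * np.exp(-0.5 * x**2)
                          + 0.5 * np.exp(-0.5 * (x - mu)**2)))

# EM phase: closed-form updates until successive iterates stall
mu = 1.0
for _ in range(100):
    resp = 1 / (1 + np.exp(-(x * mu - 0.5 * mu**2)))  # responsibilities, comp. 2
    mu_new = np.sum(resp * x) / np.sum(resp)          # closed-form M-step
    stalled = abs(mu_new - mu) < 1e-3
    mu = mu_new
    if stalled:
        break

# Quasi-Newton phase: hand over to bounded L-BFGS-B, as in the hybrid scheme
res = minimize(nll, x0=[mu], method="L-BFGS-B", bounds=[(-10.0, 10.0)])
mu = res.x[0]
```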
Dataset distillation (DD) is an increasingly important technique for constructing a synthetic dataset that captures the core information in training data, so that models trained on the synthetic set achieve performance comparable to models trained on the original. While DD has a wide range of applications, the theory supporting it is less well developed. New methods of DD are compared on a common set of benchmarks, rather than oriented towards any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We analyze existing DD methods through this broader lens, highlighting their strengths and limitations in terms of accuracy and faithfulness to optimal DD operation. Finally, we present numerical results for two case studies important in contemporary settings. First, we address a critical challenge in medical data analysis: merging knowledge from different datasets composed of intersecting, but not identical, sets of features, in order to construct a larger dataset in what is usually a small-sample setting. Second, we consider out-of-distribution error across boundary conditions for physics-informed neural networks (PINNs), showing the potential for DD to provide more physically faithful data. By establishing this general formulation of DD, we aim to found a new research paradigm through which DD can be understood and from which new DD techniques can arise.
{"title":"Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning","authors":"Vyacheslav Kungurtsev, Yuanfang Peng, Jianyang Gu, Saeed Vahidian, Anthony Quinn, Fadwa Idlahcen, Yiran Chen","doi":"arxiv-2409.01410","DOIUrl":"https://doi.org/arxiv-2409.01410","url":null,"abstract":"Dataset distillation (DD) is an increasingly important technique that focuses\u0000on constructing a synthetic dataset capable of capturing the core information\u0000in training data to achieve comparable performance in models trained on the\u0000latter. While DD has a wide range of applications, the theory supporting it is\u0000less well evolved. New methods of DD are compared on a common set of\u0000benchmarks, rather than oriented towards any particular learning task. In this\u0000work, we present a formal model of DD, arguing that a precise characterization\u0000of the underlying optimization problem must specify the inference task\u0000associated with the application of interest. Without this task-specific focus,\u0000the DD problem is under-specified, and the selection of a DD algorithm for a\u0000particular task is merely heuristic. Our formalization reveals novel\u0000applications of DD across different modeling environments. We analyze existing\u0000DD methods through this broader lens, highlighting their strengths and\u0000limitations in terms of accuracy and faithfulness to optimal DD operation.\u0000Finally, we present numerical results for two case studies important in\u0000contemporary settings. Firstly, we address a critical challenge in medical data\u0000analysis: merging the knowledge from different datasets composed of\u0000intersecting, but not identical, sets of features, in order to construct a\u0000larger dataset in what is usually a small sample setting. 
Secondly, we consider\u0000out-of-distribution error across boundary conditions for physics-informed\u0000neural networks (PINNs), showing the potential for DD to provide more\u0000physically faithful data. By establishing this general formulation of DD, we\u0000aim to establish a new research paradigm by which DD can be understood and from\u0000which new DD techniques can arise.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142189492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Typical simulation approaches for evaluating the performance of statistical methods on populations embedded in social networks may fail to capture important features of real-world networks. It can therefore be unclear whether inference methods for causal effects under interference that have been shown to perform well in such synthetic networks are applicable to social networks arising in the real world. Plasmode simulation studies use a real dataset created by natural processes, but with part of the data-generation mechanism known. However, given the sensitivity of relational data, many network datasets are protected from unauthorized access or disclosure. In such cases, plasmode simulations cannot use released versions of real datasets, which often omit the network links, and must instead rely on parameters estimated from them. A statistical framework for creating replicated simulation datasets from private social network data is developed and validated. The approach consists of simulating from a parametric exponential family random graph model fitted to the network data and resampling from the observed exposure and covariate distributions to preserve the associations among these variables.
{"title":"Plasmode simulation for the evaluation of causal inference methods in homophilous social networks","authors":"Vanessa McNealis, Erica E. M. Moodie, Nema Dean","doi":"arxiv-2409.01316","DOIUrl":"https://doi.org/arxiv-2409.01316","url":null,"abstract":"Typical simulation approaches for evaluating the performance of statistical\u0000methods on populations embedded in social networks may fail to capture\u0000important features of real-world networks. It can therefore be unclear whether\u0000inference methods for causal effects due to interference that have been shown\u0000to perform well in such synthetic networks are applicable to social networks\u0000which arise in the real world. Plasmode simulation studies use a real dataset\u0000created from natural processes, but with part of the data-generation mechanism\u0000known. However, given the sensitivity of relational data, many network data are\u0000protected from unauthorized access or disclosure. In such case, plasmode\u0000simulations cannot use released versions of real datasets which often omit the\u0000network links, and instead can only rely on parameters estimated from them. A\u0000statistical framework for creating replicated simulation datasets from private\u0000social network data is developed and validated. 
The approach consists of\u0000simulating from a parametric exponential family random graph model fitted to\u0000the network data and resampling from the observed exposure and covariate\u0000distributions to preserve the associations among these variables.","PeriodicalId":501215,"journal":{"name":"arXiv - STAT - Computation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
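For ERGMs with only dyad-independent terms, simulation is straightforward: each edge is an independent Bernoulli draw whose logit is linear in dyadic statistics. A sketch of the replication recipe with hypothetical fitted parameters, a homophily statistic on a binary covariate, and a joint bootstrap of the covariates to preserve their associations (general ERGMs require MCMC simulation instead):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
z = rng.integers(0, 2, size=n)        # binary node covariate (e.g. an exposure)
theta_edge, theta_homo = -3.0, 1.5    # hypothetical fitted ERGM parameters

def simulate_network(z, rng):
    """Dyad-independent ERGM draw: edge logit = theta_edge + theta_homo * 1{z_i == z_j}."""
    same = (z[:, None] == z[None, :]).astype(float)
    p = 1 / (1 + np.exp(-(theta_edge + theta_homo * same)))
    A = (rng.uniform(size=(len(z), len(z))) < p).astype(int)
    A = np.triu(A, 1)                 # keep one draw per dyad, drop self-loops
    return A + A.T                    # symmetric adjacency matrix

# Plasmode replicate: bootstrap covariate rows jointly (preserving their
# associations), then redraw the network from the same fitted model.
z_rep = z[rng.choice(n, size=n, replace=True)]
A_rep = simulate_network(z_rep, rng)
density = A_rep.sum() / (n * (n - 1))
```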