To democratize research with sensitive data, we should make synthetic data more accessible
Erik-Jan van Kesteren. arXiv - STAT - Other Statistics, 2024-04-26. https://doi.org/arxiv-2404.17271

For over 30 years, synthetic data has been heralded as a promising solution to make sensitive datasets accessible. However, despite much research effort and several high-profile use cases, the widespread adoption of synthetic data as a tool for open, accessible, reproducible research with sensitive data is still a distant dream. In this opinion, Erik-Jan van Kesteren, head of the ODISSEI Social Data Science team, argues that to progress towards widespread adoption of synthetic data as a privacy-enhancing technology, the data science research community should shift focus away from developing better synthesis methods: instead, it should develop accessible tools, educate peers, and publish small-scale case studies.
An Investigation into Distance Measures in Cluster Analysis
Zoe Shapcott. arXiv - STAT - Other Statistics, 2024-04-21. https://doi.org/arxiv-2404.13664

This report explores different distance measures that can be used with the $K$-means algorithm for cluster analysis. Specifically, we investigate the Mahalanobis distance and critically assess any benefits it may have over the more traditional Euclidean, Manhattan and Maximum distances. We first define the metrics, then consider their advantages and drawbacks as discussed in the literature. We apply these distances, first to simulated data and then to subsets of the Dry Bean dataset [1], to explore whether any one metric yields detectably better clusters than the others in these cases. One section is devoted to analysing the information obtained from ChatGPT in response to prompts on this topic.
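The four metrics compared in the report can be sketched in a few lines. This is a generic illustration with hypothetical toy points, not the report's own code; with an identity covariance matrix, the Mahalanobis distance reduces to the Euclidean distance, which the example demonstrates:

```python
import numpy as np

def mahalanobis(x, y, vi):
    """Mahalanobis distance, given the inverse covariance matrix vi."""
    d = x - y
    return float(np.sqrt(d @ vi @ d))

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def manhattan(x, y):
    return float(np.abs(x - y).sum())

def maximum(x, y):
    """Maximum (Chebyshev) distance: largest coordinate difference."""
    return float(np.abs(x - y).max())

# Toy points; vi is the inverse of an identity covariance matrix.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
vi = np.linalg.inv(np.eye(2))

print(euclidean(x, y))        # 5.0
print(manhattan(x, y))        # 7.0
print(maximum(x, y))          # 4.0
print(mahalanobis(x, y, vi))  # 5.0 (equals Euclidean under identity covariance)
```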
Seasonal and Periodic Patterns of PM2.5 in Manhattan using the Variable Bandpass Periodic Block Bootstrap
Yanan Sun, Edward Valachovic. arXiv - STAT - Other Statistics, 2024-04-12. https://doi.org/arxiv-2404.08738

Air quality is a critical component of environmental health. Monitoring and analysis of particulate matter with a diameter of 2.5 micrometers or smaller (PM2.5) plays a pivotal role in understanding air quality changes. This study focuses on the application of a new bandpass bootstrap approach, termed the Variable Bandpass Periodic Block Bootstrap (VBPBB), to a time series of modeled predictions of daily mean PM2.5 concentrations over 16 years in Manhattan, New York, United States. The VBPBB can be used to explore periodically correlated (PC) principal components of this daily mean PM2.5 dataset. The method uses bandpass filters to isolate distinct PC components, removing unwanted interference including noise, and bootstraps the PC components. This preserves the PC structure and permits a better understanding of the periodic characteristics of time series data. The results of the VBPBB are compared against outcomes from alternative block bootstrapping techniques. The findings indicate potential trends of elevated PM2.5 levels and provide evidence of significant semi-annual and weekly patterns missed by other methods.
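The first half of the VBPBB idea, isolating one periodically correlated component with a bandpass filter before resampling, can be illustrated with a simplified FFT-based filter on synthetic daily data. This is a hedged sketch, not the authors' implementation; the band edges and the toy series are arbitrary choices made for the example:

```python
import numpy as np

def fft_bandpass(x, low, high, fs=1.0):
    """Keep only frequency content in [low, high] (cycles per sample); zero the rest."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= low) & (freqs <= high)
    return np.fft.irfft(X * mask, n=len(x))

# Daily series: a weekly cycle buried under a larger annual cycle plus noise.
rng = np.random.default_rng(0)
n = 7 * 52 * 4                       # four years of daily observations
t = np.arange(n)
weekly = np.sin(2 * np.pi * t / 7)
annual = 3 * np.sin(2 * np.pi * t / 365.25)
x = weekly + annual + 0.5 * rng.standard_normal(n)

# Isolate the weekly component: pass a narrow band around 1/7 cycles/day.
iso = fft_bandpass(x, 1 / 7 - 0.01, 1 / 7 + 0.01)

# The filtered series should track the true weekly cycle closely.
r = np.corrcoef(iso, weekly)[0, 1]
print(round(r, 2))
```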
Non-Parametric Estimation of Multiple Periodic Components in Turkey's Electricity Consumption
Jie Yao, Edward Valachovic. arXiv - STAT - Other Statistics, 2024-04-04. https://doi.org/arxiv-2404.03786

Electricity generation and consumption are essential components of contemporary living, influencing diverse facets of our daily routines, convenience, and economic progress. There is a high demand for characterizing the periodic patterns of electricity consumption. The Variable Bandpass Periodic Block Bootstrap (VBPBB) employs a bandpass filter aligned to retain the frequency of a periodically correlated (PC) component while eliminating interference from other components, leading to a significant reduction in the size of bootstrapped confidence intervals. Furthermore, whereas other PC bootstrap methods preserve one but not multiple periodically correlated components, VBPBB preserves several at once, yielding superior performance through a more precise estimation of the sampling distribution of the desired characteristics. A study of the periodic means of Turkey's electricity consumption using VBPBB is presented and compared with outcomes from alternative bootstrapping approaches. The findings offer significant evidence supporting the existence of daily, weekly, and annual PC patterns, along with information on their timing and confidence intervals for their effects. This information is valuable for enhancing predictions of, and preparations for, future electricity consumption.
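The resampling half of the approach can be sketched with a plain periodic block bootstrap: resample whole periods with replacement and form confidence intervals for the within-period means. This is a simplified stand-in for VBPBB (it omits the bandpass filtering step entirely) on synthetic data with an elevated "Monday" phase:

```python
import numpy as np

def periodic_block_bootstrap(x, period, n_boot=2000, seed=0):
    """Resample whole periods with replacement; return bootstrap replicates
    of the within-period (seasonal) mean profile."""
    rng = np.random.default_rng(seed)
    blocks = x[: len(x) // period * period].reshape(-1, period)
    reps = np.empty((n_boot, period))
    for b in range(n_boot):
        idx = rng.integers(0, len(blocks), size=len(blocks))
        reps[b] = blocks[idx].mean(axis=0)
    return reps

# Daily data with a weekly pattern: phase 0 ("Monday") runs ~2 units higher.
rng = np.random.default_rng(1)
x = rng.normal(10.0, 1.0, size=7 * 200)
x[::7] += 2.0

reps = periodic_block_bootstrap(x, period=7)
lo, hi = np.percentile(reps, [2.5, 97.5], axis=0)
print(lo[0] > 11)   # Monday's CI sits clearly above the other days
print(hi[1] < 11)   # a non-elevated day's CI stays near 10
```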
Efficient estimation for a smoothing thin plate spline in a two-dimensional space
Joaquin Cavieres, Michael Karkulik. arXiv - STAT - Other Statistics, 2024-04-02. https://doi.org/arxiv-2404.01902

Using a deterministic framework allows us to estimate a function for interpolating data in spatial statistics. Radial basis functions are commonly used for scattered data interpolation in a $d$-dimensional space; however, the resulting interpolation problems involve dense matrices. For the case of a smoothing thin plate spline, we propose an efficient way to address this problem by compressing the dense matrix into a hierarchical matrix ($\mathcal{H}$-matrix) and using the conjugate gradient method to solve the linear system of equations. A simulation study was conducted to assess the effectiveness of the spatial interpolation method. The results indicate that employing an $\mathcal{H}$-matrix along with the conjugate gradient method allows for efficient computation while maintaining minimal error. We also provide a sensitivity analysis covering a range of smoothing and compression parameter values, along with a Monte Carlo simulation aimed at quantifying uncertainty in the approximated function. Lastly, we present a comparative study between the proposed approach and thin plate regression using the "mgcv" package of the statistical software R. The comparison demonstrates similar interpolation performance between the two methods.
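The iterative-solver ingredient can be illustrated with a conjugate gradient solve of a regularized kernel system. As a deliberately simplified sketch, this uses a Gaussian RBF kernel (positive definite, so plain CG applies directly) and omits both the thin plate spline's polynomial part and the $\mathcal{H}$-matrix compression that the paper actually relies on; the length-scale and smoothing values are arbitrary:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(300, 2))          # scattered 2-D locations
y = np.sin(2 * np.pi * pts[:, 0]) + np.cos(2 * np.pi * pts[:, 1])

# Dense Gaussian RBF kernel matrix plus a smoothing/regularization term.
r2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-r2 / (2 * 0.2 ** 2))
lam = 1e-3
A = K + lam * np.eye(len(pts))

# Solve A c = y iteratively with conjugate gradients.
coef, info = cg(A, y, maxiter=10000)
print(info)                                      # 0 means CG converged

# The coefficients interpolate at new locations via the same kernel.
new = rng.uniform(0, 1, size=(5, 2))
r2_new = ((new[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
pred = np.exp(-r2_new / (2 * 0.2 ** 2)) @ coef
```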
Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization
Benoit Liquet, Sarat Moka, Samuel Muller. arXiv - STAT - Other Statistics, 2024-03-29. https://doi.org/arxiv-2403.20007

Selecting the best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal component analysis and partial least squares. Both are popular linear dimension-reduction methods with numerous applications in fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components: new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from only the most relevant variables, we propose to cast the best subset solution path method into the principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for the best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real datasets, the first using principal component analysis and the second using the partial least squares framework.
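The interpretability problem, and the flavor of building a principal component from a small subset of variables, can be sketched naively: fit a full PCA, keep only the k largest-magnitude loadings, and refit on that subset. This toy thresholding rule is only an illustration; the paper's continuous best-subset path optimization is a far more principled selection mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations, 10 variables; only the first three carry a shared factor.
n, p, k = 100, 10, 3
z = rng.standard_normal((n, 1))
X = 0.1 * rng.standard_normal((n, p))
X[:, :3] += z

Xc = X - X.mean(axis=0)
# Full first principal component via SVD; its loadings mix all 10 variables.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]

# Naive sparse proxy: keep the k largest-magnitude loadings, refit on them.
keep = np.argsort(np.abs(v1))[-k:]
_, _, Vt_sub = np.linalg.svd(Xc[:, keep], full_matrices=False)
sparse_scores = Xc[:, keep] @ Vt_sub[0]
full_scores = Xc @ v1

r = abs(np.corrcoef(sparse_scores, full_scores)[0, 1])
print(sorted(keep.tolist()))   # the three signal-carrying variables
print(round(r, 2))             # near 1: three variables reproduce the full PC
```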
Why Name Popularity is a Good Test of Historicity
Luuk van de Weghe, Jason Wilson. arXiv - STAT - Other Statistics, 2024-03-21. https://doi.org/arxiv-2403.14883

Are name statistics in the Gospels and Acts a good test of historicity? Kamil Gregor and Brian Blais, in a recent article in The Journal for the Study of the Historical Jesus, argue that the sample of name occurrences in the Gospels and Acts is too small to be determinative and that several statistical anomalies weigh against a positive verdict. Unfortunately, their conclusions result directly from improper testing and questionable data selection. Chi-squared goodness-of-fit testing establishes that name occurrences in the Gospels and Acts fit their historical context at least as well as those in the works of Josephus. Additionally, they fit better than occurrences derived from ancient fictional sources and from modern, well-researched historical novels.
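The chi-squared goodness-of-fit approach mentioned above can be sketched mechanically: compare observed category counts against expected counts scaled from external reference proportions. The counts and proportions below are entirely hypothetical and are not the paper's data:

```python
from scipy.stats import chisquare

# Hypothetical observed counts of four name categories in a corpus,
# vs expected proportions taken from an external historical database.
observed = [42, 25, 18, 15]
expected_props = [0.40, 0.27, 0.18, 0.15]
total = sum(observed)
expected = [p * total for p in expected_props]

stat, pval = chisquare(observed, f_exp=expected)
print(round(stat, 3))
print(pval > 0.05)   # fail to reject: counts are consistent with the reference
```

A high p-value here would be read as the corpus's name frequencies "fitting" the reference context; a low one as a mismatch.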
Risk Quadrangle and Robust Optimization Based on $\varphi$-Divergence
Cheng Peng, Anton Malandii, Stan Uryasev. arXiv - STAT - Other Statistics, 2024-03-16. https://doi.org/arxiv-2403.10987

This paper studies robust and distributionally robust optimization based on the extended $\varphi$-divergence under the Fundamental Risk Quadrangle framework. We present the primal and dual representations of the quadrangle elements: risk, deviation, regret, error, and statistic. The framework provides an interpretation of portfolio optimization, classification, and regression as robust optimization. We furnish illustrative examples demonstrating that many common problems are included in this framework. The $\varphi$-divergence risk measure used in distributionally robust optimization is a special case. We conduct a case study to visualize the risk envelope.
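For orientation, the textbook $\varphi$-divergence underlying the distributionally robust special case can be written out; the paper's extended $\varphi$-divergence generalizes this standard definition:

```latex
D_{\varphi}(Q \,\|\, P) \;=\; \mathbb{E}_P\!\left[\varphi\!\left(\frac{dQ}{dP}\right)\right],
\qquad \varphi \text{ convex},\ \varphi(1) = 0,
\qquad
\mathcal{R}(X) \;=\; \sup_{Q \,:\, D_{\varphi}(Q \| P) \,\le\, \rho} \mathbb{E}_Q[X].
```

Here $\mathcal{R}(X)$ is the worst-case expected loss over the ball of distributions within divergence $\rho$ of the nominal $P$, the usual form of a $\varphi$-divergence risk measure in distributionally robust optimization.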
Algorithmic syntactic causal identification
Dhurim Cakiqi, Max A. Little. arXiv - STAT - Other Statistics, 2024-03-14. https://doi.org/arxiv-2403.09580

Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference, allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification, using techniques such as d-separation and do-calculus, are expressed within the mathematical language of classical probability theory on CBNs. Yet there are many causal settings where probability theory, and hence current causal identification techniques, are inapplicable, such as relational databases, dataflow programs such as hardware description languages, distributed systems, and most modern machine learning algorithms. We show that this restriction can be lifted by replacing classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model is obtained. We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.
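The classical back-door adjustment whose syntactic analogue the paper derives computes the interventional distribution as P(y | do(x)) = Σ_z P(y | x, z) P(z). A minimal numeric sketch on a hypothetical binary model (Z confounds X and Y) makes the formula concrete; the probability tables are invented for illustration:

```python
import numpy as np

# Hypothetical discrete model: binary confounder Z -> X and Z -> Y, plus X -> Y.
p_z = np.array([0.6, 0.4])                           # P(Z)
p_y_given_xz = np.array([[[0.9, 0.1], [0.5, 0.5]],   # Z=0: rows X=0, X=1
                         [[0.6, 0.4], [0.2, 0.8]]])  # Z=1: rows X=0, X=1

def p_y_do_x(x):
    """Back-door adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z)."""
    return sum(p_z[z] * p_y_given_xz[z, x] for z in range(2))

print(p_y_do_x(0))   # [0.78 0.22]
print(p_y_do_x(1))   # [0.38 0.62]
```

Note that the adjustment averages over the marginal P(z), not the conditional P(z | x), which is exactly what distinguishes the interventional from the observational distribution.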
Guidelines for the Creation of Analysis Ready Data
Harriette Phillips, Aiden Price, Owen Forbes, Claire Boulange, Kerrie Mengersen, Marketa Reeves, Rebecca Glauert. arXiv - STAT - Other Statistics, 2024-03-12. https://doi.org/arxiv-2403.08127

Globally, there is an increased need for guidelines to produce high-quality data outputs for analysis. No framework currently exists that provides guidelines for a comprehensive approach to producing analysis ready data (ARD). Through critically reviewing and summarising the current literature, this paper proposes such guidelines for the creation of ARD. The proposed guidelines comprise ten steps in the generation of ARD: ethics, project documentation, data governance, data management, data storage, data discovery and collection, data cleaning, quality assurance, metadata, and data dictionary. These steps are illustrated through a substantive case study which aimed to create ARD for a digital spatial platform: the Australian Child and Youth Wellbeing Atlas (ACYWA).