To democratize research with sensitive data, we should make synthetic data more accessible
Erik-Jan van Kesteren. arXiv - STAT - Other Statistics, 2024-04-26. https://doi.org/arxiv-2404.17271

For over 30 years, synthetic data has been heralded as a promising solution to make sensitive datasets accessible. However, despite much research effort and several high-profile use cases, the widespread adoption of synthetic data as a tool for open, accessible, reproducible research with sensitive data is still a distant dream. In this opinion, Erik-Jan van Kesteren, head of the ODISSEI Social Data Science team, argues that to progress towards widespread adoption of synthetic data as a privacy-enhancing technology, the data science research community should shift focus away from developing better synthesis methods: instead, it should develop accessible tools, educate peers, and publish small-scale case studies.
An Investigation into Distance Measures in Cluster Analysis
Zoe Shapcott. arXiv - STAT - Other Statistics, 2024-04-21. https://doi.org/arxiv-2404.13664

This report explores different distance measures that can be used with the $K$-means algorithm for cluster analysis. Specifically, we investigate the Mahalanobis distance and critically assess any benefits it may have over the more traditional Euclidean, Manhattan and Maximum distances. We first define the metrics, then consider their advantages and drawbacks as discussed in the literature. We apply these distances, first to simulated data and then to subsets of the Dry Bean dataset [1], to explore whether any one metric yields detectably better clusters than the others in these cases. One section is devoted to analysing the information obtained from ChatGPT in response to prompts on this topic.
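The four metrics compared in the report can be sketched in a few lines. This is a generic illustration with hypothetical toy points, not the report's own code; with an identity covariance matrix, the Mahalanobis distance reduces to the Euclidean distance, which the example demonstrates:

```python
import numpy as np

def mahalanobis(x, y, vi):
    """Mahalanobis distance, given the inverse covariance matrix vi."""
    d = x - y
    return float(np.sqrt(d @ vi @ d))

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def manhattan(x, y):
    return float(np.abs(x - y).sum())

def maximum(x, y):
    """Maximum (Chebyshev) distance: largest coordinate difference."""
    return float(np.abs(x - y).max())

# Toy points; vi is the inverse of an identity covariance matrix.
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
vi = np.linalg.inv(np.eye(2))

print(euclidean(x, y))        # 5.0
print(manhattan(x, y))        # 7.0
print(maximum(x, y))          # 4.0
print(mahalanobis(x, y, vi))  # 5.0 (equals Euclidean under identity covariance)
```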
Seasonal and Periodic Patterns of PM2.5 in Manhattan using the Variable Bandpass Periodic Block Bootstrap
Yanan Sun, Edward Valachovic. arXiv - STAT - Other Statistics, 2024-04-12. https://doi.org/arxiv-2404.08738

Air quality is a critical component of environmental health. Monitoring and analysis of particulate matter with a diameter of 2.5 micrometers or smaller (PM2.5) plays a pivotal role in understanding air quality changes. This study focuses on the application of a new bandpass bootstrap approach, termed the Variable Bandpass Periodic Block Bootstrap (VBPBB), to a time series of modeled predictions of daily mean PM2.5 concentrations over 16 years in Manhattan, New York, United States. The VBPBB can be used to explore periodically correlated (PC) principal components of this daily mean PM2.5 dataset. The method uses bandpass filters to isolate distinct PC components, removing unwanted interference including noise, and bootstraps the PC components. This preserves the PC structure and permits a better understanding of the periodic characteristics of time series data. The results of the VBPBB are compared against outcomes from alternative block bootstrapping techniques. The findings indicate potential trends of elevated PM2.5 levels and provide evidence of significant semi-annual and weekly patterns missed by other methods.
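The first half of the VBPBB idea, isolating one periodically correlated component with a bandpass filter before resampling, can be illustrated with a simplified FFT-based filter on synthetic daily data. This is a hedged sketch, not the authors' implementation; the band edges and the toy series are arbitrary choices made for the example:

```python
import numpy as np

def fft_bandpass(x, low, high, fs=1.0):
    """Keep only frequency content in [low, high] (cycles per sample); zero the rest."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= low) & (freqs <= high)
    return np.fft.irfft(X * mask, n=len(x))

# Daily series: a weekly cycle buried under a larger annual cycle plus noise.
rng = np.random.default_rng(0)
n = 7 * 52 * 4                       # four years of daily observations
t = np.arange(n)
weekly = np.sin(2 * np.pi * t / 7)
annual = 3 * np.sin(2 * np.pi * t / 365.25)
x = weekly + annual + 0.5 * rng.standard_normal(n)

# Isolate the weekly component: pass a narrow band around 1/7 cycles/day.
iso = fft_bandpass(x, 1 / 7 - 0.01, 1 / 7 + 0.01)

# The filtered series should track the true weekly cycle closely.
r = np.corrcoef(iso, weekly)[0, 1]
print(round(r, 2))
```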
Non-Parametric Estimation of Multiple Periodic Components in Turkey's Electricity Consumption
Jie Yao, Edward Valachovic. arXiv - STAT - Other Statistics, 2024-04-04. https://doi.org/arxiv-2404.03786

Electricity generation and consumption are essential components of contemporary living, influencing diverse facets of our daily routines, convenience, and economic progress. There is a high demand for characterizing the periodic patterns of electricity consumption. The Variable Bandpass Periodic Block Bootstrap (VBPBB) employs a bandpass filter aligned to retain the frequency of a periodically correlated (PC) component while eliminating interference from other components, leading to a significant reduction in the size of bootstrapped confidence intervals. Furthermore, whereas other PC bootstrap methods preserve one but not multiple periodically correlated components, VBPBB preserves several at once, yielding superior performance through a more precise estimation of the sampling distribution of the desired characteristics. A study of the periodic means of Turkey's electricity consumption using VBPBB is presented and compared with outcomes from alternative bootstrapping approaches. The findings offer significant evidence supporting the existence of daily, weekly, and annual PC patterns, along with information on their timing and confidence intervals for their effects. This information is valuable for enhancing predictions of, and preparations for, future electricity consumption.
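The resampling half of the approach can be sketched with a plain periodic block bootstrap: resample whole periods with replacement and form confidence intervals for the within-period means. This is a simplified stand-in for VBPBB (it omits the bandpass filtering step entirely) on synthetic data with an elevated "Monday" phase:

```python
import numpy as np

def periodic_block_bootstrap(x, period, n_boot=2000, seed=0):
    """Resample whole periods with replacement; return bootstrap replicates
    of the within-period (seasonal) mean profile."""
    rng = np.random.default_rng(seed)
    blocks = x[: len(x) // period * period].reshape(-1, period)
    reps = np.empty((n_boot, period))
    for b in range(n_boot):
        idx = rng.integers(0, len(blocks), size=len(blocks))
        reps[b] = blocks[idx].mean(axis=0)
    return reps

# Daily data with a weekly pattern: phase 0 ("Monday") runs ~2 units higher.
rng = np.random.default_rng(1)
x = rng.normal(10.0, 1.0, size=7 * 200)
x[::7] += 2.0

reps = periodic_block_bootstrap(x, period=7)
lo, hi = np.percentile(reps, [2.5, 97.5], axis=0)
print(lo[0] > 11)   # Monday's CI sits clearly above the other days
print(hi[1] < 11)   # a non-elevated day's CI stays near 10
```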
Efficient estimation for a smoothing thin plate spline in a two-dimensional space
Joaquin Cavieres, Michael Karkulik. arXiv - STAT - Other Statistics, 2024-04-02. https://doi.org/arxiv-2404.01902

Using a deterministic framework allows us to estimate a function for interpolating data in spatial statistics. Radial basis functions are commonly used for scattered data interpolation in a $d$-dimensional space; however, the resulting interpolation problems involve dense matrices. For the case of a smoothing thin plate spline, we propose an efficient way to address this problem by compressing the dense matrix into a hierarchical matrix ($\mathcal{H}$-matrix) and using the conjugate gradient method to solve the linear system of equations. A simulation study was conducted to assess the effectiveness of the spatial interpolation method. The results indicate that employing an $\mathcal{H}$-matrix along with the conjugate gradient method allows for efficient computation while maintaining minimal error. We also provide a sensitivity analysis covering a range of smoothing and compression parameter values, along with a Monte Carlo simulation aimed at quantifying uncertainty in the approximated function. Lastly, we present a comparative study between the proposed approach and thin plate regression using the "mgcv" package of the statistical software R. The comparison demonstrates similar interpolation performance between the two methods.
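The iterative-solver ingredient can be illustrated with a conjugate gradient solve of a regularized kernel system. As a deliberately simplified sketch, this uses a Gaussian RBF kernel (positive definite, so plain CG applies directly) and omits both the thin plate spline's polynomial part and the $\mathcal{H}$-matrix compression that the paper actually relies on; the length-scale and smoothing values are arbitrary:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1, size=(300, 2))          # scattered 2-D locations
y = np.sin(2 * np.pi * pts[:, 0]) + np.cos(2 * np.pi * pts[:, 1])

# Dense Gaussian RBF kernel matrix plus a smoothing/regularization term.
r2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
K = np.exp(-r2 / (2 * 0.2 ** 2))
lam = 1e-3
A = K + lam * np.eye(len(pts))

# Solve A c = y iteratively with conjugate gradients.
coef, info = cg(A, y, maxiter=10000)
print(info)                                      # 0 means CG converged

# The coefficients interpolate at new locations via the same kernel.
new = rng.uniform(0, 1, size=(5, 2))
r2_new = ((new[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
pred = np.exp(-r2_new / (2 * 0.2 ** 2)) @ coef
```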
Best Subset Solution Path for Linear Dimension Reduction Models using Continuous Optimization
Benoit Liquet, Sarat Moka, Samuel Muller. arXiv - STAT - Other Statistics, 2024-03-29. https://doi.org/arxiv-2403.20007

Selecting the best variables is a challenging problem in supervised and unsupervised learning, especially in high-dimensional contexts where the number of variables is usually much larger than the number of observations. In this paper, we focus on two multivariate statistical methods: principal component analysis and partial least squares. Both are popular linear dimension-reduction methods with numerous applications in fields including genomics, biology, environmental science, and engineering. In particular, these approaches build principal components: new variables that are combinations of all the original variables. A main drawback of principal components is the difficulty of interpreting them when the number of variables is large. To define principal components from only the most relevant variables, we propose to cast the best subset solution path method into the principal component analysis and partial least squares frameworks. We offer a new alternative by exploiting a continuous optimization algorithm for the best subset solution path. Empirical studies show the efficacy of our approach for providing the best subset solution path. The usage of our algorithm is further illustrated through the analysis of two real datasets, the first using principal component analysis and the second using the partial least squares framework.
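The interpretability problem, and the flavor of building a principal component from a small subset of variables, can be sketched naively: fit a full PCA, keep only the k largest-magnitude loadings, and refit on that subset. This toy thresholding rule is only an illustration; the paper's continuous best-subset path optimization is a far more principled selection mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 observations, 10 variables; only the first three carry a shared factor.
n, p, k = 100, 10, 3
z = rng.standard_normal((n, 1))
X = 0.1 * rng.standard_normal((n, p))
X[:, :3] += z

Xc = X - X.mean(axis=0)
# Full first principal component via SVD; its loadings mix all 10 variables.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
v1 = Vt[0]

# Naive sparse proxy: keep the k largest-magnitude loadings, refit on them.
keep = np.argsort(np.abs(v1))[-k:]
_, _, Vt_sub = np.linalg.svd(Xc[:, keep], full_matrices=False)
sparse_scores = Xc[:, keep] @ Vt_sub[0]
full_scores = Xc @ v1

r = abs(np.corrcoef(sparse_scores, full_scores)[0, 1])
print(sorted(keep.tolist()))   # the three signal-carrying variables
print(round(r, 2))             # near 1: three variables reproduce the full PC
```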
Why Name Popularity is a Good Test of Historicity
Luuk van de Weghe, Jason Wilson. arXiv - STAT - Other Statistics, 2024-03-21. https://doi.org/arxiv-2403.14883

Are name statistics in the Gospels and Acts a good test of historicity? Kamil Gregor and Brian Blais, in a recent article in The Journal for the Study of the Historical Jesus, argue that the sample of name occurrences in the Gospels and Acts is too small to be determinative and that several statistical anomalies weigh against a positive verdict. Unfortunately, their conclusions result directly from improper testing and questionable data selection. Chi-squared goodness-of-fit testing establishes that name occurrences in the Gospels and Acts fit their historical context at least as well as those in the works of Josephus. Additionally, they fit better than occurrences derived from ancient fictional sources and from modern, well-researched historical novels.
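The chi-squared goodness-of-fit approach mentioned above can be sketched mechanically: compare observed category counts against expected counts scaled from external reference proportions. The counts and proportions below are entirely hypothetical and are not the paper's data:

```python
from scipy.stats import chisquare

# Hypothetical observed counts of four name categories in a corpus,
# vs expected proportions taken from an external historical database.
observed = [42, 25, 18, 15]
expected_props = [0.40, 0.27, 0.18, 0.15]
total = sum(observed)
expected = [p * total for p in expected_props]

stat, pval = chisquare(observed, f_exp=expected)
print(round(stat, 3))
print(pval > 0.05)   # fail to reject: counts are consistent with the reference
```

A high p-value here would be read as the corpus's name frequencies "fitting" the reference context; a low one as a mismatch.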
Risk Quadrangle and Robust Optimization Based on $\varphi$-Divergence
Cheng Peng, Anton Malandii, Stan Uryasev. arXiv - STAT - Other Statistics, 2024-03-16. https://doi.org/arxiv-2403.10987

This paper studies robust and distributionally robust optimization based on the extended $\varphi$-divergence under the Fundamental Risk Quadrangle framework. We present the primal and dual representations of the quadrangle elements: risk, deviation, regret, error, and statistic. The framework provides an interpretation of portfolio optimization, classification, and regression as robust optimization. We furnish illustrative examples demonstrating that many common problems are included in this framework. The $\varphi$-divergence risk measure used in distributionally robust optimization is a special case. We conduct a case study to visualize the risk envelope.
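For orientation, the textbook $\varphi$-divergence underlying the distributionally robust special case can be written out; the paper's extended $\varphi$-divergence generalizes this standard definition:

```latex
D_{\varphi}(Q \,\|\, P) \;=\; \mathbb{E}_P\!\left[\varphi\!\left(\frac{dQ}{dP}\right)\right],
\qquad \varphi \text{ convex},\ \varphi(1) = 0,
\qquad
\mathcal{R}(X) \;=\; \sup_{Q \,:\, D_{\varphi}(Q \| P) \,\le\, \rho} \mathbb{E}_Q[X].
```

Here $\mathcal{R}(X)$ is the worst-case expected loss over the ball of distributions within divergence $\rho$ of the nominal $P$, the usual form of a $\varphi$-divergence risk measure in distributionally robust optimization.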
Algorithmic syntactic causal identification
Dhurim Cakiqi, Max A. Little. arXiv - STAT - Other Statistics, 2024-03-14. https://doi.org/arxiv-2403.09580

Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference, allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification, using techniques such as d-separation and do-calculus, are expressed within the mathematical language of classical probability theory on CBNs. Yet there are many causal settings where probability theory, and hence current causal identification techniques, are inapplicable, such as relational databases, dataflow programs such as hardware description languages, distributed systems, and most modern machine learning algorithms. We show that this restriction can be lifted by replacing classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model is obtained. We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.
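The classical back-door adjustment whose syntactic analogue the paper derives computes the interventional distribution as P(y | do(x)) = Σ_z P(y | x, z) P(z). A minimal numeric sketch on a hypothetical binary model (Z confounds X and Y) makes the formula concrete; the probability tables are invented for illustration:

```python
import numpy as np

# Hypothetical discrete model: binary confounder Z -> X and Z -> Y, plus X -> Y.
p_z = np.array([0.6, 0.4])                           # P(Z)
p_y_given_xz = np.array([[[0.9, 0.1], [0.5, 0.5]],   # Z=0: rows X=0, X=1
                         [[0.6, 0.4], [0.2, 0.8]]])  # Z=1: rows X=0, X=1

def p_y_do_x(x):
    """Back-door adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z)."""
    return sum(p_z[z] * p_y_given_xz[z, x] for z in range(2))

print(p_y_do_x(0))   # [0.78 0.22]
print(p_y_do_x(1))   # [0.38 0.62]
```

Note that the adjustment averages over the marginal P(z), not the conditional P(z | x), which is exactly what distinguishes the interventional from the observational distribution.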
Guidelines for the Creation of Analysis Ready Data
Harriette Phillips, Aiden Price, Owen Forbes, Claire Boulange, Kerrie Mengersen, Marketa Reeves, Rebecca Glauert. arXiv - STAT - Other Statistics, 2024-03-12. https://doi.org/arxiv-2403.08127

Globally, there is an increased need for guidelines to produce high-quality data outputs for analysis. No framework currently exists that provides guidelines for a comprehensive approach to producing analysis ready data (ARD). Through critically reviewing and summarising the current literature, this paper proposes such guidelines for the creation of ARD. The proposed guidelines comprise ten steps in the generation of ARD: ethics, project documentation, data governance, data management, data storage, data discovery and collection, data cleaning, quality assurance, metadata, and data dictionary. These steps are illustrated through a substantive case study which aimed to create ARD for a digital spatial platform: the Australian Child and Youth Wellbeing Atlas (ACYWA).