The least squares (LS) estimator of the autoregressive coefficient in the bifurcating autoregressive (BAR) model was recently shown to suffer from substantial bias, especially for small to moderate samples. This study investigates the impact of the bias in the LS estimator on the behavior of various types of bootstrap confidence intervals for the autoregressive coefficient and introduces methods for constructing bias-corrected bootstrap confidence intervals. We first describe several bootstrap confidence interval procedures for the autoregressive coefficient of the BAR model and present their bias-corrected versions. The behavior of uncorrected and corrected confidence interval procedures is studied empirically through extensive Monte Carlo simulations and two real cell lineage data applications. The empirical results show that the bias in the LS estimator can have a significant negative impact on the behavior of bootstrap confidence intervals and that bias correction can significantly improve the performance of bootstrap confidence intervals in terms of coverage, width, and symmetry.
{"title":"Impact of Bias Correction of the Least Squares Estimation on Bootstrap Confidence Intervals for Bifurcating Autoregressive Models","authors":"T. Elbayoumi, S. Mostafa","doi":"10.6339/23-jds1092","DOIUrl":"https://doi.org/10.6339/23-jds1092","url":null,"abstract":"The least squares (LS) estimator of the autoregressive coefficient in the bifurcating autoregressive (BAR) model was recently shown to suffer from substantial bias, especially for small to moderate samples. This study investigates the impact of the bias in the LS estimator on the behavior of various types of bootstrap confidence intervals for the autoregressive coefficient and introduces methods for constructing bias-corrected bootstrap confidence intervals. We first describe several bootstrap confidence interval procedures for the autoregressive coefficient of the BAR model and present their bias-corrected versions. The behavior of uncorrected and corrected confidence interval procedures is studied empirically through extensive Monte Carlo simulations and two real cell lineage data applications. The empirical results show that the bias in the LS estimator can have a significant negative impact on the behavior of bootstrap confidence intervals and that bias correction can significantly improve the performance of bootstrap confidence intervals in terms of coverage, width, and symmetry.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The reputation of the maximum pseudolikelihood estimator (MPLE) for Exponential Random Graph Models (ERGM) has undergone a drastic change over the past 30 years. The MPLE first received broad support, mainly due to its computational feasibility and the lack of alternatives, but opinions started to change with the introduction of approximate maximum likelihood estimator (MLE) methods, which became practicable thanks to increasing computing power and MCMC methods. Previous comparison studies appear to yield contradicting results regarding the preference between these two point estimators; however, there is consensus that the prevailing method of obtaining an MPLE's standard errors from the inverse Hessian matrix generally underestimates them. We propose replacing the inverse Hessian matrix with an approximation of the Godambe matrix, which results in confidence intervals with appropriate coverage rates and, in addition, makes it possible to check for model degeneracy. Our results also provide empirical evidence for the asymptotic normality of the MPLE under certain conditions.
{"title":"Computing Pseudolikelihood Estimators for Exponential-Family Random Graph Models","authors":"Christian S. Schmid, David R. Hunter","doi":"10.6339/23-jds1094","DOIUrl":"https://doi.org/10.6339/23-jds1094","url":null,"abstract":"The reputation of the maximum pseudolikelihood estimator (MPLE) for Exponential Random Graph Models (ERGM) has undergone a drastic change over the past 30 years. While first receiving broad support, mainly due to its computational feasibility and the lack of alternatives, general opinions started to change with the introduction of approximate maximum likelihood estimator (MLE) methods that became practicable due to increasing computing power and the introduction of MCMC methods. Previous comparison studies appear to yield contradicting results regarding the preference of these two point estimators; however, there is consensus that the prevailing method to obtain an MPLE’s standard error by the inverse Hessian matrix generally underestimates standard errors. We propose replacing the inverse Hessian matrix by an approximation of the Godambe matrix that results in confidence intervals with appropriate coverage rates and that, in addition, enables examining for model degeneracy. Our results also provide empirical evidence for the asymptotic normality of the MPLE under certain conditions.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vaccine efficacy is a key index for evaluating vaccines in initial clinical trials during vaccine development. In particular, it plays a crucial role in authorizing Covid-19 vaccines. It has been reported that Covid-19 vaccine efficacy varies with a number of factors, including the demographics of the population, time after vaccine administration, and virus strains. By examining clinical trial data from three Covid-19 vaccine studies, we find that the current approach of evaluating vaccines with a single overall efficacy does not provide the desired accuracy. It specifies no time frame during which a candidate vaccine is evaluated and is subject to misuse, potentially resulting in misleading information and interpretation. In particular, we illustrate with clinical trial data that the variability of vaccine efficacy is underestimated. We demonstrate that a new method may help to address these caveats. It leads to accurate estimation of the variation of efficacy, provides useful information for defining a reasonable time frame to evaluate vaccines, and avoids misuse of vaccine efficacy and misleading information.
{"title":"Covid-19 Vaccine Efficacy: Accuracy Assessment, Comparison, and Caveats","authors":"Wenjiang J. Fu, Jieni Li, P. Scheet","doi":"10.6339/23-jds1089","DOIUrl":"https://doi.org/10.6339/23-jds1089","url":null,"abstract":"Vaccine efficacy is a key index to evaluate vaccines in initial clinical trials during the development of vaccines. In particular, it plays a crucial role in authorizing Covid-19 vaccines. It has been reported that Covid-19 vaccine efficacy varies with a number of factors, including demographics of population, time after vaccine administration, and virus strains. By examining clinical trial data of three Covid-19 vaccine studies, we find that current approach to evaluating vaccines with an overall efficacy does not provide desired accuracy. It requires no time frame during which a candidate vaccine is evaluated, and is subject to misuse, resulting in potential misleading information and interpretation. In particular, we illustrate with clinical trial data that the variability of vaccine efficacy is underestimated. We demonstrate that a new method may help to address these caveats. It leads to accurate estimation of the variation of efficacy, provides useful information to define a reasonable time frame to evaluate vaccines, and avoids misuse of vaccine efficacy and misleading information.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has increased rapidly with the development of data collection technologies. As a result, classical statistical methods in spatial statistics are facing computational challenges. For example, the kriging predictor in geostatistics becomes prohibitive on traditional hardware architectures for large datasets, as it requires high computing power and a large memory footprint when dealing with large dense matrix operations. Over the years, various approximation methods have been proposed to address such computational issues; however, the community lacks a holistic process to assess their approximation efficiency. To provide a fair assessment, in 2021 we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report estimation and prediction results. Thanks to its widely acknowledged success, and at the request of many participants, we organized a second competition in 2022 focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe the data generation procedure in detail and make the datasets publicly available for wider adoption. We then review the methods submitted by fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.
{"title":"The Second Competition on Spatial Statistics for Large Datasets","authors":"Sameh Abdulah, Faten S. Alamri, Pratik Nag, Ying Sun, H. Ltaief, D. Keyes, M. Genton","doi":"10.6339/22-jds1076","DOIUrl":"https://doi.org/10.6339/22-jds1076","url":null,"abstract":"In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has rapidly increased with the development of data collection technologies. As a result, classical statistical methods in spatial statistics are facing computational challenges. For example, the kriging predictor in geostatistics becomes prohibitive on traditional hardware architectures for large datasets as it requires high computing power and memory footprint when dealing with large dense matrix operations. Over the years, various approximation methods have been proposed to address such computational issues, however, the community lacks a holistic process to assess their approximation efficiency. To provide a fair assessment, in 2021, we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report the results of estimation and prediction. Thanks to its widely acknowledged success and at the request of many participants, we organized the second competition in 2022 focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe in detail the data generation procedure and make the valuable datasets publicly available for a wider adoption. Then, we review the submitted methods from fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42045278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We describe our implementation of the multivariate Matérn model for multivariate spatial datasets, using Vecchia’s approximation and a Fisher scoring optimization algorithm. We consider various parameterizations for the multivariate Matérn that have been proposed in the literature for ensuring model validity, as well as an unconstrained model. A strength of our study is that the code is tested on many real-world multivariate spatial datasets. We use it to study the effect of ordering and conditioning in Vecchia’s approximation and the restrictions imposed by the various parameterizations. We also consider a model in which co-located nuggets are correlated across components and find that forcing this cross-component nugget correlation to be zero can have a serious impact on the other model parameters, so we suggest allowing cross-component correlation in co-located nugget terms.
{"title":"Vecchia Approximations and Optimization for Multivariate Matérn Models","authors":"Youssef A. Fahmy, J. Guinness","doi":"10.6339/22-jds1074","DOIUrl":"https://doi.org/10.6339/22-jds1074","url":null,"abstract":"We describe our implementation of the multivariate Matérn model for multivariate spatial datasets, using Vecchia’s approximation and a Fisher scoring optimization algorithm. We consider various pararameterizations for the multivariate Matérn that have been proposed in the literature for ensuring model validity, as well as an unconstrained model. A strength of our study is that the code is tested on many real-world multivariate spatial datasets. We use it to study the effect of ordering and conditioning in Vecchia’s approximation and the restrictions imposed by the various parameterizations. We also consider a model in which co-located nuggets are correlated across components and find that forcing this cross-component nugget correlation to be zero can have a serious impact on the other model parameters, so we suggest allowing cross-component correlation in co-located nugget terms.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44042544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as the planet Earth) and they often present nonstationary anisotropies. This paper proposes a generic approach to modeling Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing work on nonstationary GRFs and random fields on manifolds. The approach can be applied to any smooth compact manifold, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of the nonstationary geometric anisotropies as resulting from local deformations of the domain. We provide scalable algorithms, able to tackle very large grids, for parameter estimation and for optimal prediction by kriging and simulation. Stationary and nonstationary illustrations are provided.
{"title":"Geostatistics for Large Datasets on Riemannian Manifolds: A Matrix-Free Approach","authors":"M. Pereira, N. Desassis, D. Allard","doi":"10.6339/22-jds1075","DOIUrl":"https://doi.org/10.6339/22-jds1075","url":null,"abstract":"Large or very large spatial (and spatio-temporal) datasets have become common place in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as the planet Earth) and they often present nonstationary anisotropies. This paper proposes a generic approach to model Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing works on nonstationary GRFs and random fields on manifolds. This approach can be applied to any smooth compact manifolds, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of the nonstationary geometric anisotropies as resulting from local deformations of the domain. We provide scalable algorithms for the estimation of the parameters and for optimal prediction by kriging and simulation able to tackle very large grids. Stationary and nonstationary illustrations are provided.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46734367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling heterogeneity in heavy-tailed distributions under a regression framework is challenging, and classical statistical methodologies usually place conditions on the distribution models to facilitate the learning procedure. However, these conditions are likely to overlook the complex dependence structure between the heaviness of tails and the covariates. Moreover, data sparsity in tail regions makes the inference method less stable, leading to biased estimates for extreme-related quantities. This paper proposes a gradient boosting algorithm to estimate a functional extreme value index with heterogeneous extremes. The proposed algorithm is a data-driven procedure that captures complex and dynamic structures in tail distributions. We also conduct extensive simulation studies to show the prediction accuracy of the proposed algorithm. In addition, we apply our method to a real-world data set to illustrate the state-dependent and time-varying properties of heavy-tail phenomena in the financial industry.
{"title":"EVIboost for the Estimation of Extreme Value Index Under Heterogeneous Extremes","authors":"Jiaxi Wang, Yanxi Hou, Xingchi Li, Tiandong Wang","doi":"10.6339/22-jds1067","DOIUrl":"https://doi.org/10.6339/22-jds1067","url":null,"abstract":"Modeling heterogeneity on heavy-tailed distributions under a regression framework is challenging, yet classical statistical methodologies usually place conditions on the distribution models to facilitate the learning procedure. However, these conditions will likely overlook the complex dependence structure between the heaviness of tails and the covariates. Moreover, data sparsity on tail regions makes the inference method less stable, leading to biased estimates for extreme-related quantities. This paper proposes a gradient boosting algorithm to estimate a functional extreme value index with heterogeneous extremes. Our proposed algorithm is a data-driven procedure capturing complex and dynamic structures in tail distributions. We also conduct extensive simulation studies to show the prediction accuracy of the proposed algorithm. In addition, we apply our method to a real-world data set to illustrate the state-dependent and time-varying properties of heavy-tail phenomena in the financial industry.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43556826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for K-class problems (Wu et al., 2010; Wang et al., 2019), where K is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in K. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in K. Though not the most efficient in computation, the OVA is found to have the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate their finite sample performance.
{"title":"Linear Algorithms for Robust and Scalable Nonparametric Multiclass Probability Estimation","authors":"Liyun Zeng, Hao Helen Zhang","doi":"10.6339/22-jds1069","DOIUrl":"https://doi.org/10.6339/22-jds1069","url":null,"abstract":"Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for K-class problems (Wu et al., 2010; Wang et al., 2019), where K is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in K. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in K. Though not the most efficient in computation, the OVA is found to have the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate their finite sample performance.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44600781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The COVID-19 outbreak of 2020 has required many governments to develop and adopt mathematical-statistical models of the pandemic for policy and planning purposes. To this end, this work provides a tutorial on building a compartmental model that tracks Susceptible, Exposed, Infected, Recovered, Deaths and Vaccinated (SEIRDV) status through time. The proposed model uses interventions to quantify the impact of various government attempts to slow the spread of the virus. Furthermore, a vaccination parameter is incorporated in the model, which remains inactive until the time the vaccine is deployed. A Bayesian framework is utilized to perform both parameter estimation and prediction. Predictions are made to determine when peak active infections occur. We provide inferential frameworks for assessing the effects of government interventions on the dynamic progression of the pandemic, including the impact of vaccination. The proposed model also allows quantification of the number of excess deaths averted over the study period due to vaccination.
{"title":"Incorporating Interventions to an Extended SEIRD Model with Vaccination: Application to COVID-19 in Qatar","authors":"Elizabeth B Amona, R. Ghanam, E. Boone, Indranil Sahoo, L. Abu-Raddad","doi":"10.6339/23-JDS1105","DOIUrl":"https://doi.org/10.6339/23-JDS1105","url":null,"abstract":"The COVID-19 outbreak of 2020 has required many governments to develop and adopt mathematical-statistical models of the pandemic for policy and planning purposes. To this end, this work provides a tutorial on building a compartmental model using Susceptible, Exposed, Infected, Recovered, Deaths and Vaccinated (SEIRDV) status through time. The proposed model uses interventions to quantify the impact of various government attempts made to slow the spread of the virus. Furthermore, a vaccination parameter is also incorporated in the model, which is inactive until the time the vaccine is deployed. A Bayesian framework is utilized to perform both parameter estimation and prediction. Predictions are made to determine when the peak Active Infections occur. We provide inferential frameworks for assessing the effects of government interventions on the dynamic progression of the pandemic, including the impact of vaccination. The proposed model also allows for quantification of number of excess deaths averted over the study period due to vaccination.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43542129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct discovery methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data-generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error trade-off is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: a causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data, considering several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative, and less sensitive to sample size than existing procedures. We furthermore provide a real epidemiological data application, using random subsampling to investigate real-data performance on small samples; again, SLdisco is less sensitive to sample size and hence seems to better utilize the information available in small datasets.
{"title":"Causal Discovery for Observational Sciences Using Supervised Machine Learning","authors":"A. H. Petersen, J. Ramsey, C. Ekstrøm, P. Spirtes","doi":"10.6339/23-jds1088","DOIUrl":"https://doi.org/10.6339/23-jds1088","url":null,"abstract":"Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct discovery methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error trade off is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: A causal model with many missing causal relations entails too strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45032104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}