
Journal of Data Science (JDS): Latest Articles

FROSTY: A High-Dimensional Scale-Free Bayesian Network Learning Method
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1097
Joshua Bang, Sang-Yun Oh
We propose a scalable Bayesian network learning algorithm based on sparse Cholesky decomposition. Our approach requires only observational data and a user-specified confidence level as inputs and can estimate networks with thousands of variables. The computational complexity of the proposed method is $O({p^{3}})$ for a graph with p vertices. Extensive numerical experiments illustrate the usefulness of our method with promising results. In simulation, the initial step in our approach also improves an alternative Bayesian network structure estimation method that uses an undirected graph as an input.
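The connection between Cholesky factors and directed graphs that this abstract builds on can be illustrated with a toy example. This is only a sketch of the general idea, not the FROSTY algorithm itself: under a fixed variable ordering, the sparsity pattern of the lower-triangular Cholesky factor of a precision matrix can be read as a set of directed edges.

```python
import numpy as np

# Toy illustration (not FROSTY): a nonzero entry L[i, j] with i > j in the
# Cholesky factor of a precision matrix corresponds to a directed edge
# j -> i under the chosen variable ordering.
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
L = np.linalg.cholesky(Omega)  # lower-triangular factor, Omega = L @ L.T
edges = [(j, i) for i in range(3) for j in range(i) if abs(L[i, j]) > 1e-8]
print(edges)  # the chain structure 0 -> 1 -> 2 is recovered
```

For this tridiagonal precision matrix the factor has no fill-in, so the edge set matches the chain structure of the model.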
Citations: 2
Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1096
Zhaoxing Wu, Chunming Zhang
Analyzing “large p small n” data is becoming increasingly paramount in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($\mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, is devised in Lee and Cook (2010) to classify high-dimensional data with promising results. Yet, there is little information available about its performance compared with the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $\mathrm{PDA}$ index with the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the $\mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, help statisticians and data scientists make effective use of both sets of classification tools.
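For intuition, a minimal one-dimensional projection index in the spirit of the LDA index (a simplified sketch, not the exact PDA or LDA indices of Lee and Cook, 2010) can be written as the ratio of between-group to total sum of squares of the projected data; projection pursuit then searches for the direction maximizing this index.

```python
import numpy as np

# Hedged sketch of a 1-D LDA-style projection index: larger values mean
# better class separation along the direction a.
def lda_index_1d(X, y, a):
    z = X @ (a / np.linalg.norm(a))  # project data onto unit vector
    total = np.sum((z - z.mean()) ** 2)
    between = sum(len(z[y == g]) * (z[y == g].mean() - z.mean()) ** 2
                  for g in np.unique(y))
    return between / total

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 0.0], [5.2, -0.1]])
y = np.array([0, 0, 1, 1])
good = lda_index_1d(X, y, np.array([1.0, 0.0]))  # x-axis separates classes
bad = lda_index_1d(X, y, np.array([0.0, 1.0]))   # y-axis does not
```

A penalized variant such as the PDA index modifies the scatter matrices to stabilize this ratio when p is large relative to n.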
Citations: 2
An Assessment of Crop-Specific Land Cover Predictions Using High-Order Markov Chains and Deep Neural Networks
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1098
L. Sartore, C. Boryan, Andrew Dau, P. Willis
High-Order Markov Chains (HOMC) are conventional models, based on transition probabilities, that are used by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) to study crop-rotation patterns over time. However, HOMCs routinely suffer from sparsity and identifiability issues because the categorical data are represented as indicator (or dummy) variables. In fact, the dimension of the parametric space increases exponentially with the order of HOMCs required for analysis. While parsimonious representations reduce the number of parameters, as has been shown in the literature, they often result in less accurate predictions. Most parsimonious models are trained on big data structures, which can be compressed and efficiently processed using alternative algorithms. Consequently, a thorough evaluation and comparison of the prediction results obtained using a new HOMC algorithm and different types of Deep Neural Networks (DNN) across a range of agricultural conditions is warranted to determine which model is most appropriate for operational crop-specific land cover prediction of United States (US) agriculture. In this paper, six neural network models are applied to crop rotation data between 2011 and 2021 from six agriculturally intensive counties, which reflect the range of major crops grown and a variety of crop rotation patterns in the Midwest and southern US. The six counties include: Renville, North Dakota; Perkins, Nebraska; Hale, Texas; Livingston, Illinois; McLean, Illinois; and Shelby, Ohio. Results show the DNN models achieve higher overall prediction accuracy for all counties in 2021. The proposed DNN models allow for the ingestion of long time series data, and robustly achieve higher accuracy values than a new HOMC algorithm considered for predicting crop-specific land cover in the US.
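The transition-probability estimation underlying a Markov-chain crop model can be sketched for the first-order case (a toy illustration with made-up sequences, not the NASS implementation). A k-th order chain would condition on tuples of the k previous crops, which is exactly why the parameter count grows exponentially with the order.

```python
from collections import Counter

# Estimate first-order transition probabilities P(next crop | current crop)
# from hypothetical crop-rotation sequences.
sequences = [["corn", "soy", "corn", "soy"],
             ["corn", "soy", "soy", "corn"]]
counts = Counter()
totals = Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):  # consecutive (current, next) pairs
        counts[(a, b)] += 1
        totals[a] += 1
P = {(a, b): c / totals[a] for (a, b), c in counts.items()}
print(P)
```

With C crop categories, a first-order chain has on the order of C² transition parameters; an order-k chain has on the order of C^(k+1).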
Citations: 2
Binary Classification of Malignant Mesothelioma: A Comparative Study
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1090
Ted Si Yuan Cheng, Xiyue Liao
Malignant mesotheliomas are aggressive cancers that arise in the thin layer of tissue covering internal organs, most commonly the lining of the chest or abdomen. Though the cancer itself is rare and deadly, early diagnosis helps with treatment and improves outcomes. Mesothelioma is usually diagnosed in the later stages, and its symptoms are similar to those of other, more common conditions. As such, predicting and diagnosing mesothelioma early is essential to starting early treatment for a cancer that is often diagnosed too late. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Model training is conducted based on k-fold cross validation. Random forest is chosen as the optimal model. According to this model, age and duration of asbestos exposure are ranked as the most important features affecting diagnosis of mesothelioma.
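The selection criterion described above can be made concrete: recall (sensitivity) is TP / (TP + FN), so every false negative, the error the authors most want to avoid, directly lowers it. A minimal sketch with made-up labels:

```python
# recall = true positives / (true positives + false negatives)
def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]  # one false negative, one false positive
print(recall(y_true, y_pred))  # 3 of 4 true cases recovered -> 0.75
```

Note that the false positive at position 5 does not affect recall at all; a model comparison on recall alone is deliberately blind to that error type.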
Citations: 3
Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1099
Wenru Zhou, Miranda Kroehl, Maxene Meier, A. Kaizer
The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. The implementation of an interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data that assists in finding the “optimal” boundary. In this paper, we summarize fifteen boundaries that consist of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes across a range of values from 500 to 250,000 per arm to reflect different settings where A/B testing may be utilized. The choices of optimal boundaries are summarized using a proposed loss function that incorporates different weights for the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference in the variants, and the maximum sample size needed if the A/B test did not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs with recommendations for selecting the “optimal” design in each setting.
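One widely used example of an error spending function (the O'Brien-Fleming-type function; whether it is among the five the paper compares is not stated here) maps the information fraction t to the cumulative type I error allowed to be spent by that interim look. A minimal sketch, with the overall two-sided alpha fixed at 0.05:

```python
import math

def phi(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def obf_spend(t):
    """O'Brien-Fleming-type cumulative alpha spent at information
    fraction t in (0, 1], for overall two-sided alpha = 0.05."""
    z = 1.959963984540054  # Phi^{-1}(1 - 0.05/2)
    return 2.0 * (1.0 - phi(z / math.sqrt(t)))

# Very little alpha is spent early; the full 0.05 is reached at t = 1.
for t in (0.25, 0.5, 1.0):
    print(t, obf_spend(t))
```

The conservative early boundaries this produces are what make early stopping for a true difference hard at small information fractions, one of the trade-offs the loss function above is designed to weigh.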
Citations: 2
Computing Pseudolikelihood Estimators for Exponential-Family Random Graph Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1094
Christian S. Schmid, David R. Hunter
The reputation of the maximum pseudolikelihood estimator (MPLE) for Exponential Random Graph Models (ERGM) has undergone a drastic change over the past 30 years. While first receiving broad support, mainly due to its computational feasibility and the lack of alternatives, general opinions started to change with the introduction of approximate maximum likelihood estimator (MLE) methods that became practicable due to increasing computing power and the introduction of MCMC methods. Previous comparison studies appear to yield contradictory results regarding which of these two point estimators is preferable; however, there is consensus that the prevailing method of obtaining an MPLE's standard error from the inverse Hessian matrix generally underestimates standard errors. We propose replacing the inverse Hessian matrix by an approximation of the Godambe matrix that results in confidence intervals with appropriate coverage rates and that, in addition, enables examining for model degeneracy. Our results also provide empirical evidence for the asymptotic normality of the MPLE under certain conditions.
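The pseudolikelihood idea can be seen in its simplest special case (an assumed toy model, not the paper's setting): for an ERGM with only an edge term, the conditional probability of each dyad does not depend on the rest of the graph, the pseudolikelihood factors into independent Bernoulli terms, and the MPLE reduces to the empirical log-odds of a tie.

```python
import math

# MPLE for an edges-only ERGM on an undirected graph: the log-odds of
# the observed edge density (a toy special case for illustration).
def mple_edges_only(adj):
    n = len(adj)
    ties = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    dyads = n * (n - 1) // 2
    d = ties / dyads
    return math.log(d / (1 - d))

adj = [[0, 1, 1, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 0]]
print(mple_edges_only(adj))  # 2 ties among 6 dyads -> log((1/3)/(2/3))
```

With dependence terms (e.g. triangles), the conditional probabilities involve change statistics of the rest of the graph, and the MPLE is computed by logistic regression on those change statistics; the Hessian of that fit is where the underestimated standard errors discussed above come from.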
Citations: 3
Impact of Bias Correction of the Least Squares Estimation on Bootstrap Confidence Intervals for Bifurcating Autoregressive Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1092
T. Elbayoumi, S. Mostafa
The least squares (LS) estimator of the autoregressive coefficient in the bifurcating autoregressive (BAR) model was recently shown to suffer from substantial bias, especially for small to moderate samples. This study investigates the impact of the bias in the LS estimator on the behavior of various types of bootstrap confidence intervals for the autoregressive coefficient and introduces methods for constructing bias-corrected bootstrap confidence intervals. We first describe several bootstrap confidence interval procedures for the autoregressive coefficient of the BAR model and present their bias-corrected versions. The behavior of uncorrected and corrected confidence interval procedures is studied empirically through extensive Monte Carlo simulations and two real cell lineage data applications. The empirical results show that the bias in the LS estimator can have a significant negative impact on the behavior of bootstrap confidence intervals and that bias correction can significantly improve the performance of bootstrap confidence intervals in terms of coverage, width, and symmetry.
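The generic bootstrap bias-correction step can be sketched on an ordinary (non-bifurcating) AR(1) model, which is an assumption made purely for brevity; the paper's procedures for BAR models are more involved. The corrected estimate is 2·θ̂ minus the mean of the bootstrap replicates.

```python
import random

random.seed(1)

def ls_ar1(x):
    """Least squares estimate of phi in x_t = phi * x_{t-1} + e_t."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(a * a for a in x[:-1])
    return num / den

# Simulate a short AR(1) series (small n is where LS bias is visible).
phi, n = 0.5, 25
x = [random.gauss(0, 1)]
for _ in range(n - 1):
    x.append(phi * x[-1] + random.gauss(0, 1))
phi_hat = ls_ar1(x)

# Parametric bootstrap around the fitted model, then bias-correct:
# bias ~ mean(phi_star) - phi_hat, so phi_bc = 2*phi_hat - mean(phi_star).
boots = []
for _ in range(500):
    xb = [x[0]]
    for _ in range(n - 1):
        xb.append(phi_hat * xb[-1] + random.gauss(0, 1))
    boots.append(ls_ar1(xb))
phi_bc = 2 * phi_hat - sum(boots) / len(boots)
```

The same correction applied inside each bootstrap confidence interval procedure is what the paper evaluates for coverage, width, and symmetry.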
Citations: 0
Covid-19 Vaccine Efficacy: Accuracy Assessment, Comparison, and Caveats
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1089
Wenjiang J. Fu, Jieni Li, P. Scheet
Vaccine efficacy is a key index for evaluating vaccines in initial clinical trials during vaccine development. In particular, it plays a crucial role in authorizing Covid-19 vaccines. It has been reported that Covid-19 vaccine efficacy varies with a number of factors, including demographics of the population, time after vaccine administration, and virus strains. By examining clinical trial data from three Covid-19 vaccine studies, we find that the current approach of evaluating vaccines with a single overall efficacy does not provide the desired accuracy. It imposes no time frame during which a candidate vaccine is evaluated, and it is subject to misuse, resulting in potentially misleading information and interpretation. In particular, we illustrate with clinical trial data that the variability of vaccine efficacy is underestimated. We demonstrate that a new method may help to address these caveats. It leads to accurate estimation of the variation of efficacy, provides useful information to define a reasonable time frame to evaluate vaccines, and avoids misuse of vaccine efficacy and misleading information.
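The overall efficacy the abstract critiques is the standard ratio-of-attack-rates definition, VE = 1 − (attack rate in the vaccinated arm) / (attack rate in the placebo arm). A minimal sketch with made-up counts (not data from the three studies analyzed):

```python
# VE = 1 - (cases_v / n_v) / (cases_p / n_p); all numbers are hypothetical.
def vaccine_efficacy(cases_v, n_v, cases_p, n_p):
    attack_v = cases_v / n_v  # attack rate, vaccinated arm
    attack_p = cases_p / n_p  # attack rate, placebo arm
    return 1.0 - attack_v / attack_p

print(vaccine_efficacy(8, 20000, 80, 20000))  # 1 - 8/80 = 0.90
```

Because the case counts depend on when the counting window opens and closes, the same trial can yield different VE values under different time frames, which is the accuracy issue the paper examines.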
Citations: 0
The Second Competition on Spatial Statistics for Large Datasets
Pub Date : 2022-11-06 DOI: 10.6339/22-jds1076
Sameh Abdulah, Faten S. Alamri, Pratik Nag, Ying Sun, H. Ltaief, D. Keyes, M. Genton
In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has rapidly increased with the development of data collection technologies. As a result, classical statistical methods in spatial statistics are facing computational challenges. For example, the kriging predictor in geostatistics becomes prohibitive on traditional hardware architectures for large datasets, as it requires high computing power and a large memory footprint when dealing with large dense matrix operations. Over the years, various approximation methods have been proposed to address such computational issues; however, the community lacks a holistic process to assess their approximation efficiency. To provide a fair assessment, in 2021, we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report the results of estimation and prediction. Thanks to its widely acknowledged success and at the request of many participants, we organized the second competition in 2022 focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe in detail the data generation procedure and make the valuable datasets publicly available for a wider adoption. Then, we review the submitted methods from fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.
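The dense-matrix bottleneck mentioned above can be seen in a minimal simple-kriging sketch (assumed exponential covariance, zero mean, no nugget; this is a generic textbook predictor, not the competition's or ExaGeoStat's implementation): the linear solve against the n-by-n covariance matrix is the O(n³) step that becomes prohibitive for large n.

```python
import numpy as np

# Simple kriging in 1-D with covariance C(h) = exp(-|h| / range_).
def simple_krige(locs, vals, loc0, range_=1.0):
    d = np.abs(locs[:, None] - locs[None, :])
    C = np.exp(-d / range_)                      # n x n covariance matrix
    c0 = np.exp(-np.abs(locs - loc0) / range_)   # covariances to target site
    w = np.linalg.solve(C, c0)                   # O(n^3) dense solve
    return w @ vals                              # kriging prediction

locs = np.array([0.0, 1.0, 2.5])
vals = np.array([1.0, 2.0, 0.5])
# With no nugget, kriging interpolates: predicting at an observed
# location reproduces the observed value.
pred = simple_krige(locs, vals, 1.0)
print(pred)
```

The approximation methods the competition assesses (low rank, sparse, and related schemes) all aim to avoid forming and factorizing this dense matrix exactly.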
在过去的几十年里,随着数据收集技术的发展,许多研究领域的空间和时空数据集的规模迅速增加。因此,空间统计学中的经典统计方法面临着计算挑战。例如,地质统计学中的克里格预测器在大型数据集的传统硬件架构上变得令人望而却步,因为它在处理大型密集矩阵运算时需要高计算能力和内存占用。多年来,已经提出了各种近似方法来解决这些计算问题,然而,社区缺乏评估其近似效率的整体过程。为了提供公平的评估,2021年,我们组织了第一次大型数据集空间统计竞赛,由我们的ExaGeoStat软件生成,并要求参与者报告估计和预测结果。由于其获得了广泛认可的成功,并应许多参与者的要求,我们在2022年组织了第二次比赛,重点是对更复杂的空间和时空过程的预测,包括单变量非平稳空间过程、单变量平稳时空过程和双变量平稳空间过程。在本文中,我们详细描述了数据生成过程,并将有价值的数据集公开以供更广泛的采用。然后,我们审查了来自全球14支球队的提交方法,分析了比赛结果,并评估了每支球队的表现。
{"title":"The Second Competition on Spatial Statistics for Large Datasets","authors":"Sameh Abdulah, Faten S. Alamri, Pratik Nag, Ying Sun, H. Ltaief, D. Keyes, M. Genton","doi":"10.6339/22-jds1076","DOIUrl":"https://doi.org/10.6339/22-jds1076","abstract":"In the last few decades, the size of spatial and spatio-temporal datasets in many research areas has rapidly increased with the development of data collection technologies. As a result, classical statistical methods in spatial statistics are facing computational challenges. For example, the kriging predictor in geostatistics becomes prohibitive on traditional hardware architectures for large datasets as it requires high computing power and memory footprint when dealing with large dense matrix operations. Over the years, various approximation methods have been proposed to address such computational issues, however, the community lacks a holistic process to assess their approximation efficiency. To provide a fair assessment, in 2021, we organized the first competition on spatial statistics for large datasets, generated by our ExaGeoStat software, and asked participants to report the results of estimation and prediction. Thanks to its widely acknowledged success and at the request of many participants, we organized the second competition in 2022 focusing on predictions for more complex spatial and spatio-temporal processes, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. In this paper, we describe in detail the data generation procedure and make the valuable datasets publicly available for a wider adoption. Then, we review the submitted methods from fourteen teams worldwide, analyze the competition outcomes, and assess the performance of each team.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42045278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
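The cubic-cost kriging bottleneck described in the abstract above can be made concrete with a minimal simple-kriging sketch in plain NumPy. The exponential covariance (a Matérn kernel with smoothness 1/2) and all function names here are illustrative assumptions, not the ExaGeoStat implementation:

```python
import numpy as np

def exponential_cov(X1, X2, variance=1.0, range_=0.1):
    # Exponential covariance, i.e. a Matérn kernel with smoothness 1/2:
    # C(h) = variance * exp(-h / range_)
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def simple_kriging(X_obs, y_obs, X_new, nugget=1e-6):
    # Zero-mean simple kriging: y_hat = C_no @ C_oo^{-1} @ y_obs.
    # Factorizing the dense n x n matrix C_oo costs O(n^3) time and
    # O(n^2) memory -- the bottleneck the abstract refers to.
    n = len(X_obs)
    C_oo = exponential_cov(X_obs, X_obs) + nugget * np.eye(n)
    C_no = exponential_cov(X_new, X_obs)
    L = np.linalg.cholesky(C_oo)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    return C_no @ alpha
```

At observed locations the predictor interpolates the data up to the nugget; it is exactly this dense Cholesky step that the approximation methods assessed in the competition replace with scalable alternatives.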
Vecchia Approximations and Optimization for Multivariate Matérn Models 多元Matérn模型的Vecchia逼近与优化
Pub Date : 2022-10-17 DOI: 10.6339/22-jds1074
Youssef A. Fahmy, J. Guinness
We describe our implementation of the multivariate Matérn model for multivariate spatial datasets, using Vecchia’s approximation and a Fisher scoring optimization algorithm. We consider various parameterizations for the multivariate Matérn that have been proposed in the literature for ensuring model validity, as well as an unconstrained model. A strength of our study is that the code is tested on many real-world multivariate spatial datasets. We use it to study the effect of ordering and conditioning in Vecchia’s approximation and the restrictions imposed by the various parameterizations. We also consider a model in which co-located nuggets are correlated across components and find that forcing this cross-component nugget correlation to be zero can have a serious impact on the other model parameters, so we suggest allowing cross-component correlation in co-located nugget terms.
我们使用Vecchia近似和Fisher评分优化算法，描述了我们针对多元空间数据集的多元Matérn模型的实现。我们考虑了文献中为确保模型有效性而提出的多元Matérn模型的各种参数化方法，以及一个无约束模型。我们研究的一个优势是代码在许多真实世界的多元空间数据集上进行了测试。我们用它来研究Vecchia近似中排序和条件选择的影响，以及各种参数化所施加的限制。我们还考虑了一个同位块金在分量之间相关的模型，并发现强制这种跨分量块金相关性为零可能会对其他模型参数产生严重影响，因此我们建议在同位块金项中允许跨分量相关。
{"title":"Vecchia Approximations and Optimization for Multivariate Matérn Models","authors":"Youssef A. Fahmy, J. Guinness","doi":"10.6339/22-jds1074","DOIUrl":"https://doi.org/10.6339/22-jds1074","url":null,"abstract":"We describe our implementation of the multivariate Matérn model for multivariate spatial datasets, using Vecchia’s approximation and a Fisher scoring optimization algorithm. We consider various pararameterizations for the multivariate Matérn that have been proposed in the literature for ensuring model validity, as well as an unconstrained model. A strength of our study is that the code is tested on many real-world multivariate spatial datasets. We use it to study the effect of ordering and conditioning in Vecchia’s approximation and the restrictions imposed by the various parameterizations. We also consider a model in which co-located nuggets are correlated across components and find that forcing this cross-component nugget correlation to be zero can have a serious impact on the other model parameters, so we suggest allowing cross-component correlation in co-located nugget terms.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44042544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
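The core idea of Vecchia’s approximation studied above — replacing each full conditional in the chain-rule factorization of the Gaussian likelihood with conditioning on at most m previously ordered nearest neighbours — can be sketched as follows. This is a plain-NumPy illustration under a univariate exponential (Matérn, smoothness 1/2) kernel, not the authors’ implementation; all names are assumptions:

```python
import numpy as np

def matern_half_cov(X1, X2, variance=1.0, range_=0.2):
    # Matérn covariance with smoothness 1/2 (the exponential kernel).
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return variance * np.exp(-d / range_)

def vecchia_loglik(X, y, m=10, cov=matern_half_cov):
    # Vecchia: log p(y) = sum_i log p(y_i | y_1..y_{i-1}) is approximated
    # by letting each term condition only on the m nearest previously
    # ordered points, so every solve involves at most an m x m matrix.
    n = len(y)
    ll = 0.0
    for i in range(n):
        if i == 0:
            mu, var = 0.0, cov(X[:1], X[:1])[0, 0]
        else:
            d = np.linalg.norm(X[:i] - X[i], axis=1)
            nb = np.argsort(d)[:m]          # nearest previously ordered neighbours
            C_nn = cov(X[nb], X[nb])
            c_in = cov(X[i:i + 1], X[nb])[0]
            w = np.linalg.solve(C_nn, c_in)
            mu = w @ y[nb]                  # conditional mean
            var = cov(X[i:i + 1], X[i:i + 1])[0, 0] - w @ c_in  # conditional variance
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu) ** 2 / var)
    return ll
```

With m = n − 1 every conditional is exact and the sum recovers the full Gaussian log-likelihood; smaller m trades accuracy for speed, and the choice of ordering — one of the effects the paper studies — determines which conditioning sets are used.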