
Journal of Data Science (JDS): Latest Publications

Identification of Optimal Combined Moderators for Time to Relapse
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1107
Bang Wang, Yu Cheng, M. Levine
Identifying treatment effect modifiers (i.e., moderators) plays an essential role in improving treatment efficacy when substantial treatment heterogeneity exists. However, studies are often underpowered for detecting treatment effect modifiers, and exploratory analyses that examine one moderator per statistical model often yield spurious interactions. Therefore, in this work, we focus on creating an intuitive and readily implementable framework to facilitate the discovery of treatment effect modifiers and to make treatment recommendations for time-to-event outcomes. To minimize the impact of a misspecified main effect and avoid complex modeling, we construct the framework by matching the treated with the controls and modeling the conditional average treatment effect via regressing the difference in the observed outcomes of a matched pair on the averaged moderators. Inverse-probability-of-censoring weighting is used to handle censored observations. As matching is the foundation of the proposed methods, we explore different matching metrics and recommend the use of Mahalanobis distance when both continuous and categorical moderators are present. After matching, the proposed framework can be flexibly combined with popular variable selection and prediction methods such as linear regression, least absolute shrinkage and selection operator (Lasso), and random forest to create different combinations of potential moderators. The optimal combination is determined by the out-of-bag prediction error and the area under the receiver operating characteristic curve in making correct treatment recommendations. We compare the performance of various combined moderators through extensive simulations and the analysis of real trial data. Our approach can be easily implemented using existing R packages, resulting in a straightforward optimal combined moderator to make treatment recommendations.
Citations: 0
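The matched-pairs workflow above can be sketched in a few lines of R. The snippet below is only an illustration of the general idea (Mahalanobis-distance matching of treated to control units, then a Lasso regression of within-pair outcome differences on pair-averaged moderators), not the authors' implementation; the simulated variables are hypothetical, the outcome is a simple continuous variable, and the inverse-probability-of-censoring weights used in the paper for censored time-to-event outcomes are omitted for brevity.

```r
# Sketch of the matched-pairs moderator workflow (not the authors' implementation).
# Simulated data: X = moderators, trt = treatment indicator, y = observed outcome.
set.seed(1)
n <- 200; p <- 5
X   <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
trt <- rbinom(n, 1, 0.5)
y   <- 2 + X[, 1] + trt * (1 + X[, 2]) + rnorm(n)   # x2 acts as a moderator

# Mahalanobis distance between treated and control units on the moderators.
Sinv  <- solve(cov(X))
idx_t <- which(trt == 1); idx_c <- which(trt == 0)
D <- outer(idx_t, idx_c, Vectorize(function(i, j) {
  d <- X[i, ] - X[j, ]
  sqrt(drop(t(d) %*% Sinv %*% d))
}))

# Greedy 1:1 matching without replacement.
pairs <- matrix(NA_integer_, 0, 2)
Dwork <- D
while (nrow(pairs) < min(length(idx_t), length(idx_c))) {
  k <- arrayInd(which.min(Dwork), dim(Dwork))
  pairs <- rbind(pairs, c(idx_t[k[1]], idx_c[k[2]]))
  Dwork[k[1], ] <- Inf; Dwork[, k[2]] <- Inf
}

# Regress within-pair outcome differences on pair-averaged moderators (Lasso).
diff_y <- y[pairs[, 1]] - y[pairs[, 2]]
avg_X  <- (X[pairs[, 1], ] + X[pairs[, 2], ]) / 2
library(glmnet)
fit <- cv.glmnet(avg_X, diff_y)
coef(fit, s = "lambda.min")   # nonzero rows suggest candidate moderators
```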
Revisiting the Use of Generalized Least Squares in Time Series Regression Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1108
Yue Fang, S. Koreisha, Q. Shao
Linear regression models are widely used in empirical studies. When serial correlation is present in the residuals, generalized least squares (GLS) estimation is commonly used to improve estimation efficiency. This paper proposes an alternative estimator: the approximate generalized least squares estimator based on high-order AR(p) processes (GLS-AR). We show that GLS-AR estimators are asymptotically as efficient as GLS estimators when both the AR order p and the number of observations n increase together such that $p = o(n^{1/4})$ in the limit. The proposed GLS-AR estimators do not require identifying the residual serial autocorrelation structure and are more robust in finite samples than conventional FGLS-based tests. Finally, we illustrate the usefulness of the GLS-AR method by applying it to global warming data from 1850–2012.
Citations: 0
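The following R sketch illustrates the generic feasible-GLS idea behind an AR(p) approximation to the error process: fit OLS, estimate an AR(p) model for the residuals by Yule-Walker, quasi-difference the data with the estimated filter, and refit by least squares. It is an illustration on simulated data with an arbitrarily chosen lag order, not the paper's GLS-AR estimator or its asymptotic analysis.

```r
# Generic feasible-GLS sketch with an AR(p) approximation to the error process
# (illustrative only; see the paper for the GLS-AR estimator and its theory).
set.seed(1)
n <- 400
x <- rnorm(n)
e <- arima.sim(model = list(ar = c(0.6, -0.2)), n = n)   # serially correlated errors
y <- 1 + 2 * x + e

ols   <- lm(y ~ x)
p     <- 4                                   # high-order AR approximation (arbitrary here)
arfit <- ar(residuals(ols), aic = FALSE, order.max = p, method = "yule-walker")
phi   <- arfit$ar                            # estimated AR coefficients

# Quasi-difference: z_t = v_t - phi_1 v_{t-1} - ... - phi_p v_{t-p}
quasi_diff <- function(v, phi) {
  p <- length(phi)
  out <- v[(p + 1):length(v)]
  for (j in seq_len(p)) out <- out - phi[j] * v[(p + 1 - j):(length(v) - j)]
  out
}

y_star <- quasi_diff(y, phi)
x_star <- quasi_diff(x, phi)
c_star <- quasi_diff(rep(1, n), phi)         # transformed intercept column

gls_ar <- lm(y_star ~ 0 + c_star + x_star)   # approximate GLS estimates
coef(gls_ar)                                 # coefficient on x_star approximates the slope (about 2)
```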
Analyzing the Rainfall Pattern in Honduras Through Non-Homogeneous Hidden Markov Models
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1091
Gustavo Alexis Sabillón, D. Zuanetti
One of the major climatic interests of the last decades has been to understand and describe the rainfall patterns of specific areas of the world as functions of other climate covariates. We do so for the historical climate monitoring data from Tegucigalpa, Honduras, using non-homogeneous hidden Markov models (NHMMs), dynamic models commonly used to identify and predict heterogeneous regimes. For estimating the NHMM in an efficient and scalable way, we propose a stochastic Expectation-Maximization (EM) algorithm and a Bayesian method, and compare their performance on synthetic data. Although these methodologies have already been used to estimate several other statistical models, this is not the case for NHMMs, which are still widely fitted with the traditional EM algorithm. We observe that, under the tested conditions, the performance of the Bayesian and stochastic EM algorithms is similar, and we discuss their slight differences. Analyzing the Honduras rainfall data set, we identify three heterogeneous rainfall periods and select temperature and humidity as relevant covariates for explaining the dynamic relation among these periods.
Citations: 0
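The defining ingredient of a non-homogeneous HMM is a transition matrix that changes with covariates. The toy R function below shows one common way to encode this through a logistic link for a two-state (dry/wet) chain; the coefficients are made up, and this is not the stochastic EM or Bayesian estimation procedure developed in the paper.

```r
# Covariate-dependent transition probabilities for a 2-state chain
# (illustration of the non-homogeneous ingredient only; made-up coefficients).
transition_matrix <- function(temp, humidity) {
  # log-odds of moving to the "wet" state, given the current state
  eta_from_dry <- -1.0 + 0.03 * temp + 0.02 * humidity
  eta_from_wet <-  0.5 - 0.01 * temp + 0.04 * humidity
  p_dry_wet <- plogis(eta_from_dry)
  p_wet_wet <- plogis(eta_from_wet)
  matrix(c(1 - p_dry_wet, p_dry_wet,
           1 - p_wet_wet, p_wet_wet), nrow = 2, byrow = TRUE,
         dimnames = list(from = c("dry", "wet"), to = c("dry", "wet")))
}

transition_matrix(temp = 24, humidity = 70)  # transition matrix for a humid day
transition_matrix(temp = 30, humidity = 40)  # a hotter, drier day
```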
Network A/B Testing: Nonparametric Statistical Significance Test Based on Cluster-Level Permutation
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1112
Hongwei Shang, Xiaolin Shi, Bai Jiang
A/B testing is widely used for comparing two versions of a product and evaluating newly proposed product features. It is of great importance for decision-making and has become a gold standard in the IT industry. It is essentially a form of two-sample statistical hypothesis testing. The average treatment effect (ATE) and the corresponding p-value can be obtained under certain assumptions. One key assumption in traditional A/B testing is the stable unit treatment value assumption (SUTVA): there is no interference among different units, meaning that the observation on one unit is unaffected by the particular assignment of treatments to the other units. Nonetheless, interference is very common in social network settings, where people communicate and spread information to their neighbors, so the SUTVA assumption is violated. Analysis ignoring this network effect leads to biased estimation of the ATE. Most existing work focuses mainly on the design of experiments and on data analysis in order to produce estimators with good performance in terms of bias and variance; little attention has been paid to the calculation of the p-value. We work on the calculation of the p-value for the ATE estimator in network A/B tests. After a brief review of existing research on experimental design based on graph cluster randomization and on different ATE estimation methods, we propose a method for calculating the p-value based on a permutation test at the cluster level. The effectiveness of the method relative to individual-level permutation is validated in a simulation study mimicking realistic settings.
Citations: 1
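A cluster-level permutation test of the kind described above can be sketched directly in R: the observed difference in proportions is compared with the distribution obtained by re-randomizing treatment labels across clusters rather than across individual users. The snippet uses simulated clusters and a simple difference-in-means statistic, so it illustrates the idea rather than the authors' exact design or estimator.

```r
# Cluster-level permutation test for a difference in two proportions
# (generic sketch with simulated data; treatment is randomized by cluster).
set.seed(1)
n_clusters   <- 40
cluster_size <- 50
cluster <- rep(seq_len(n_clusters), each = cluster_size)
trt_by_cluster <- rbinom(n_clusters, 1, 0.5)          # cluster-level assignment
trt  <- trt_by_cluster[cluster]
conv <- rbinom(length(cluster), 1, ifelse(trt == 1, 0.12, 0.10))

ate_hat <- mean(conv[trt == 1]) - mean(conv[trt == 0])

# Permute the treatment labels at the cluster level, not the unit level.
perm_stat <- replicate(2000, {
  perm_trt <- sample(trt_by_cluster)[cluster]
  mean(conv[perm_trt == 1]) - mean(conv[perm_trt == 0])
})
p_value <- mean(abs(perm_stat) >= abs(ate_hat))
c(ATE = ate_hat, p = p_value)
```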
Editorial: Advances in Network Data Science
Pub Date : 2023-01-01 DOI: 10.6339/23-jds213edi
Yuguo Chen, Daniel Sewell, Panpan Zhang, Xuening Zhu
This special issue features nine articles on “Advances in Network Data Science”. Data science is an interdisciplinary research field utilizing scientific methods to facilitate knowledge and insights from structured and unstructured data across a broad range of domains. Network data are proliferating in many fields, and network data analysis has become a burgeoning research area in the data science community. Due to the heterogeneity and complexity of network data, classical statistical approaches to network model fitting face many challenges, especially for large-scale network data. Therefore, it becomes crucial to develop advanced methodological and computational tools to cope with the challenges associated with massive and complex network data analyses. This special issue highlights some recent studies in the area of network data analysis, showcasing a variety of contributions in statistical methodology, two real-world applications, a software package for network generation, and a survey on handling missing values in networks. Five articles are published in the Statistical Data Science Section. Wang and Resnick (2023) employed point processes to investigate the macroscopic growth dynamics of geographically concentrated regional networks. They discovered that during the startup phase, a self-exciting point process effectively modeled the growth process, and subsequently, the growth of links could be suitably described by a non-homogeneous Poisson process. Komolafe
Citations: 0
FROSTY: A High-Dimensional Scale-Free Bayesian Network Learning Method
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1097
Joshua Bang, Sang-Yun Oh
We propose a scalable Bayesian network learning algorithm based on sparse Cholesky decomposition. Our approach only requires observational data and a user-specified confidence level as inputs and can estimate networks with thousands of variables. The computational complexity of the proposed method is $O(p^{3})$ for a graph with p vertices. Extensive numerical experiments illustrate the usefulness of our method with promising results. In simulation, the initial step in our approach also improves an alternative Bayesian network structure estimation method that uses an undirected graph as an input.
Citations: 2
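The R sketch below illustrates the linear-algebra idea that, for Gaussian data with a fixed variable ordering, a Cholesky factor of the precision matrix encodes a directed structure: after rescaling the factor to unit diagonal, each variable is regressed on later-ordered variables, and large off-diagonal entries suggest directed edges. This conveys the flavor of Cholesky-based structure learning only; it is not the FROSTY algorithm, and the threshold and simulated data are arbitrary.

```r
# Cholesky-based reading of a Gaussian directed structure (illustration only;
# not the FROSTY algorithm).
set.seed(1)
n <- 500
x3 <- rnorm(n)
x2 <- 0.8 * x3 + rnorm(n)          # true edges: x3 -> x2 and x2 -> x1
x1 <- 0.7 * x2 + rnorm(n)
X  <- cbind(x1, x2, x3)

Omega <- solve(cov(X))             # estimated precision matrix
R <- chol(Omega)                   # upper triangular with t(R) %*% R = Omega
U <- R / diag(R)                   # rescale rows so the diagonal is 1

# With Omega = t(U) %*% diag(diag(R)^2) %*% U, each variable regresses on
# later-ordered variables: x_j = -sum_{k>j} U[j,k] * x_k + noise, so a large
# |U[j,k]| suggests a directed edge x_k -> x_j under this ordering.
edges <- abs(U) > 0.2 & upper.tri(U)
which(edges, arr.ind = TRUE)       # candidate edges: column index -> row index
```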
Assessment of Projection Pursuit Index for Classifying High Dimension Low Sample Size Data in R
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1096
Zhaoxing Wu, Chunming Zhang
Analyzing “large p small n” data is becoming increasingly paramount in a wide range of application fields. As a projection pursuit index, the Penalized Discriminant Analysis ($\mathrm{PDA}$) index, built upon the Linear Discriminant Analysis ($\mathrm{LDA}$) index, is devised in Lee and Cook (2010) to classify high-dimensional data with promising results. Yet, there is little information available about its performance compared with the popular Support Vector Machine ($\mathrm{SVM}$). This paper conducts extensive numerical studies to compare the performance of the $\mathrm{PDA}$ index with the $\mathrm{LDA}$ index and $\mathrm{SVM}$, demonstrating that the $\mathrm{PDA}$ index is robust to outliers and able to handle high-dimensional datasets with extremely small sample sizes, few important variables, and multiple classes. Analyses of several motivating real-world datasets reveal the practical advantages and limitations of individual methods, suggesting that the $\mathrm{PDA}$ index provides a useful alternative tool for classifying complex high-dimensional data. These new insights, along with the hands-on implementation of the $\mathrm{PDA}$ index functions in the R package classPP, help statisticians and data scientists make effective use of both sets of classification tools.
Citations: 2
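As a rough illustration of what a penalized LDA-type projection pursuit index measures, the R function below evaluates a candidate projection direction by comparing within-class and between-class scatter, with a simple ridge-style penalty added to the within-class term. The exact PDA index and its penalty are defined in Lee and Cook (2010) and implemented in the classPP package; the function here is only a hedged approximation for intuition, and the simulated data are made up.

```r
# LDA-type projection pursuit index with a ridge-style penalty (illustration only;
# consult Lee and Cook (2010) / the classPP package for the exact PDA index).
pp_index <- function(X, y, A, lambda = 0.1) {
  A <- as.matrix(A)
  A <- A / sqrt(sum(A^2))                       # unit-norm projection direction
  y <- as.factor(y)
  mu <- colMeans(X)
  W <- matrix(0, ncol(X), ncol(X))              # within-class scatter
  B <- matrix(0, ncol(X), ncol(X))              # between-class scatter
  for (g in levels(y)) {
    Xg  <- X[y == g, , drop = FALSE]
    mug <- colMeans(Xg)
    W <- W + crossprod(sweep(Xg, 2, mug))
    B <- B + nrow(Xg) * tcrossprod(mug - mu)
  }
  Wp <- (1 - lambda) * W + lambda * diag(ncol(X))   # ridge-style penalized scatter
  1 - det(t(A) %*% Wp %*% A) / det(t(A) %*% (Wp + B) %*% A)
}

# Example: a direction aligned with the class signal scores higher than a random one.
set.seed(1)
X <- matrix(rnorm(60 * 10), 60, 10)
y <- rep(1:3, each = 20)
X[y == 2, 1] <- X[y == 2, 1] + 2                 # class separation along variable 1
pp_index(X, y, A = c(1, rep(0, 9)))              # direction aligned with the signal
pp_index(X, y, A = rnorm(10))                    # arbitrary direction
```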
Binary Classification of Malignant Mesothelioma: A Comparative Study
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1090
Ted Si Yuan Cheng, Xiyue Liao
Malignant mesotheliomas are aggressive cancers that occur in the thin layer of tissue covering, most commonly, the linings of the chest or abdomen. Though the cancer itself is rare and deadly, early diagnosis helps with treatment and improves outcomes. Mesothelioma is usually diagnosed in the later stages, as its symptoms are similar to those of other, more common conditions. As such, predicting and diagnosing mesothelioma early is essential to starting early treatment for a cancer that is often diagnosed too late. The goal of this comprehensive empirical comparison is to determine the best-performing model based on recall (sensitivity). We particularly wish to avoid false negatives, as it is costly to diagnose a patient as healthy when they actually have cancer. Model training is conducted with k-fold cross-validation. Random forest is chosen as the optimal model, and according to this model, age and duration of asbestos exposure are ranked as the most important features affecting the diagnosis of mesothelioma.
Citations: 3
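A cross-validated estimate of recall for a random forest, the evaluation strategy described above, can be computed with a few lines of R. The data below are simulated stand-ins for the mesothelioma dataset, so the variables (age and exposure duration) and effect sizes are hypothetical.

```r
# k-fold cross-validated recall (sensitivity) for a random forest
# (sketch with simulated data standing in for the clinical dataset).
library(randomForest)
set.seed(1)
n <- 300
age  <- rnorm(n, 55, 10)
expo <- rexp(n, 1 / 20)                      # duration of asbestos exposure (years)
risk <- plogis(-6 + 0.05 * age + 0.08 * expo)
y    <- factor(rbinom(n, 1, risk), levels = c(0, 1))
dat  <- data.frame(y, age, expo)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
recall <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  rf    <- randomForest(y ~ ., data = train)
  pred  <- predict(rf, newdata = test)
  tp <- sum(pred == "1" & test$y == "1")     # true positives
  fn <- sum(pred == "0" & test$y == "1")     # false negatives
  recall[i] <- tp / (tp + fn)
}
mean(recall)    # average sensitivity across folds
```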
Building a Foundation for More Flexible A/B Testing: Applications of Interim Monitoring to Large Scale Data
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1099
Wenru Zhou, Miranda Kroehl, Maxene Meier, A. Kaizer
The use of error spending functions and stopping rules has become a powerful tool for conducting interim analyses. The implementation of an interim analysis is broadly desired not only in traditional clinical trials but also in A/B tests. Although many papers have summarized error spending approaches, limited work has been done in the context of large-scale data that assists in finding the “optimal” boundary. In this paper, we summarized fifteen boundaries that consist of five error spending functions that allow early termination for futility, difference, or both, as well as a fixed sample size design without interim monitoring. The simulation is based on a practical A/B testing problem comparing two independent proportions. We examine sample sizes across a range of values from 500 to 250,000 per arm to reflect different settings where A/B testing may be utilized. The choices of optimal boundaries are summarized using a proposed loss function that incorporates different weights for the expected sample size under a null experiment with no difference between variants, the expected sample size under an experiment with a difference in the variants, and the maximum sample size needed if the A/B test did not stop early at an interim analysis. The results are presented for simulation settings based on adequately powered, under-powered, and over-powered designs with recommendations for selecting the “optimal” design in each setting.
Citations: 2
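Two classical Lan-DeMets error spending functions, an O'Brien-Fleming-like form and a Pocock-like form, can be evaluated at a set of interim looks with the short R snippet below. This is standard group-sequential machinery shown for intuition; it is not the specific set of fifteen boundaries or the loss function compared in the paper, and the information fractions are arbitrary.

```r
# Classical Lan-DeMets error spending functions (illustration of the machinery;
# not the specific fifteen boundaries compared in the paper).
alpha <- 0.05
t <- c(0.25, 0.50, 0.75, 1.00)                    # interim information fractions

# O'Brien-Fleming-like spending: conservative early, spends most alpha late.
obf_spend <- 2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))

# Pocock-like spending: spends alpha more evenly across looks.
pocock_spend <- alpha * log(1 + (exp(1) - 1) * t)

# Cumulative alpha spent at each look, and the increment spent at that look.
data.frame(
  info_fraction     = t,
  obf_cumulative    = round(obf_spend, 4),
  obf_increment     = round(diff(c(0, obf_spend)), 4),
  pocock_cumulative = round(pocock_spend, 4),
  pocock_increment  = round(diff(c(0, pocock_spend)), 4)
)
```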
An Assessment of Crop-Specific Land Cover Predictions Using High-Order Markov Chains and Deep Neural Networks
Pub Date : 2023-01-01 DOI: 10.6339/23-jds1098
L. Sartore, C. Boryan, Andrew Dau, P. Willis
High-Order Markov Chains (HOMC) are conventional models, based on transition probabilities, that are used by the United States Department of Agriculture (USDA) National Agricultural Statistics Service (NASS) to study crop-rotation patterns over time. However, HOMCs routinely suffer from sparsity and identifiability issues because the categorical data are represented as indicator (or dummy) variables. In fact, the dimension of the parametric space increases exponentially with the order of the HOMC required for analysis. While parsimonious representations reduce the number of parameters, as has been shown in the literature, they often result in less accurate predictions. Most parsimonious models are trained on big data structures, which can be compressed and efficiently processed using alternative algorithms. Consequently, a thorough evaluation and comparison of the prediction results obtained using a new HOMC algorithm and different types of Deep Neural Networks (DNN) across a range of agricultural conditions is warranted to determine which model is most appropriate for operational crop-specific land cover prediction for United States (US) agriculture. In this paper, six neural network models are applied to crop rotation data between 2011 and 2021 from six agriculturally intensive counties, which reflect the range of major crops grown and a variety of crop rotation patterns in the Midwest and southern US. The six counties are Renville, North Dakota; Perkins, Nebraska; Hale, Texas; Livingston, Illinois; McLean, Illinois; and Shelby, Ohio. Results show the DNN models achieve higher overall prediction accuracy for all counties in 2021. The proposed DNN models allow for the ingestion of long time series data and robustly achieve higher accuracy than the new HOMC algorithm considered for predicting crop-specific land cover in the US.
Citations: 2
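The basic ingredient of a high-order Markov chain is a table of transition probabilities conditioned on the last several states, which can be estimated by counting. The R sketch below does this for a second-order chain on simulated crop sequences; the crop labels and rotation histories are made up, and this is neither the NASS production models nor the paper's new HOMC algorithm.

```r
# Second-order Markov transition probabilities estimated by counting
# (simulated crop sequences; not the NASS/HOMC algorithm from the paper).
set.seed(1)
crops <- c("corn", "soy", "wheat")
# Simulate rotation histories for 1,000 fields over 6 years.
fields <- replicate(1000, sample(crops, 6, replace = TRUE,
                                 prob = c(0.5, 0.35, 0.15)), simplify = FALSE)

# Collect (state at t-2, state at t-1, state at t) triples across all fields.
triples <- do.call(rbind, lapply(fields, function(s) {
  if (length(s) < 3) return(NULL)
  idx <- 3:length(s)
  data.frame(prev2 = s[idx - 2], prev1 = s[idx - 1], nxt = s[idx])
}))

# Transition counts and conditional probabilities P(next | prev2, prev1).
counts <- table(paste(triples$prev2, triples$prev1, sep = " -> "), triples$nxt)
probs  <- prop.table(counts, margin = 1)
round(probs, 3)   # each row is a distribution over next year's crop
```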