
Journal of data science: JDS - Latest Articles

Geostatistics for Large Datasets on Riemannian Manifolds: A Matrix-Free Approach
Pub Date : 2022-08-26 DOI: 10.6339/22-jds1075
M. Pereira, N. Desassis, D. Allard
Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as the planet Earth) and they often present nonstationary anisotropies. This paper proposes a generic approach to model Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing works on nonstationary GRFs and random fields on manifolds. This approach can be applied to any smooth compact manifold, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of the nonstationary geometric anisotropies as resulting from local deformations of the domain. We provide scalable algorithms for parameter estimation and for optimal prediction by kriging and simulation that are able to tackle very large grids. Stationary and nonstationary illustrations are provided.
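The key modeling device in this abstract is reading nonstationary geometric anisotropy as a local deformation of the domain encoded in a Riemannian metric. The sketch below illustrates only that ingredient for two points in the plane, using an exponential covariance evaluated in deformed coordinates; it is not the authors' matrix-free estimation or kriging algorithm, and the function names and parameter values are illustrative assumptions.

```python
# Illustrative sketch (not the authors' matrix-free method): geometric
# anisotropy expressed as a linear deformation of coordinates, i.e. rotate
# and rescale one axis so the correlation range differs by direction.
import numpy as np

def anisotropy_matrix(angle, ratio):
    """Metric-defining matrix: rotate, then shrink one axis by `ratio`."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    stretch = np.diag([1.0, 1.0 / ratio])
    return rot @ stretch @ rot.T

def anisotropic_exp_cov(x1, x2, angle, ratio, scale=1.0, sill=1.0):
    """Exponential covariance with distance measured in the deformed space."""
    a = anisotropy_matrix(angle, ratio)
    d = np.linalg.norm(a @ (np.asarray(x1) - np.asarray(x2)) / scale)
    return sill * np.exp(-d)

# Correlation between two points, with preferential direction at 45 degrees.
print(anisotropic_exp_cov([0.0, 0.0], [1.0, 1.0], angle=np.pi / 4, ratio=3.0))
```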
Citations: 6
EVIboost for the Estimation of Extreme Value Index Under Heterogeneous Extremes
Pub Date : 2022-05-28 DOI: 10.6339/22-jds1067
Jiaxi Wang, Yanxi Hou, Xingchi Li, Tiandong Wang
Modeling heterogeneity on heavy-tailed distributions under a regression framework is challenging, yet classical statistical methodologies usually place conditions on the distribution models to facilitate the learning procedure. However, these conditions will likely overlook the complex dependence structure between the heaviness of tails and the covariates. Moreover, data sparsity in tail regions makes the inference method less stable, leading to biased estimates for extreme-related quantities. This paper proposes a gradient boosting algorithm to estimate a functional extreme value index with heterogeneous extremes. Our proposed algorithm is a data-driven procedure capturing complex and dynamic structures in tail distributions. We also conduct extensive simulation studies to show the prediction accuracy of the proposed algorithm. In addition, we apply our method to a real-world data set to illustrate the state-dependent and time-varying properties of heavy-tail phenomena in the financial industry.
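For readers less familiar with the target quantity, the extreme value index (EVI) is classically estimated from the largest order statistics of a sample. The sketch below is the textbook Hill estimator, i.e. the constant-EVI baseline that EVIboost generalizes to covariate-dependent, heterogeneous extremes; it is not the paper's boosting algorithm.

```python
# Classical Hill estimator of the extreme value index (tail index) -- the
# constant-EVI baseline, shown only to fix ideas; not the EVIboost algorithm.
import numpy as np

def hill_estimator(x, k):
    """Hill estimate from the k largest order statistics of a positive sample."""
    x = np.sort(np.asarray(x, dtype=float))
    top = x[-(k + 1):]                           # k+1 largest observations
    return np.mean(np.log(top[1:]) - np.log(top[0]))

rng = np.random.default_rng(0)
sample = rng.pareto(a=2.0, size=5000) + 1.0      # classical Pareto, true EVI = 0.5
print(hill_estimator(sample, k=200))
```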
Citations: 0
Linear Algorithms for Robust and Scalable Nonparametric Multiclass Probability Estimation
Pub Date : 2022-05-25 DOI: 10.6339/22-jds1069
Liyun Zeng, Hao Helen Zhang
Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently, a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for K-class problems (Wu et al., 2010; Wang et al., 2019), where K is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in K. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in K. Though not the most efficient in computation, the OVA is found to have the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate their finite sample performance.
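To illustrate the One-vs-All scheme in generic terms, the sketch below fits K binary "class k versus rest" SVMs with scikit-learn and renormalizes their positive-class probabilities. This conveys only the plain OVA idea; the paper's weighted-SVM estimators, the baseline-learning variant, and the consistency results are not reproduced here.

```python
# Generic One-vs-All probability sketch: fit K binary "class k vs rest"
# classifiers and renormalize their positive-class scores across classes.
# This is not the paper's weighted-SVM estimator.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
classes = np.unique(y)

scores = []
for k in classes:
    clf = SVC(probability=True, random_state=0).fit(X, (y == k).astype(int))
    scores.append(clf.predict_proba(X)[:, 1])    # P(class k vs rest)

probs = np.column_stack(scores)
probs /= probs.sum(axis=1, keepdims=True)        # renormalize across classes
print(probs[:3].round(3))
```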
Citations: 0
Incorporating Interventions to an Extended SEIRD Model with Vaccination: Application to COVID-19 in Qatar
Pub Date : 2022-04-23 DOI: 10.6339/23-JDS1105
Elizabeth B Amona, R. Ghanam, E. Boone, Indranil Sahoo, L. Abu-Raddad
The COVID-19 outbreak of 2020 has required many governments to develop and adopt mathematical-statistical models of the pandemic for policy and planning purposes. To this end, this work provides a tutorial on building a compartmental model using Susceptible, Exposed, Infected, Recovered, Deaths and Vaccinated (SEIRDV) status through time. The proposed model uses interventions to quantify the impact of various government attempts made to slow the spread of the virus. Furthermore, a vaccination parameter is also incorporated in the model, which is inactive until the time the vaccine is deployed. A Bayesian framework is utilized to perform both parameter estimation and prediction. Predictions are made to determine when the peak Active Infections occur. We provide inferential frameworks for assessing the effects of government interventions on the dynamic progression of the pandemic, including the impact of vaccination. The proposed model also allows for quantification of number of excess deaths averted over the study period due to vaccination.
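A minimal sketch of one possible SEIRDV compartmental structure is given below, with a vaccination rate that stays inactive until a deployment time, mirroring the abstract's description. All parameter names, values, and compartment flows are illustrative assumptions rather than the paper's fitted Bayesian model or its intervention terms.

```python
# Minimal SEIRDV sketch: ODEs with a vaccination rate that is zero until the
# deployment time t_vax. Parameters and flows are illustrative assumptions,
# not the paper's fitted Bayesian model.
import numpy as np
from scipy.integrate import odeint

def seirdv(state, t, beta, sigma, gamma, mu, nu, t_vax):
    S, E, I, R, D, V = state
    N = S + E + I + R + V
    vax = nu if t >= t_vax else 0.0          # vaccination inactive before t_vax
    dS = -beta * S * I / N - vax * S
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - (gamma + mu) * I
    dR = gamma * I
    dD = mu * I
    dV = vax * S
    return [dS, dE, dI, dR, dD, dV]

t = np.linspace(0, 365, 366)
y0 = [999_000, 500, 500, 0, 0, 0]
sol = odeint(seirdv, y0, t, args=(0.3, 1 / 5, 1 / 10, 0.001, 0.005, 200.0))
print("peak active infections:", int(sol[:, 2].max()))
```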
Citations: 1
Causal Discovery for Observational Sciences Using Supervised Machine Learning
Pub Date : 2022-02-25 DOI: 10.6339/23-jds1088
A. H. Petersen, J. Ramsey, C. Ekstrøm, P. Spirtes
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct discovery methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error trade-off is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: a causal model with many missing causal relations entails overly strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data and we consider several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative and less sensitive towards sample size than existing procedures. We furthermore provide a real epidemiological data application. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive towards sample size and hence seems to better utilize the information available in small datasets.
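A toy version of the supervised-discovery idea is sketched below: simulate datasets from known two-variable structures (where the Markov equivalence classes reduce to "no edge" versus "some edge"), featurize each dataset, and train an off-the-shelf classifier to recover the class label. SLdisco itself works on larger graphs and maps to CPDAG equivalence classes; the sketch only conveys the data-to-label mapping.

```python
# Toy version of supervised causal discovery for two variables: simulate
# datasets with a known equivalence-class label ("no edge" vs "edge"),
# featurize them, and train a classifier to recover the label. SLdisco does
# this on larger graphs with CPDAG outputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def simulate(has_edge, n=200):
    x = rng.normal(size=n)
    y = 0.8 * x + rng.normal(size=n) if has_edge else rng.normal(size=n)
    # Feature vector: the lone correlation plus marginal variances.
    return np.array([np.corrcoef(x, y)[0, 1], x.var(), y.var()])

labels = rng.integers(0, 2, size=3000)
features = np.array([simulate(bool(l)) for l in labels])

clf = RandomForestClassifier(random_state=0).fit(features[:2500], labels[:2500])
print("held-out accuracy:", clf.score(features[2500:], labels[2500:]))
```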
Citations: 7
Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes.
Pub Date : 2022-01-01 Epub Date: 2022-11-03 DOI: 10.6339/22-jds1073
Arkajyoti Saha, Abhirup Datta, Sudipto Banerjee

Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternate approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
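The NNGP ingredient referenced above replaces dense Gaussian-process conditioning with conditioning on a small set of nearest (previously ordered) neighbors, which keeps every computation sparse. The sketch below computes those sparse conditional weights under an exponential covariance; it is not the paper's probit-cdf prediction algorithm, and the covariance and neighbor count are illustrative choices.

```python
# Sketch of the Nearest Neighbor Gaussian Process device: each ordered
# location is conditioned only on its m nearest preceding neighbors, giving
# sparse "kriging weight" rows instead of a dense n x n solve. Generic NNGP
# construction only, not the paper's probit-cdf prediction machinery.
import numpy as np

def exp_cov(a, b, phi=5.0):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * d)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(500, 2))
m = 10

weights = {}   # i -> (neighbor indices, conditional mean weights)
for i in range(1, len(coords)):
    past = coords[:i]
    dists = np.linalg.norm(past - coords[i], axis=1)
    nbrs = np.argsort(dists)[:m]                    # m nearest earlier points
    C_nn = exp_cov(coords[nbrs], coords[nbrs])
    c_in = exp_cov(coords[[i]], coords[nbrs])[0]
    weights[i] = (nbrs, np.linalg.solve(C_nn, c_in))

print("nonzeros per row at most:", max(len(w) for _, w in weights.values()))
```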

Citations: 0
Dynamic Classification of Plasmodium vivax Malaria Recurrence: An Application of Classifying Unknown Cause of Failure in Competing Risks.
Pub Date : 2022-01-01 Epub Date: 2021-12-09 DOI: 10.6339/21-jds1026
Yutong Liu, Feng-Chang Lin, Jessica T Lin, Quefeng Li

A standard competing risks set-up requires both time to event and cause of failure to be fully observable for all subjects. However, in application, the cause of failure may not always be observable, thus impeding the risk assessment. In some extreme cases, none of the causes of failure is observable. In the case of a recurrent episode of Plasmodium vivax malaria following treatment, the patient may have suffered a relapse from a previous infection or acquired a new infection from a mosquito bite. In this case, the time to relapse cannot be modeled when a competing risk, a new infection, is present. The efficacy of a treatment for preventing relapse from a previous infection may be underestimated when the true cause of infection cannot be classified. In this paper, we developed a novel method for classifying the latent cause of failure under a competing risks set-up, which uses not only time to event information but also transition likelihoods between covariates at the baseline and at the time of event occurrence. Our classifier shows superior performance under various scenarios in simulation experiments. The method was applied to Plasmodium vivax infection data to classify recurrent infections of malaria.
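At its simplest, classifying a latent cause of failure is a Bayes-rule computation over cause-specific event-time densities; the paper's classifier additionally exploits covariate transition likelihoods. The toy sketch below shows only the Bayes-rule step, with assumed (hypothetical) Weibull densities for relapse and new infection.

```python
# Toy illustration of the Bayes-rule step behind latent-cause classification:
# given cause-specific event-time densities and prior cause probabilities,
# compute the posterior probability of each cause at an observed time. The
# paper's method additionally uses covariate transition likelihoods, which
# are not modeled here; the Weibull densities are assumptions.
import numpy as np
from scipy.stats import weibull_min

def posterior_cause(t, prior_relapse=0.5):
    f_relapse = weibull_min.pdf(t, c=1.5, scale=45)    # relapses occur earlier
    f_newinf = weibull_min.pdf(t, c=1.5, scale=120)    # new infections later
    num = prior_relapse * f_relapse
    return num / (num + (1 - prior_relapse) * f_newinf)

for t in (30, 90, 180):
    print(f"day {t}: P(relapse | recurrence) = {posterior_cause(t):.2f}")
```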

Citations: 1
The Python Package open-crypto: A Cryptocurrency Data Collector
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1059
Steffen Günther, C. Fieberg, Thorsten Poddig
This paper introduces the package open-crypto for free-of-charge and systematic cryptocurrency data collection. The package supports several methods to request (1) static data, (2) real-time data and (3) historical data. It allows users to retrieve data from over 100 of the most popular and liquid exchanges world-wide. New exchanges can easily be added with the help of provided templates or updated with built-in functions from the project repository. The package is available on GitHub and the Python package index (PyPi). The data is stored in a relational SQL database and is therefore accessible from many different programming languages. We provide hands-on illustrations for each data type, explanations of the received data, and also demonstrate usability from R and Matlab. Academic research heavily relies on costly or confidential data; open data projects, however, are becoming increasingly important. This project is mainly motivated by the goal of contributing openly accessible software and free data for the cryptocurrency markets, improving transparency and reproducibility in research and other disciplines.
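Because the collected data are stored in a relational SQL database, one natural consumption pattern is to query them back into a data frame. The sketch below does this with pandas; the database path, table name, and column names are assumptions made for illustration only, and the package's actual schema and collection entry points should be taken from its README.

```python
# Hedged sketch of consuming data collected by open-crypto from its relational
# database with pandas. The database path, table name, and column names are
# assumptions for illustration only -- consult the package README for the
# actual schema and for the functions used to start a collection run.
import sqlite3
import pandas as pd

conn = sqlite3.connect("open_crypto.db")           # assumed SQLite backend/path
query = """
    SELECT exchange, currency_pair, time, open, high, low, close, volume
    FROM historic_rates                            -- assumed table name
    WHERE currency_pair = 'BTC-USD'
    ORDER BY time
"""
prices = pd.read_sql_query(query, conn, parse_dates=["time"])
print(prices.tail())
```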
Citations: 0
Multiresolution Broad Area Search: Monitoring Spatial Characteristics of Gapless Remote Sensing Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1072
Laura J. Wendelberger, J. Gray, Alyson G. Wilson, R. Houborg, B. Reich
Global earth monitoring aims to identify and characterize land cover change, such as construction, as it occurs. Remote sensing makes it possible to collect large amounts of data in near real-time over vast geographic areas and is becoming available in increasingly fine temporal and spatial resolution. Many methods have been developed for data from a single pixel, but monitoring pixel-wise spectral measurements over time neglects spatial relationships, which become more important as change manifests in a greater number of pixels in higher resolution imagery compared to moderate resolution. Building on our previous robust online Bayesian monitoring (roboBayes) algorithm, we propose monitoring multiresolution signals based on a wavelet decomposition to capture spatial change coherence on several scales to detect change sites. Monitoring only a subset of relevant signals reduces the computational burden. The decomposition relies on gapless data; we use 3 m Planet Fusion Monitoring data. Simulations demonstrate the superiority of the spatial signals in multiresolution roboBayes (MR roboBayes) for detecting subtle changes compared to pixel-wise roboBayes. We use MR roboBayes to detect construction changes in two regions with distinct land cover and seasonal characteristics: Jacksonville, FL (USA) and Dubai (UAE). It achieves site detection with less than two thirds of the monitoring processes required for pixel-wise roboBayes at the same resolution.
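The multiresolution signals described above come from a two-dimensional wavelet decomposition, so that change can be monitored coherently at several spatial scales. The sketch below shows only that decomposition step with PyWavelets on a synthetic image-difference field; the roboBayes online monitoring itself is not reproduced.

```python
# Minimal sketch of the multiresolution ingredient: a 2-D Haar wavelet
# decomposition of a synthetic image-difference field, giving coarse-to-fine
# detail coefficients that could each be monitored for change. The roboBayes
# online monitoring step is not reproduced here.
import numpy as np
import pywt

rng = np.random.default_rng(0)
diff = rng.normal(size=(64, 64))           # stand-in for an image difference
diff[20:36, 20:36] += 3.0                  # a 16 x 16 block of "change"

coeffs = pywt.wavedec2(diff, wavelet="haar", level=3)
approx, details = coeffs[0], coeffs[1:]

print("coarse approximation shape:", approx.shape)
for band, (ch, cv, cd) in enumerate(details, start=1):
    energy = np.sqrt(ch**2 + cv**2 + cd**2).mean()
    print(f"detail band {band}: shape {ch.shape}, mean energy {energy:.2f}")
```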
Citations: 1
Subpopulation Treatment Effect Pattern Plot (STEPP) Methods with R and Stata
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1060
S. Venturini, M. Bonetti, A. Lazar, B. Cole, Xin Victoria Wang, R. Gelber, Wai-Ki Yip
We introduce the stepp packages for R and Stata that implement the subpopulation treatment effect pattern plot (STEPP) method. STEPP is a nonparametric graphical tool aimed at examining possible heterogeneous treatment effects in subpopulations defined on a continuous covariate or composite score. More specifically, STEPP considers overlapping subpopulations defined with respect to a continuous covariate (or risk index) and it estimates a treatment effect for each subpopulation. It also produces confidence regions and tests for treatment effect heterogeneity among the subpopulations. The original method has been extended in different directions such as different survival contexts, outcome types, or more efficient procedures for identifying the overlapping subpopulations. In this paper, we also introduce a novel method to determine the number of subjects within the subpopulations by minimizing the variability of the sizes of the subpopulations generated by a specific parameter combination. We illustrate the packages using both synthetic data and publicly available data sets. The most intensive computations in R are implemented in Fortran, while the Stata version exploits the powerful Mata language.
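The stepp packages are written for R and Stata; to keep the code examples in this listing in one language, the sketch below gives a conceptual Python outline of the core STEPP construction: overlapping subpopulations formed by sliding a window along a continuous covariate, each with its own treatment-effect estimate. A difference in means stands in for the survival-based effects, confidence regions, and heterogeneity tests the packages actually provide.

```python
# Conceptual STEPP outline (the actual packages are for R and Stata): form
# overlapping subpopulations by sliding a window along a continuous covariate
# and estimate a treatment effect in each window. Difference in means is a
# stand-in for the packages' survival-based effects and inference.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
risk = rng.uniform(0, 1, n)                       # continuous covariate
treat = rng.integers(0, 2, n)
outcome = 1.0 + treat * (0.5 * risk) + rng.normal(scale=1.0, size=n)

window, step = 0.30, 0.10
for lo in np.arange(0.0, 1.0 - window + 1e-9, step):
    sub = (risk >= lo) & (risk < lo + window)     # overlapping subpopulation
    effect = outcome[sub & (treat == 1)].mean() - outcome[sub & (treat == 0)].mean()
    print(f"risk in [{lo:.1f}, {lo + window:.1f}): effect = {effect:.2f}")
```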
Citations: 0