
Latest Publications from Journal of data science : JDS

Accelerating Fixed-Point Algorithms in Statistics and Data Science: A State-of-Art Review
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1051
Bohao Tang, Nicholas C. Henderson, Ravi Varadhan
Fixed-point algorithms are popular in statistics and data science due to their simplicity, guaranteed convergence, and applicability to high-dimensional problems. Well-known examples include the expectation-maximization (EM) algorithm, majorization-minimization (MM), and gradient-based algorithms like gradient descent (GD) and proximal gradient descent. A characteristic weakness of these algorithms is their slow convergence. We discuss several state-of-the-art techniques for accelerating their convergence. We demonstrate and evaluate these techniques in terms of their efficiency and robustness in six distinct applications. Among the acceleration schemes, SQUAREM shows robust acceleration with a mean 18-fold speedup. DAAREM and restarted-Nesterov schemes also demonstrate consistently impressive accelerations. Thus, it is possible to accelerate the original fixed-point algorithm by using one of the SQUAREM, DAAREM, or restarted-Nesterov acceleration schemes. We describe implementation details and software packages to facilitate the application of the acceleration schemes. We also discuss strategies for selecting a particular acceleration scheme for a given problem.
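To make the extrapolation idea concrete, the following is a minimal Python sketch of a SQUAREM-style squared-extrapolation step wrapped around a generic fixed-point map. It is an illustration under simplifying assumptions (a single step-length scheme and a basic safeguard), not the authors' SQUAREM, DAAREM, or restarted-Nesterov implementations, which are available in dedicated software packages.

```python
import numpy as np

def squarem(fixed_point_map, x0, tol=1e-8, max_iter=500):
    """SQUAREM-style squared extrapolation around a fixed-point map x <- F(x)
    (minimal sketch with a basic step-length safeguard)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x1 = fixed_point_map(x)          # one fixed-point step
        x2 = fixed_point_map(x1)         # a second step
        r = x1 - x                       # first difference
        v = (x2 - x1) - r                # second difference (curvature of the iterates)
        if np.linalg.norm(v) < tol:
            return x2
        alpha = -np.linalg.norm(r) / np.linalg.norm(v)
        alpha = min(alpha, -1.0)         # safeguard: take at least a full step
        x_try = x - 2.0 * alpha * r + alpha ** 2 * v   # squared extrapolation
        x_new = fixed_point_map(x_try)   # stabilizing F-evaluation
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy example: the contraction x <- cos(x) has fixed point ~0.739.
print(squarem(np.cos, np.array([1.0])))
```

In practice the same wrapper would be placed around an EM or MM update of the model parameters rather than a toy contraction.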
Citations: 1
Editorial: Data Science Meets Social Sciences
Pub Date : 2022-01-01 DOI: 10.6339/22-jds203edi
E. Erosheva, Shahryar Minhas, Gongjun Xu, Ran Xu
{"title":"Editorial: Data Science Meets Social Sciences","authors":"E. Erosheva, Shahryar Minhas, Gongjun Xu, Ran Xu","doi":"10.6339/22-jds203edi","DOIUrl":"https://doi.org/10.6339/22-jds203edi","url":null,"abstract":"","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71320918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Propensity Score Modeling in Electronic Health Records with Time-to-Event Endpoints: Application to Kidney Transplantation
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1046
Jonathan W. Yu, D. Bandyopadhyay, Shu Yang, Le Kang, G. Gupta
For large observational studies lacking a control group (unlike randomized controlled trials, RCT), propensity scores (PS) are often the method of choice to account for pre-treatment confounding in baseline characteristics, and thereby avoid substantial bias in treatment estimation. A vast majority of PS techniques focus on average treatment effect estimation, without any clear consensus on how to account for confounders, especially in a multiple-treatment setting. Furthermore, for time-to-event outcomes, the analytical framework is further complicated in the presence of high censoring rates (sometimes due to non-susceptibility of study units to a disease), imbalance between treatment groups, and the clustered nature of the data (where survival outcomes appear in groups). Motivated by a right-censored kidney transplantation dataset derived from the United Network for Organ Sharing (UNOS), we investigate and compare two recent promising PS procedures, (a) the generalized boosted model (GBM) and (b) the covariate-balancing propensity score (CBPS), in an attempt to decouple the causal effects of treatments (here, study subgroups, such as hepatitis C virus (HCV) positive/negative donors and positive/negative recipients) on the time to death of kidney recipients due to kidney failure post-transplantation. For estimation, we employ a two-step procedure that addresses the various complexities observed in the UNOS database within a unified paradigm. First, to adjust for the large number of confounders across the multiple subgroups, we fit multinomial PS models via procedures (a) and (b). In the next stage, the estimated PS is incorporated into the likelihood of a semi-parametric cure rate Cox proportional hazard frailty model via inverse probability of treatment weighting, adjusted for multi-center clustering and excess censoring. Our data analysis reveals a more informative and superior performance of the full model in terms of treatment effect estimation, over sub-models that relax the various features of the event time dataset.
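As a rough illustration of the first stage only, the sketch below fits a multinomial propensity model with scikit-learn's gradient boosting (a stand-in for the GBM procedure; CBPS and the cure-rate Cox frailty second stage are not reproduced) and returns stabilized inverse-probability-of-treatment weights. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def stabilized_iptw(df, group_col, confounders):
    """Multinomial propensity scores via gradient boosting, returned as
    stabilized inverse-probability-of-treatment weights (minimal sketch)."""
    X, g = df[confounders], df[group_col]
    ps_model = GradientBoostingClassifier().fit(X, g)
    proba = ps_model.predict_proba(X)                      # n x K propensity matrix
    idx = g.map({c: i for i, c in enumerate(ps_model.classes_)}).to_numpy()
    p_received = proba[np.arange(len(df)), idx]            # P(observed group | X)
    p_marginal = g.value_counts(normalize=True).reindex(ps_model.classes_).to_numpy()[idx]
    return pd.Series(p_marginal / p_received, index=df.index, name="iptw")

# Hypothetical usage on a kidney-transplant-style table:
# weights = stabilized_iptw(patients, "hcv_group", ["age", "bmi", "diabetes"])
```

The returned weights would then enter a weighted survival fit, which is the role inverse probability of treatment weighting plays in the authors' second stage.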
Citations: 1
An Effective Tensor Regression with Latent Sparse Regularization
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1048
Ko-Shin Chen, Tingyang Xu, Guannan Liang, Qianqian Tong, Minghu Song, J. Bi
As data acquisition technologies advance, longitudinal analysis is facing challenges of exploring complex feature patterns from high-dimensional data and modeling potential temporally lagged effects of features on a response. We propose a tensor-based model to analyze multidimensional data. It simultaneously discovers patterns in features and reveals whether features observed at past time points have an impact on current outcomes. The model coefficient, a k-mode tensor, is decomposed into a summation of k tensors of the same dimension. We introduce a so-called latent F-1 norm that can be applied to the coefficient tensor to perform structured selection of features. Specifically, features will be selected along each mode of the tensor. The proposed model takes into account within-subject correlations by employing a tensor-based quadratic inference function. An asymptotic analysis shows that our model can identify the true support when the sample size approaches infinity. To solve the corresponding optimization problem, we develop a linearized block coordinate descent algorithm and prove its convergence for a fixed sample size. Computational results on synthetic datasets and real-life fMRI and EEG datasets demonstrate the superior performance of the proposed approach over existing techniques.
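The kind of structured selection such a mode-wise latent norm induces can be illustrated with a group soft-thresholding step that shrinks whole slices of a coefficient tensor along one mode, sketched below. This is only one ingredient of the linearized block coordinate descent algorithm; the function name and toy tensor are illustrative.

```python
import numpy as np

def modewise_group_shrink(B: np.ndarray, mode: int, lam: float) -> np.ndarray:
    """Group soft-thresholding of tensor B along one mode: each slice indexed by
    the given mode is shrunk as a group, so entire features along that mode can
    be zeroed out (a minimal sketch of mode-wise structured selection)."""
    Bm = np.moveaxis(B, mode, 0)                 # bring the target mode to the front
    out = np.empty_like(Bm)
    for j in range(Bm.shape[0]):
        slice_norm = np.linalg.norm(Bm[j])       # Frobenius norm of the slice
        scale = max(0.0, 1.0 - lam / slice_norm) if slice_norm > 0 else 0.0
        out[j] = scale * Bm[j]                   # proximal map of lam * ||slice||_F
    return np.moveaxis(out, 0, mode)

# Toy 3-mode coefficient tensor: shrinking along mode 0 can drop whole features.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 4, 3))
B_sparse = modewise_group_shrink(B, mode=0, lam=2.0)
print([bool(np.allclose(B_sparse[j], 0)) for j in range(5)])   # which features were dropped
```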
Citations: 1
Does Aging Make Us Grittier? Disentangling the Age and Generation Effect on Passion and Perseverance
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1041
S. Sanders, Nuwan Indika Millagaha Gedara, Bhavneet Walia, C. Boudreaux, M. Silverstein
Defined as perseverance and passion for long-term goals, grit represents an important psychological skill toward goal attainment in academic and less-stylized settings. An outstanding issue of primary importance is whether age affects grit, ceteris paribus. The 12-item Grit-O Scale and the 8-item Grit-S Scale, from which grit scores are calculated, have not existed for a long period of time. Therefore, Duckworth (2016, p. 37) states in her book, Grit: The Power of Passion and Perseverance, that “we need a different kind of study” to distinguish between rival explanations that either generational cohort or age is more important in explaining variation in grit across individuals. Despite this clear data constraint, we obtain a glimpse into the future in the present study by using a within- and between-generational-cohort age difference-in-difference approach. By specifying generation as a categorical variable and age-in-generation as a count variable in the same regression specifications, we are able to account for the effects of variation in age and generation simultaneously, while avoiding problems of multicollinearity that would hinder post-regression statistical inference. We find robust, significant evidence that the negative-parabolic shape of the grit-age profile is driven by generational variation rather than by age variation. Our findings suggest that, absent a grit-mindset intervention, individual-level grit may be persistent over time.
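A minimal sketch of the kind of specification described, with generation entered as a categorical factor and age-within-generation as a count in the same regression, might look as follows in Python with statsmodels. The column names and synthetic data are hypothetical, and the quadratic age term is included only to echo the parabolic profile discussed.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the survey: grit score, cohort label, and age within cohort.
rng = np.random.default_rng(1)
survey_df = pd.DataFrame({
    "generation": rng.choice(["Boomer", "GenX", "Millennial"], size=500),
    "age_in_gen": rng.integers(0, 15, size=500),
})
survey_df["grit"] = 3.5 + rng.normal(scale=0.5, size=500)

# Generation enters as a categorical factor and age-within-generation as a count,
# so cohort and aging effects are estimated in the same specification.
model = smf.ols("grit ~ C(generation) + age_in_gen + I(age_in_gen ** 2)",
                data=survey_df).fit()
print(model.summary())
```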
Citations: 2
Do Americans Think the Digital Economy is Fair? Using Supervised Learning to Explore Evaluations of Predictive Automation
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1053
E. Lehoucq
Predictive automation is a pervasive and archetypical example of the digital economy. Studying how Americans evaluate predictive automation is important because it affects corporate and state governance. However, relevant questions remain unanswered. We lack comparisons across use cases using a nationally representative sample. We also have yet to determine the key predictors of evaluations of predictive automation. This article uses the American Trends Panel’s 2018 wave ($n=4,594$) to study whether American adults think predictive automation is fair across four use cases: helping credit decisions, assisting parole decisions, filtering job applicants based on interview videos, and assessing job candidates based on resumes. Results from lasso regressions trained with 112 predictors reveal that people’s evaluations of predictive automation align with their views about social media, technology, and politics.
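A minimal sketch of an L1-penalized (lasso) logistic regression with many survey predictors is shown below using scikit-learn on synthetic stand-in data. It illustrates the selection role of the penalty, not the article's actual variables or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 112 survey predictors and a binary "is it fair?" response.
rng = np.random.default_rng(2)
X = rng.normal(size=(4594, 112))
y = rng.integers(0, 2, size=4594)

# The L1 penalty performs variable selection among the 112 predictors;
# cross-validation chooses the penalty strength.
lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, max_iter=2000),
)
lasso_logit.fit(X, y)
coefs = lasso_logit.named_steps["logisticregressioncv"].coef_.ravel()
print("predictors retained:", int(np.sum(coefs != 0)))
```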
Citations: 1
High-Dimensional Nonlinear Spatio-Temporal Filtering by Compressing Hierarchical Sparse Cholesky Factors
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1071
Anirban Chakraborty, M. Katzfuss
Spatio-temporal filtering is a common and challenging task in many environmental applications, where the evolution is often nonlinear and the dimension of the spatial state may be very high. We propose a scalable filtering approach based on a hierarchical sparse Cholesky representation of the filtering covariance matrix. At each time point, we compress the sparse Cholesky factor into a dense matrix with a small number of columns. After applying the evolution to each of these columns, we decompress to obtain a hierarchical sparse Cholesky factor of the forecast covariance, which can then be updated based on newly available data. We illustrate the Cholesky evolution via an equivalent representation in terms of spatial basis functions. We also demonstrate the advantage of our method in numerical comparisons, including using a high-dimensional and nonlinear Lorenz model.
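The forecast step, apply the evolution to each column of a covariance square root and reassemble, can be sketched generically as below. This toy Python example omits the paper's hierarchical sparse Cholesky compression and decompression and uses an arbitrary nonlinear evolution.

```python
import numpy as np

def forecast_sqrt(evolution, mean, sqrt_factor):
    """Propagate a covariance square root through a (possibly nonlinear) evolution
    by pushing each column through as a perturbation of the mean. A minimal sketch
    of the forecast step only; the hierarchical sparse Cholesky compression and
    decompression described in the article are not reproduced here."""
    forecast_mean = evolution(mean)
    cols = []
    for j in range(sqrt_factor.shape[1]):
        perturbed = evolution(mean + sqrt_factor[:, j])
        cols.append(perturbed - forecast_mean)        # linearized image of column j
    return forecast_mean, np.column_stack(cols)       # forecast mean and square root

# Toy nonlinear evolution on a 4-dimensional state with a 2-column square root.
evolve = lambda x: np.tanh(1.05 * x)
mu = np.zeros(4)
L = 0.1 * np.eye(4)[:, :2]
mu_f, L_f = forecast_sqrt(evolve, mu, L)
print(L_f @ L_f.T)                                    # implied forecast covariance
```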
Citations: 1
Supervised Spatial Regionalization using the Karhunen-Loève Expansion and Minimum Spanning Trees
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1077
Ranadeep Daw, C. Wikle
The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the famous ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève Expansion of the spatial process maintains the relationship between the realizations from multiple resolutions. Specifically, we use the Karhunen-Loève Expansion to define the regionalization error so that the ecological fallacy is minimized. The contiguous regionalization is done using the minimum spanning tree formed from the spatial locations and the data. Then, regionalization becomes similar to pruning edges from the minimum spanning tree. The methodology is demonstrated using simulated and real data examples.
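The tree-pruning step can be illustrated as follows: build a neighbour graph over the spatial locations, take its minimum spanning tree, and cut the heaviest edges to obtain contiguous regions. This Python sketch scores edges by a simple data dissimilarity rather than the Karhunen-Loève-based regionalization error used in the article, and the function is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial import cKDTree

def mst_regions(coords, values, n_regions, k=8):
    """Contiguous regionalization by pruning a minimum spanning tree built on a
    k-nearest-neighbour graph with edges weighted by data dissimilarity
    (minimal sketch; the article prunes against a KL-expansion-based error)."""
    n = coords.shape[0]
    _, nbr = cKDTree(coords).query(coords, k=k + 1)    # first neighbour is the site itself
    rows = np.repeat(np.arange(n), k)
    cols = nbr[:, 1:].ravel()
    weights = np.abs(values[rows] - values[cols]) + 1e-12
    graph = csr_matrix((weights, (rows, cols)), shape=(n, n))
    mst = minimum_spanning_tree(graph).tocoo()
    # Remove the (n_regions - 1) heaviest MST edges, then label the components.
    keep = np.argsort(mst.data)[: len(mst.data) - (n_regions - 1)]
    pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])), shape=(n, n))
    _, labels = connected_components(pruned, directed=False)
    return labels

coords = np.random.default_rng(3).uniform(size=(200, 2))
values = np.sin(4 * coords[:, 0]) + np.cos(4 * coords[:, 1])
print(np.bincount(mst_regions(coords, values, n_regions=5)))  # sites per region
```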
Citations: 2
On the Use of Deep Neural Networks for Large-Scale Spatial Prediction
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1070
Skyler Gray, Matthew J. Heaton, D. Bolintineanu, A. Olson
For spatial kriging (prediction), the Gaussian process (GP) has been the go-to tool of spatial statisticians for decades. However, the GP is plagued by computational intractability, rendering it infeasible for use on large spatial data sets. Neural networks (NNs), on the other hand, have arisen as a flexible and computationally feasible approach for capturing nonlinear relationships. To date, however, NNs have been used only sparingly for problems in spatial statistics, but their use is beginning to take root. In this work, we argue for equivalence between an NN and a GP and demonstrate how to implement NNs for kriging from large spatial data. We compare the computational efficacy and predictive power of NNs with that of GP approximations across a variety of big spatial Gaussian, non-Gaussian and binary data applications of up to size $n={10^{6}}$. Our results suggest that fully-connected NNs perform similarly to state-of-the-art, GP-approximated models for short-range predictions but can suffer for longer-range predictions.
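As a bare-bones illustration of NN-based spatial prediction, the sketch below fits a fully-connected network to coordinates and a synthetic response with scikit-learn. It shows only the coordinate-to-response regression, not the NN-GP equivalence argument or the comparisons with GP approximations.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic spatial surface: train a fully-connected NN on location -> response,
# then predict at new locations (a minimal sketch of NN-based "kriging").
rng = np.random.default_rng(4)
s_train = rng.uniform(size=(2000, 2))                      # observed locations
y_train = (np.sin(6 * s_train[:, 0]) * np.cos(6 * s_train[:, 1])
           + 0.1 * rng.normal(size=2000))                  # noisy spatial field

nn = MLPRegressor(hidden_layer_sizes=(64, 64, 64), activation="relu",
                  max_iter=3000, random_state=0)
nn.fit(s_train, y_train)

s_new = rng.uniform(size=(5, 2))                           # prediction locations
print(nn.predict(s_new))
```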
Citations: 2
Integration of Social Determinants of Health Data into the Largest, Not-for-Profit Health System in South Florida
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1063
Lourdes M. Rojas, Gregory L. Vincent, D. Parris
Social determinants of health (SDOH) are the conditions in which people are born, grow, work, and live. Although evidence suggests that SDOH influence a range of health outcomes, health systems lack the infrastructure to access and act upon this information. The purpose of this manuscript is to explain the methodology that a health system used to: 1) identify and integrate publicly available SDOH data into the health systems’ Data Warehouse, 2) integrate a HIPAA compliant geocoding software (via DeGAUSS), and 3) visualize data to inform SDOH projects (via Tableau). First, authors engaged key stakeholders across the health system to convey the implications of SDOH data for our patient population and identify variables of interest. As a result, fourteen publicly available data sets, accounting for >30,800 variables representing national, state, county, and census tract information over 2016–2019, were cleaned and integrated into our Data Warehouse. To pilot the data visualization, we created county and census tract level maps for our service areas and plotted common SDOH metrics (e.g., income, education, insurance status, etc.). This practical, methodological integration of SDOH data at a large health system demonstrated feasibility. Ultimately, we will repeat this process system wide to further understand the risk burden in our patient population and improve our prediction models – allowing us to become better partners with our community.
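A simplified sketch of the tract-level join is shown below: geocoded patient records carrying a census-tract GEOID (the kind of identifier a DeGAUSS geocoding run provides) are merged with a public tract-level SDOH table in pandas. All identifiers, column names, and values here are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: patients with an 11-digit census-tract GEOID attached by
# geocoding, and a public tract-level SDOH table keyed on the same GEOID.
patients = pd.DataFrame({
    "patient_id": [101, 102],
    "census_tract_geoid": ["12086001401", "12011030902"],
})
sdoh = pd.DataFrame({
    "census_tract_geoid": ["12086001401", "12011030902"],
    "median_household_income": [41250, 67800],
    "pct_uninsured": [22.4, 9.1],
})

# A left join keeps every patient and attaches tract-level SDOH measures,
# which can then be summarized or pushed to a dashboard for mapping.
enriched = patients.merge(sdoh, on="census_tract_geoid", how="left")
print(enriched)
```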
Citations: 0