
Journal of Data Science (JDS): Latest Publications

Active Data Science for Improving Clinical Risk Prediction
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1078
D. Ankerst, Matthias Neumair
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. The impact of such models often ends at publication rather than reaching the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on the collective experience of the Prostate Biopsy Collaborative Group (PBCG) over the past decade, this paper proposes four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first strategy is to actively design prospective data collection, monitoring, analysis, and validation of risk tools following the same standards as for clinical trials, in order to elevate the quality of training data. The second is to make risk tools and model formulas available online. User-friendly risk tools bring quantitative information to patients and their clinicians for improved knowledge-based decision-making, and, as past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools generalize to new populations. The third is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth is to accommodate systematic missing-data patterns across cohorts, in order to maximize statistical power in model training, and to accommodate missing information on the end-user side, in order to maximize utility for the public.
Citations: 2
‘You Draw It’: Implementation of Visually Fitted Trends with r2d3
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1083
Emily A. Robinson, Réka Howard, Susan Vanderplas
How do statistical regression results compare to intuitive, visually fitted results? Fitting lines by eye through a set of points has been explored since the 20th century. Common methods of fitting trends by eye involve maneuvering a string, black thread, or ruler until the fit is suitable, then drawing the line through the set of points. In 2015, the New York Times introduced an interactive feature, called ‘You Draw It,’ where readers are asked to input their own assumptions about various metrics and compare how these assumptions relate to reality. This research implements ‘You Draw It’, adapted from the New York Times, as a way to measure the patterns we see in data. In this paper, we describe the adaptation of eye-fitting, an old tool for graphical testing and evaluation, for use in modern web applications suitable for testing statistical graphics. We present an empirical evaluation of this testing method for linear regression, and briefly discuss an extension of this method to non-linear applications.
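The comparison the abstract describes can be sketched numerically: an ordinary least squares fit is the statistical benchmark against which an eye-fitted line is judged. The sketch below is illustrative only, not the authors' r2d3 implementation; the toy data points and the "drawn" slope and intercept are hypothetical.

```python
# Illustrative sketch, not the authors' r2d3 implementation: compare a
# hypothetical eye-fitted line against an ordinary least squares (OLS) fit.

def ols_fit(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def sse(xs, ys, slope, intercept):
    """Sum of squared residuals of a candidate line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]        # toy points, roughly linear

slope, intercept = ols_fit(xs, ys)
drawn_slope, drawn_intercept = 1.1, 0.8    # hypothetical eye-fitted input

# OLS minimizes squared error, so any drawn line does at least as badly:
print(sse(xs, ys, slope, intercept) <= sse(xs, ys, drawn_slope, drawn_intercept))
```

Comparing the residual sums of squares is one simple way to score how close a participant's drawn trend comes to the statistical fit.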
Citations: 2
Addressing the Impact of the COVID-19 Pandemic on Survival Outcomes in Randomized Phase III Oncology Trials
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1079
Jiabu Ye, Binbing Yu, H. Mann, A. Sabin, Z. Szíjgyártó, David Wright, P. Mukhopadhyay, C. Massacesi, S. Ghiorghiu, R. Iacona
We assessed the impact of the coronavirus disease 2019 (COVID-19) pandemic on the statistical analysis of time-to-event outcomes in late-phase oncology trials. Using a simulated case study that mimics a Phase III ongoing trial during the pandemic, we evaluated the impact of COVID-19-related deaths, time off-treatment and missed clinical visits due to the pandemic, on overall survival and/or progression-free survival in terms of test size (also referred to as Type 1 error rate or alpha level), power, and hazard ratio (HR) estimates. We found that COVID-19-related deaths would impact both size and power, and lead to biased HR estimates; the impact would be more severe if there was an imbalance in COVID-19-related deaths between the study arms. Approaches censoring COVID-19-related deaths may mitigate the impact on power and HR estimation, especially if study data cut-off was extended to recover censoring-related event loss. The impact of COVID-19-related time off-treatment would be modest for power, and moderate for size and HR estimation. Different rules of censoring cancer progression times result in a slight difference in the power for the analysis of progression-free survival. The simulations provided valuable information for determining whether clinical-trial modifications should be required for ongoing trials during the COVID-19 pandemic.
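The censoring rules the abstract compares can be made concrete with a minimal Kaplan-Meier estimator. This is an illustrative sketch, not the authors' simulation study: the event times and COVID-19 cause flags below are hypothetical, and the two rules shown are "count pandemic-related deaths as events" versus "censor them".

```python
# Illustrative sketch, not the authors' simulation: a minimal Kaplan-Meier
# estimator under two hypothetical rules for pandemic-related deaths.

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each time with at least one event (event=1)."""
    pairs = list(zip(times, events))
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        n_at_risk = sum(1 for tt, _ in pairs if tt >= t)
        deaths = sum(1 for tt, e in pairs if tt == t and e == 1)
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
    return curve

times = [3, 5, 6, 8, 10, 12]           # months to death/censoring (toy data)
covid = [0, 1, 0, 1, 0, 0]             # hypothetical COVID-19 cause flags
alive_at_cutoff = [0, 0, 0, 0, 0, 1]   # last subject administratively censored

# Rule A: every death is an event.  Rule B: censor COVID-19-related deaths.
rule_a = [0 if a else 1 for a in alive_at_cutoff]
rule_b = [0 if (a or c) else 1 for a, c in zip(alive_at_cutoff, covid)]

print(kaplan_meier(times, rule_a)[-1])  # lower final survival estimate
print(kaplan_meier(times, rule_b)[-1])  # censoring COVID deaths raises it
```

The gap between the two curves illustrates why the choice of censoring rule, and an imbalance of pandemic deaths between arms, can bias treatment-effect estimates.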
Citations: 0
Identifying Prerequisite Courses in Undergraduate Biology Using Machine Learning
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1068
Youngjin Lee
Many undergraduate students who matriculate in Science, Technology, Engineering and Mathematics (STEM) degree programs drop out or switch their major. Previous studies indicate that performance in prerequisite courses is an important predictor of attrition in STEM. This study analyzed demographic information, ACT/SAT scores, and performance in freshman-year courses to develop machine learning models predicting students' success in earning a bachelor's degree in biology. The predictive models based on Random Forest (RF) and Extreme Gradient Boosting (XGBoost) showed better performance in terms of AUC (Area Under the Curve), with more balanced sensitivity and specificity, than Logistic Regression (LR), K-Nearest Neighbor (KNN), and Neural Network (NN) models. An explainable machine learning approach called break-down was employed to identify important freshman-year courses that could have a larger impact on student success at the biology degree program and student levels. More important courses identified at the program level can help program coordinators prioritize their efforts in addressing student attrition, while courses identified at the student level can help academic advisors provide more personalized, data-driven guidance to students.
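AUC, the comparison metric the abstract reports, has a direct rank interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative one (ties counting half). A minimal sketch, illustrative only; the study presumably used standard ML libraries, and the scores and labels below are hypothetical.

```python
# Illustrative sketch: AUC as the probability that a positive case outranks
# a negative one (ties count half). Scores and labels are hypothetical.

def auc(scores, labels):
    """Area under the ROC curve via pairwise rank comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted graduation probabilities from a classifier:
scores = [0.9, 0.8, 0.35, 0.7, 0.2, 0.6]
earned_degree = [1, 0, 0, 1, 0, 1]
print(auc(scores, earned_degree))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the abstract pairs it with sensitivity/specificity balance when comparing models.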
Citations: 0
A Hybrid Monitoring Procedure for Detecting Abnormality with Application to Energy Consumption Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1039
Daeyoung Lim, Ming-Hui Chen, N. Ravishanker, Mark Bolduc, Brian McKeon, Stanley Nolan
The complexity of energy infrastructure at large institutions increasingly calls for data-driven monitoring of energy usage. This article presents a hybrid monitoring algorithm for detecting consumption surges using statistical hypothesis testing, leveraging the posterior distribution and its information about uncertainty to introduce randomness in the parameter estimates, while retaining the frequentist testing framework. This hybrid approach is designed to be asymptotically equivalent to the Neyman-Pearson test. We show via extensive simulation studies that the hybrid approach enjoys control over type-1 error rate even with finite sample sizes whereas the naive plug-in method tends to exceed the specified level, resulting in overpowered tests. The proposed method is applied to the natural gas usage data at the University of Connecticut.
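The size inflation of the naive plug-in approach is a general phenomenon that can be reproduced in a few lines. The sketch below is not the authors' hybrid posterior-based procedure: it only illustrates, via Monte Carlo, how a test that plugs in an estimated standard deviation while keeping a fixed normal critical value exceeds the nominal 5% level at small sample sizes.

```python
import random
import statistics

# Illustrative Monte Carlo sketch (not the authors' hybrid procedure):
# a z-test that plugs in the sample standard deviation but keeps the
# N(0,1) critical value inflates the type-1 error rate for small n.

def plug_in_reject_rate(n=5, reps=20000, crit=1.96, seed=1):
    """Fraction of null (mean-zero) samples the plug-in z-test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        z = statistics.fmean(x) / (statistics.stdev(x) / n ** 0.5)
        rejections += abs(z) > crit
    return rejections / reps

print(plug_in_reject_rate(n=5))             # well above the nominal 0.05
print(plug_in_reject_rate(n=100, reps=4000))  # close to 0.05 as n grows
```

The small-n inflation occurs because the plug-in statistic is t-distributed, not normal; accounting for the estimation uncertainty (as the paper's hybrid approach does via the posterior) restores control of the test size.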
Citations: 1
Data Science Applications and Implications in Legal Studies: A Perspective Through Topic Modelling
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1058
Jinzhe Tan, Huan Wan, Ping Yan, Zheng Hua Zhu
Law and legal studies have become an exciting new field for data science applications, while technological advancement also has profound implications for legal practice. For example, the legal industry has accumulated a rich body of high-quality texts, images and other digitised formats, which are ready to be further processed and analysed by data scientists. On the other hand, the increasing popularity of data science has been a genuine challenge to legal practitioners, regulators and even the general public, and has motivated a long-lasting debate in academia focusing on issues such as privacy protection and algorithmic discrimination. This paper collects 1236 journal articles involving both law and data science from the Web of Science platform to understand the patterns and trends of this interdisciplinary research field in terms of English journal publications. We find a clear trend of increasing publication volume over time and a strong presence of high-impact law and political science journals. We then use Latent Dirichlet Allocation (LDA) as a topic modelling method to classify the abstracts into four topics based on the coherence measure. The four topics identified confirm that both challenges and opportunities have been investigated in this interdisciplinary field and help offer directions for future research.
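LDA itself can be sketched compactly with a collapsed Gibbs sampler. This is an illustrative toy implementation, not the pipeline used in the paper (which presumably relied on a standard topic-modelling library), and the two-topic mini-corpus is hypothetical.

```python
import random
from collections import defaultdict

# Toy collapsed Gibbs sampler for LDA (illustrative; not the paper's pipeline).

def lda_gibbs(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Return per-token topic assignments and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * k for _ in docs]                 # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]    # topic-word counts
    nk = [0] * k                                  # topic totals
    z = []
    for di, doc in enumerate(docs):               # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):                        # resample each token's topic
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + vocab_size * beta) for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return z, nkw

# Hypothetical mini-corpus: two "law" abstracts, two "data science" abstracts.
docs = [["law", "court", "judge"], ["court", "privacy", "law"],
        ["data", "model", "algorithm"], ["model", "data", "privacy"]]
assignments, topic_words = lda_gibbs(docs, k=2)
```

On a real corpus, the number of topics (four in the paper) is typically chosen by comparing a coherence measure across candidate values of k.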
Citations: 0
Editorial: Large-Scale Spatial Data Science
Pub Date : 2022-01-01 DOI: 10.6339/22-jds204edi
Sameh Abdulah, S. Castruccio, M. Genton, Ying Sun
This special issue features eight articles on “Large-Scale Spatial Data Science.” Data science for complex and large-scale spatial and spatio-temporal data has become essential in many research fields, such as climate science and environmental applications. Due to the ever-increasing amounts of data collected, traditional statistical approaches tend to break down and computationally efficient methods and scalable algorithms that are suitable for large-scale spatial data have become crucial to cope with many challenges associated with big data. This special issue aims at highlighting some of the latest developments in the area of large-scale spatial data science. The research papers presented showcase advanced statistical methods and machine learning approaches for solving complex and large-scale problems arising from modern data science applications. Abdulah et al. (2022) reported the results of the second competition on spatial statistics for large datasets organized by the King Abdullah University of Science and Technology (KAUST). Very large datasets (up to 1 million in size) were generated with the ExaGeoStat software to design the competition on large-scale predictions in challenging settings, including univariate nonstationary spatial processes, univariate stationary space-time processes, and bivariate stationary spatial processes. The authors described the data generation process in detail in each setting and made these valuable datasets publicly available.
Citations: 0
Creating a Census County Assessment Tool for Visualizing Census Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1082
Izzy Youngs, R. Prevost, Christopher Dick
The 2020 Census County Assessment Tool was developed to assist decennial census data users in identifying deviations between expected and released census counts across population and housing indicators. The tool also offers contextual data for each county on factors that could have contributed to census collection issues, such as self-response rates and COVID-19 infection rates. The tool compiles this information into a downloadable report, includes additional local data sources relevant to the data collection process, and points users to experts for further assistance.
Citations: 2
Exploring Rural Shrink Smart Through Guided Discovery Dashboards
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1080
Denise Bradford, Susan Vanderplas
Many small and rural places are shrinking. Interactive dashboards are among the most common use cases for data visualization and provide context for exploratory data tools. In this paper, we use Iowa data to explore how dashboards can be used in small and rural areas to empower novice analysts to make data-driven decisions. Our framework suggests a number of research directions to better support shrinking small and rural places through interactive dashboard design, implementation, and use by the everyday analyst.
Citations: 2
Inference for Optimal Differential Privacy Procedures for Frequency Tables
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1044
Chengcheng Li, Na Wang, Gongjun Xu
When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without producing negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite-sample approximation of the null distributions. Moreover, the decaying-rate requirements on the privacy regime for the inference procedures to be valid are provided. We further consider common user practices, such as merging related or neighboring cells or integrating statistical information obtained across different data sources, and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing that replaces negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.
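To make the problem concrete, here is a minimal sketch of the baseline Laplace mechanism the abstract compares against, showing how noisy one-way frequency tables can acquire negative cell counts and how naive zero-clamping post-processing removes them at the cost of bias. The function name, random seed, example counts, and the sensitivity-1 assumption are illustrative choices of ours, not the paper's optimal mechanism.

```python
import numpy as np

rng = np.random.default_rng(7)

def laplace_mechanism(counts, epsilon):
    """Release a one-way frequency table under epsilon-DP by adding
    i.i.d. Laplace(sensitivity / epsilon) noise to each cell.
    Sensitivity is assumed to be 1 here (one individual changes one
    cell count by one); this is an illustrative baseline only."""
    scale = 1.0 / epsilon
    return counts + rng.laplace(loc=0.0, scale=scale, size=len(counts))

true_counts = np.array([120.0, 45.0, 8.0, 2.0, 0.0])  # hypothetical table
noisy = laplace_mechanism(true_counts, epsilon=0.5)

# Clamping negative cells to zero is valid DP post-processing,
# but it biases small counts upward -- the distortion that the
# paper's de-biased test statistic is designed to account for.
clamped = np.maximum(noisy, 0.0)
```

With small true counts (like the 2 and 0 above) and a strict privacy budget, negative noisy cells occur frequently, which is why a mechanism that avoids them by construction is attractive.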
Citations: 0