
Latest Publications: Journal of data science (JDS)

Do Americans Think the Digital Economy is Fair? Using Supervised Learning to Explore Evaluations of Predictive Automation
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1053
E. Lehoucq
Predictive automation is a pervasive and archetypical example of the digital economy. Studying how Americans evaluate predictive automation is important because it affects corporate and state governance. However, relevant questions remain unanswered. We lack comparisons across use cases using a nationally representative sample, and we have yet to determine the key predictors of evaluations of predictive automation. This article uses the American Trends Panel’s 2018 wave ($n=4,594$) to study whether American adults think predictive automation is fair across four use cases: helping credit decisions, assisting parole decisions, filtering job applicants based on interview videos, and assessing job candidates based on resumes. Results from lasso regressions trained with 112 predictors reveal that people’s evaluations of predictive automation align with their views about social media, technology, and politics.
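A minimal sketch of the kind of sparse variable selection the abstract describes: coordinate-descent lasso via soft-thresholding on synthetic data, where only two of many candidate predictors carry signal. The data and penalty level are invented for illustration, not taken from the paper.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding operator used in lasso coordinate descent."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for roughly standardized columns of X."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
true_beta = np.array([2.0, -1.5] + [0.0] * 8)      # only two informative predictors
y = X @ true_beta + 0.1 * rng.standard_normal(500)

beta = lasso_cd(X, y, lam=0.5)
selected = np.flatnonzero(np.abs(beta) > 1e-6)     # predictors the lasso retains
```

With 112 survey predictors, the same mechanism shrinks most coefficients exactly to zero, leaving the covariates most associated with fairness judgments.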
Citations: 1
High-Dimensional Nonlinear Spatio-Temporal Filtering by Compressing Hierarchical Sparse Cholesky Factors
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1071
Anirban Chakraborty, M. Katzfuss
Spatio-temporal filtering is a common and challenging task in many environmental applications, where the evolution is often nonlinear and the dimension of the spatial state may be very high. We propose a scalable filtering approach based on a hierarchical sparse Cholesky representation of the filtering covariance matrix. At each time point, we compress the sparse Cholesky factor into a dense matrix with a small number of columns. After applying the evolution to each of these columns, we decompress to obtain a hierarchical sparse Cholesky factor of the forecast covariance, which can then be updated based on newly available data. We illustrate the Cholesky evolution via an equivalent representation in terms of spatial basis functions. We also demonstrate the advantage of our method in numerical comparisons, including using a high-dimensional and nonlinear Lorenz model.
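The compress-evolve-decompress idea can be illustrated in miniature: pushing each Cholesky column of the filtering covariance through a linear evolution operator and re-forming the product yields the forecast covariance. This toy NumPy version uses a dense factor and an assumed linear operator, not the paper's hierarchical sparse representation.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = np.eye(d) + 0.3 * rng.standard_normal((d, d))   # assumed linear evolution operator
S = rng.standard_normal((d, d))
cov = S @ S.T + d * np.eye(d)                        # filtering covariance (SPD)

L = np.linalg.cholesky(cov)      # columns play the role of the compressed factor
L_fwd = A @ L                    # evolve each column through the dynamics
forecast_cov = L_fwd @ L_fwd.T   # decompress: recovers A cov A^T exactly

assert np.allclose(forecast_cov, A @ cov @ A.T)
```

The identity $(AL)(AL)^{\top} = A\,LL^{\top}A^{\top}$ is what makes column-wise evolution equivalent to propagating the full covariance; the paper's contribution is doing this scalably with hierarchical sparse factors and a small number of compressed columns.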
Citations: 1
Supervised Spatial Regionalization using the Karhunen-Loève Expansion and Minimum Spanning Trees
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1077
Ranadeep Daw, C. Wikle
The article presents a methodology for supervised regionalization of data on a spatial domain. Defining a spatial process at multiple scales leads to the famous ecological fallacy problem. Here, we use the ecological fallacy as the basis for a minimization criterion to obtain the intended regions. The Karhunen-Loève Expansion of the spatial process maintains the relationship between the realizations from multiple resolutions. Specifically, we use the Karhunen-Loève Expansion to define the regionalization error so that the ecological fallacy is minimized. The contiguous regionalization is done using the minimum spanning tree formed from the spatial locations and the data. Then, regionalization becomes similar to pruning edges from the minimum spanning tree. The methodology is demonstrated using simulated and real data examples.
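The MST-pruning step can be sketched with plain distances: build a minimum spanning tree over the spatial locations (Prim's algorithm), then cut the longest edges so the surviving components form contiguous regions. The paper's actual cut criterion minimizes a Karhunen-Loève-based regionalization error rather than edge length; this is only the combinatorial skeleton.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns MST edges as (dist, i, j)."""
    n = len(points)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges

def regions(points, k):
    """Cut the k-1 longest MST edges; label points by connected component."""
    kept = sorted(mst_edges(points))[: len(points) - k]   # drop the k-1 longest edges
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels = regions(pts, k=2)   # two well-separated clusters become two regions
```

Because the MST is a tree, removing any edge splits exactly one component in two, so regionalization reduces to choosing which edges to prune.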
Citations: 2
On the Use of Deep Neural Networks for Large-Scale Spatial Prediction
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1070
Skyler Gray, Matthew J. Heaton, D. Bolintineanu, A. Olson
For spatial kriging (prediction), the Gaussian process (GP) has been the go-to tool of spatial statisticians for decades. However, the GP is plagued by computational intractability, rendering it infeasible for use on large spatial data sets. Neural networks (NNs), on the other hand, have arisen as a flexible and computationally feasible approach for capturing nonlinear relationships. To date, however, NNs have scarcely been used for problems in spatial statistics, but their use is beginning to take root. In this work, we argue for an equivalence between a NN and a GP and demonstrate how to implement NNs for kriging from large spatial data. We compare the computational efficacy and predictive power of NNs with that of GP approximations across a variety of big spatial Gaussian, non-Gaussian, and binary data applications of up to size $n={10^{6}}$. Our results suggest that fully-connected NNs perform similarly to state-of-the-art, GP-approximated models for short-range predictions but can suffer for longer-range predictions.
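For context on the GP baseline being approximated, here is the exact kriging predictor in NumPy with a squared-exponential kernel. The linear solve costs $O(n^3)$ in the number of observations, which is the bottleneck motivating both GP approximations and NN surrogates; the kernel and hyperparameters are illustrative.

```python
import numpy as np

def sq_exp_kernel(a, b, length=1.0):
    """Squared-exponential covariance between coordinate arrays a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def krige(X_obs, y_obs, X_new, noise=1e-6, length=1.0):
    """GP predictive mean: K_*n (K_nn + noise I)^{-1} y."""
    K = sq_exp_kernel(X_obs, X_obs, length) + noise * np.eye(len(X_obs))
    K_star = sq_exp_kernel(X_new, X_obs, length)
    return K_star @ np.linalg.solve(K, y_obs)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

# With near-zero noise, predicting at an observed location reproduces the observation:
pred = krige(X, y, X[:1])
```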
Citations: 2
Integration of Social Determinants of Health Data into the Largest, Not-for-Profit Health System in South Florida
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1063
Lourdes M. Rojas, Gregory L. Vincent, D. Parris
Social determinants of health (SDOH) are the conditions in which people are born, grow, work, and live. Although evidence suggests that SDOH influence a range of health outcomes, health systems lack the infrastructure to access and act upon this information. The purpose of this manuscript is to explain the methodology that a health system used to: 1) identify and integrate publicly available SDOH data into the health system’s Data Warehouse, 2) integrate HIPAA-compliant geocoding software (via DeGAUSS), and 3) visualize data to inform SDOH projects (via Tableau). First, the authors engaged key stakeholders across the health system to convey the implications of SDOH data for our patient population and identify variables of interest. As a result, fourteen publicly available data sets, accounting for >30,800 variables representing national, state, county, and census tract information over 2016–2019, were cleaned and integrated into our Data Warehouse. To pilot the data visualization, we created county and census tract level maps for our service areas and plotted common SDOH metrics (e.g., income, education, insurance status, etc.). This practical, methodological integration of SDOH data at a large health system demonstrated feasibility. Ultimately, we will repeat this process system-wide to further understand the risk burden in our patient population and improve our prediction models, allowing us to become better partners with our community.
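A hypothetical illustration of the linkage step: once addresses are geocoded to census tracts, patient records can be joined to public tract-level SDOH metrics by tract ID. All field names and values below are invented for the sketch.

```python
# Invented example records; in practice the tract ID comes from geocoding
# and the metrics from public data sets loaded into the warehouse.
patients = [
    {"id": 1, "tract": "12086001001"},
    {"id": 2, "tract": "12086001002"},
]
sdoh_by_tract = {
    "12086001001": {"median_income": 41000, "pct_uninsured": 18.2},
    "12086001002": {"median_income": 67000, "pct_uninsured": 9.5},
}

# Left-join each patient to the SDOH metrics for their tract:
linked = [{**p, **sdoh_by_tract.get(p["tract"], {})} for p in patients]
```

Keying everything on the census tract ID is what lets heterogeneous public data sets (income, education, insurance coverage) attach to individual records without storing addresses in the analytic tables.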
Citations: 0
Active Data Science for Improving Clinical Risk Prediction
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1078
D. Ankerst, Matthias Neumair
Clinical risk prediction models are commonly developed in a post-hoc and passive fashion, capitalizing on convenient data from completed clinical trials or retrospective cohorts. Impacts of the models often end at their publication rather than with the patients. The field of clinical risk prediction is rapidly improving in a progressively more transparent data science era. Based on collective experience over the past decade by the Prostate Biopsy Collaborative Group (PBCG), this paper proposes the following four data science-driven strategies for improving clinical risk prediction to the benefit of clinical practice and research. The first proposed strategy is to actively design prospective data collection, monitoring, analysis and validation of risk tools following the same standards as for clinical trials in order to elevate the quality of training data. The second suggestion is to make risk tools and model formulas available online. User-friendly risk tools will bring quantitative information to patients and their clinicians for improved knowledge-based decision-making. As past experience testifies, online tools expedite independent validation, providing helpful information as to whether the tools are generalizable to new populations. The third proposal is to dynamically update and localize risk tools to adapt to changing demographic and clinical landscapes. The fourth strategy is to accommodate systematic missing data patterns across cohorts in order to maximize the statistical power in model training, as well as to accommodate missing information on the end-user side too, in order to maximize utility for the public.
Citations: 2
A Hybrid Monitoring Procedure for Detecting Abnormality with Application to Energy Consumption Data
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1039
Daeyoung Lim, Ming-Hui Chen, N. Ravishanker, Mark Bolduc, Brian McKeon, Stanley Nolan
The complexity of energy infrastructure at large institutions increasingly calls for data-driven monitoring of energy usage. This article presents a hybrid monitoring algorithm for detecting consumption surges using statistical hypothesis testing, leveraging the posterior distribution and its information about uncertainty to introduce randomness in the parameter estimates, while retaining the frequentist testing framework. This hybrid approach is designed to be asymptotically equivalent to the Neyman-Pearson test. We show via extensive simulation studies that the hybrid approach enjoys control over type-1 error rate even with finite sample sizes whereas the naive plug-in method tends to exceed the specified level, resulting in overpowered tests. The proposed method is applied to the natural gas usage data at the University of Connecticut.
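A toy version of the hybrid idea: instead of plugging a point estimate of the baseline mean into the test statistic, draw the mean from its posterior so that estimation uncertainty enters the test, while the decision rule stays a one-sided z-test. The prior, threshold, and data here are illustrative, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
baseline = rng.normal(100.0, 5.0, size=200)   # historical daily usage (synthetic)
sigma = 5.0                                   # assumed known noise sd

# Posterior for the mean under a flat prior: N(xbar, sigma^2 / n).
# Drawing from it, rather than using xbar directly, injects the
# posterior's uncertainty into the parameter estimate:
mu_draw = rng.normal(baseline.mean(), sigma / np.sqrt(len(baseline)))

def surge_flag(x_new, mu, sigma, z_crit=2.326):
    """One-sided z-test at (roughly) the 1% level."""
    return (x_new - mu) / sigma > z_crit

flag_low = surge_flag(105.0, mu_draw, sigma)    # ordinary day: not flagged
flag_high = surge_flag(130.0, mu_draw, sigma)   # clear surge: flagged
```

Averaging the test over posterior draws is what makes the procedure's size behave well in finite samples, in contrast to the naive plug-in test described above.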
Citations: 1
Identifying Prerequisite Courses in Undergraduate Biology Using Machine Learning
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1068
Youngjin Lee
Many undergraduate students who matriculated in Science, Technology, Engineering and Mathematics (STEM) degree programs drop out or switch their major. Previous studies indicate that performance in prerequisite courses is an important predictor of attrition in STEM. This study analyzed demographic information, ACT/SAT score, and performance of students in freshman year courses to develop machine learning models predicting their success in earning a bachelor’s degree in biology. The predictive models based on Random Forest (RF) and Extreme Gradient Boosting (XGBoost) showed better performance in terms of AUC (Area Under the Curve), with more balanced sensitivity and specificity, than Logistic Regression (LR), K-Nearest Neighbor (KNN), and Neural Network (NN) models. An explainable machine learning approach called break-down was employed to identify important freshman year courses that could have a larger impact on student success at the biology degree program and student levels. More important courses identified at the program level can help program coordinators to prioritize their effort in addressing student attrition, while more important courses identified at the student level can help academic advisors to provide more personalized, data-driven guidance to students.
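A minimal sketch of the evaluation metric used in the comparison above: AUC computed via the Mann-Whitney formulation, i.e., the probability that a randomly chosen positive case (graduate) outranks a randomly chosen negative case. The scores and labels are made up.

```python
def auc(scores, labels):
    """AUC via pairwise comparisons of positive vs. negative scores; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
# 8 of the 9 positive/negative pairs are correctly ordered: AUC = 8/9
```

Unlike accuracy at a fixed cutoff, this pairwise view is threshold-free, which is why AUC paired with sensitivity/specificity gives a fuller picture of the classifiers compared above.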
Citations: 0
‘You Draw It’: Implementation of Visually Fitted Trends with r2d3
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1083
Emily A. Robinson, Réka Howard, Susan Vanderplas
How do statistical regression results compare to intuitive, visually fitted results? Fitting lines by eye through a set of points has been explored since the 20th century. Common methods of fitting trends by eye involve maneuvering a string, black thread, or ruler until the fit is suitable, then drawing the line through the set of points. In 2015, the New York Times introduced an interactive feature, called ‘You Draw It,’ where readers are asked to input their own assumptions about various metrics and compare how these assumptions relate to reality. This research is intended to implement ‘You Draw It’, adapted from the New York Times, as a way to measure the patterns we see in data. In this paper, we describe the adaptation of an old tool for graphical testing and evaluation, eye-fitting, for use in modern web-applications suitable for testing statistical graphics. We present an empirical evaluation of this testing method for linear regression, and briefly discuss an extension of this method to non-linear applications.
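The comparison underlying a 'You Draw It' task can be sketched offline: a reader-drawn line (here an invented slope and intercept) is scored against the ordinary least-squares fit that serves as the statistical benchmark. The actual tool renders this interactively with r2d3; this NumPy version only mirrors the scoring step.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, 50)   # synthetic scatterplot data

slope, intercept = np.polyfit(x, y, 1)          # OLS benchmark line
drawn_slope, drawn_intercept = 1.8, 1.5         # hypothetical reader-drawn line

# Mean squared vertical gap between the drawn line and the OLS line,
# evaluated at the plotted x positions:
gap = np.mean(((slope - drawn_slope) * x + (intercept - drawn_intercept)) ** 2)
```

A small gap indicates the eyeballed trend tracks the statistical fit; aggregating such gaps across readers is one way to quantify how intuition matches regression.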
Citations: 2
Addressing the Impact of the COVID-19 Pandemic on Survival Outcomes in Randomized Phase III Oncology Trials
Pub Date : 2022-01-01 DOI: 10.6339/22-jds1079
Jiabu Ye, Binbing Yu, H. Mann, A. Sabin, Z. Szíjgyártó, David Wright, P. Mukhopadhyay, C. Massacesi, S. Ghiorghiu, R. Iacona
We assessed the impact of the coronavirus disease 2019 (COVID-19) pandemic on the statistical analysis of time-to-event outcomes in late-phase oncology trials. Using a simulated case study that mimics a Phase III ongoing trial during the pandemic, we evaluated the impact of COVID-19-related deaths, time off-treatment and missed clinical visits due to the pandemic, on overall survival and/or progression-free survival in terms of test size (also referred to as Type 1 error rate or alpha level), power, and hazard ratio (HR) estimates. We found that COVID-19-related deaths would impact both size and power, and lead to biased HR estimates; the impact would be more severe if there was an imbalance in COVID-19-related deaths between the study arms. Approaches censoring COVID-19-related deaths may mitigate the impact on power and HR estimation, especially if study data cut-off was extended to recover censoring-related event loss. The impact of COVID-19-related time off-treatment would be modest for power, and moderate for size and HR estimation. Different rules of censoring cancer progression times result in a slight difference in the power for the analysis of progression-free survival. The simulations provided valuable information for determining whether clinical-trial modifications should be required for ongoing trials during the COVID-19 pandemic.
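The censoring trade-off described above can be illustrated with a toy simulation. Under an exponential survival model, the hazard ratio (HR) can be estimated as the ratio of event rates (events divided by total follow-up time) between arms, so counting versus censoring imbalanced COVID-19-related deaths shifts the estimate directly. This is a from-scratch sketch with invented rates and sample sizes, not the authors' simulation code:

```python
import random

random.seed(1)

def simulate_arm(n, rate, covid_rate, horizon):
    """Return (cancer_deaths, covid_deaths, exposure) for one arm.

    Each patient has an exponential time to cancer death (rate) and an
    independent exponential time to COVID-19-related death (covid_rate);
    follow-up is administratively censored at `horizon`.
    """
    cancer = covid = 0
    exposure = 0.0
    for _ in range(n):
        t_cancer = random.expovariate(rate)
        t_covid = random.expovariate(covid_rate)
        t = min(t_cancer, t_covid, horizon)
        exposure += t
        if t == t_cancer:
            cancer += 1
        elif t == t_covid:
            covid += 1
    return cancer, covid, exposure

def hazard_ratio(arm_trt, arm_ctl, count_covid_as_event):
    """Exponential-model HR: ratio of (events / exposure) across arms."""
    def event_rate(arm):
        cancer, covid, exposure = arm
        events = cancer + (covid if count_covid_as_event else 0)
        return events / exposure
    return event_rate(arm_trt) / event_rate(arm_ctl)

# Hypothetical trial: true treatment HR for cancer death is 0.7, but
# COVID-19 mortality is imbalanced (higher in the treatment arm).
ctl = simulate_arm(2000, rate=0.10, covid_rate=0.01, horizon=5.0)
trt = simulate_arm(2000, rate=0.07, covid_rate=0.03, horizon=5.0)

# Counting the imbalanced COVID-19 deaths as events tends to pull the
# estimated HR away from the true treatment effect; censoring them
# recovers an estimate close to 0.7.
print("HR, COVID deaths as events:", round(hazard_ratio(trt, ctl, True), 2))
print("HR, COVID deaths censored: ", round(hazard_ratio(trt, ctl, False), 2))
```

The same mechanism underlies the abstract's finding that an imbalance in COVID-19-related deaths between arms biases the HR, and that censoring those deaths mitigates the bias.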
Citations: 0