N. Horton, Rohan Alexander, M. Parker, A. Piekut, Colin W. Rundel

Journal of Statistics and Data Science Education, 30(1), 207–208. Published 2022-09-02. DOI: 10.1080/26939169.2022.2141001. Citations: 6.
The Growing Importance of Reproducibility and Responsible Workflow in the Data Science and Statistics Curriculum
Modern statistics and data science use an iterative data analysis process to solve problems and extract meaning from data in a reproducible manner. Models such as the PPDAC (Problem, Plan, Data, Analysis, Conclusion) Cycle (n.d.) have been widely adopted in many secondary and post-secondary classrooms (see the review by Lee et al. 2022). The importance of the data analysis cycle has also been described and reinforced in guidelines for statistics majors (ASA Curriculum Guidelines 2014), undergraduate data science curricula (ACM 2021), and in data science courses and teaching materials (e.g., Wickham and Grolemund 2022). In 2018, the National Academies of Sciences, Engineering, and Medicine’s “Data Science for Undergraduates” consensus study (NASEM 2018) broadened the definition of the data analysis cycle by identifying the importance of workflow and reproducibility as a component of the data acumen needed in our graduates. The report noted that “documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability.” The report also tied issues of reproducibility and workflow to the ethical conduct of science.

The importance of others being able to have confidence in our findings is built into the foundations of statistics and data science (Parashar, Heroux, and Stodden 2022). For instance, in theoretical research, theorems are introduced along with their proofs. As statistics has come to rely more on computational methods, innovation is needed to ensure that the same level of rigor characterizes claims based on data and code. Efforts to foster reproducibility in science (NASEM 2019; Parashar, Heroux, and Stodden 2022) and to accelerate scientific discoveries (NASEM 2021) have highlighted the importance of reproducibility and workflow within the broader scientific process. Robust workflows matter.
For instance, COVID-19 counts in the United Kingdom were underestimated because the way that Excel was used resulted in dropped data (Kelion 2020). The economists Carmen Reinhart and Kenneth Rogoff made Excel errors that resulted in miscalculated GDP growth rates (Herndon, Ash, and Pollin 2014). Cut-and-paste errors are all too common in many workflows (Perkel 2022). The reproducibility crisis that was first identified in psychology is now known to afflict much of the physical and social sciences. Steps taken to address this crisis, including improved reporting of methods, code and data sharing, and version control, are increasingly com-
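The spreadsheet failures above share a root cause: silent, unchecked manual steps. A scripted workflow makes each step explicit and testable. As a minimal sketch (the data values and the expected-row threshold here are hypothetical, invented for illustration, not taken from the editorial), a data-loading step might assert the expected record count so that dropped rows fail loudly instead of silently, and compute summaries over all rows rather than a hand-selected spreadsheet range:

```python
import csv
import io

# Hypothetical case-count export standing in for a real data file.
# In the UK incident, rows were silently dropped because an older
# Excel sheet format capped the number of rows it could hold.
raw = io.StringIO(
    "date,cases\n"
    "2020-09-25,6874\n"
    "2020-09-26,6042\n"
    "2020-09-27,5693\n"
)

rows = list(csv.DictReader(raw))

# An explicit check replaces the silent assumptions of a manual
# cut-and-paste workflow: missing records raise an error.
expected_rows = 3
assert len(rows) == expected_rows, (
    f"expected {expected_rows} rows, got {len(rows)}"
)

# Summing over *all* rows programmatically avoids the kind of
# hand-selected-range error in the Reinhart-Rogoff spreadsheet.
total_cases = sum(int(r["cases"]) for r in rows)
print(total_cases)  # 18609
```

A script like this, kept under version control alongside the data it reads, documents the workflow in exactly the sense the NASEM report describes: it can be rerun, inspected, shared, and incrementally improved.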