N. Horton, Rohan Alexander, M. Parker, A. Piekut, Colin W. Rundel
The Growing Importance of Reproducibility and Responsible Workflow in the Data Science and Statistics Curriculum
Journal of Statistics and Data Science Education, vol. 30, pp. 207-208. Published 2022-09-02. DOI: https://doi.org/10.1080/26939169.2022.2141001
Citations: 6
Abstract
Modern statistics and data science use an iterative data analysis process to solve problems and extract meaning from data in a reproducible manner. Models such as the PPDAC (Problem, Plan, Data, Analysis, Conclusion) Cycle (n.d.) have been widely adopted in many secondary and post-secondary classrooms (see the review by Lee et al. 2022). The importance of the data analysis cycle has also been described and reinforced in guidelines for statistics majors (ASA Curriculum Guidelines 2014), undergraduate data science curricula (ACM 2021), and in data science courses and teaching materials (e.g., Wickham and Grolemund 2022). In 2018, the National Academies of Sciences, Engineering, and Medicine’s “Data Science for Undergraduates” consensus study (NASEM 2018) broadened the definition of the data analysis cycle by identifying the importance of workflow and reproducibility as a component of data acumen needed in our graduates. The report noted that “documenting, incrementally improving, sharing, and generalizing such workflows are an important part of data science practice owing to the team nature of data science and broader significance of scientific reproducibility and replicability.” The report also tied issues of reproducibility and workflow to the ethical conduct of science. The importance of others being able to have confidence in our findings is built into the foundations of statistics and data science (Parashar, Heroux, and Stodden 2022). For instance, in theoretical research, theorems are introduced along with their proof. As statistics has changed to rely more on computational methods, innovation is needed to ensure that the same level of rigor characterizes claims based on data and code.
Efforts to foster reproducibility in science (NASEM 2019; Parashar, Heroux, and Stodden 2022) and to accelerate scientific discoveries (NASEM 2021) have highlighted the importance of reproducibility and workflow within the broader scientific process. Robust workflows matter. For instance, COVID-19 counts in the United Kingdom were underestimated because the way that Excel was used resulted in dropped data (Kelion 2020). The economists Carmen Reinhart and Kenneth Rogoff made Excel errors that resulted in miscalculated GDP growth rates (Herndon, Ash, and Pollin 2014). Cut-and-paste errors are all too common in many workflows (Perkel 2022). The reproducibility crisis that was first identified in psychology is now known to afflict much of the physical and social sciences. Steps taken to address this crisis, including improved reporting of methods, code and data sharing, and version control, are increasingly com-