A Letter to the Journal of Statistics and Data Science Education — A Call for Review of “OkCupid Data for Introductory Statistics and Data Science Courses” by Albert Y. Kim and Adriana Escobedo-Land

Journal of Statistics and Data Science Education (IF 1.6, Q2, Education, Scientific Disciplines). Pub Date: 2021-08-06. DOI: 10.1080/26939169.2021.1930812

Tiffany Xiao, Yifan Ma

Volume 29 1, pages 214-215. Publication type: Journal Article.

Citations: 2

Abstract

A Letter to the Journal of Statistics and Data Science Education — A Call for Review of “OkCupid Data for Introductory Statistics and Data Science Courses” by Albert Y. Kim and Adriana Escobedo-Land

As Big Data continues to rise in popularity, so does the need for protection against potential misuses of data. We are a group of undergraduate Statistical and Data Science majors from Smith College who are actively engaged in ethical discussions concerning the use of data in our society. It can be challenging to predict future trends and technologies in data science that could cause concern. However, we believe that some essential protections and procedures should be in place to help prevent misuses of data. In particular, we are writing to you to address our concerns with the article “OkCupid Data for Introductory Statistics and Data Science Courses” by Albert Y. Kim and Adriana Escobedo-Land that was published in your journal (Kim and Escobedo-Land 2015). In light of ethical concerns surrounding the article, herein we describe the background of how the dataset was found to contain identifiable information. We communicated this to the authors, who correspondingly corrected the article.

In our opinion, there is no doubt that the dataset presented in the article holds pedagogical as well as research value. One aspect of its educational value is that the context of possible analyses could better drive students’ interest. Its research value lies in the self-reported nature of the data; such data are usually the private property of corporations and can be hard for university researchers to obtain. The dataset also retains pedagogical value as a case study for discussing the ethical implications of such data, and even for practicing anonymization skills.

However, we do believe that for the dataset to be used for pedagogical purposes, further anonymization was necessary. Datasets like this one could be better anonymized in the future by removing unimportant variables whose identification power is disproportionate to their research value. For example, in the OkCupid dataset associated with the paper, the time the data was collected could be removed, since it is not particularly essential but can be used for identification. Other sources of concern for this dataset are the variables that reveal geographical and temporal information on individuals. Another method could
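
The variable-removal idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the column names (`collected_at`, `location`) and the two sample rows are hypothetical and made up for illustration, and the helper `drop_columns` is an assumed name, not part of any library.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical names for columns judged to carry more identification
# power than research value (for example, collection time and geography).
HIGH_RISK_COLUMNS: Set[str] = {"collected_at", "location"}


def drop_columns(rows: Iterable[Dict], columns: Set[str]) -> List[Dict]:
    """Return copies of the rows with the listed columns removed."""
    return [
        {key: value for key, value in row.items() if key not in columns}
        for row in rows
    ]


# Made-up rows for illustration only; they are not taken from any real dataset.
profiles = [
    {"age": 27, "diet": "vegetarian", "location": "Example City",
     "collected_at": "2000-01-01 10:15"},
    {"age": 34, "diet": "anything", "location": "Sample Town",
     "collected_at": "2000-01-02 22:40"},
]

anonymized = drop_columns(profiles, HIGH_RISK_COLUMNS)
print(anonymized)
```

Dropping columns in this way addresses only the variables that are removed; the remaining fields could still identify someone in combination, so a sketch like this is a starting point for classroom discussion rather than a complete anonymization.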

Source journal