The 9 Pitfalls of Data Science: Latest Articles

Worshiping Computers
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0005
Gary Smith, Jay Cordes
Computer software, particularly deep neural networks and Monte Carlo simulations, is extremely useful for the specific tasks it has been designed to do, and it will get even better, much better. However, we should not assume that computers are smarter than us just because they can tell us the first 2000 digits of pi or show us a street map of every city in the world. One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They can’t pass simple tests like the Winograd Schema Challenge because they do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters.
Citations: 0
Doing Harm
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0010
Gary Smith, Jason L. Cordes
An unfortunate reality in the age of big data is Big Brother monitoring us incessantly. Big Brother is indeed watching, but it is big business as well as big government collecting detailed information about everything we do so that they can predict our actions and manipulate our behavior. Big business and big government monitor our credit cards, checking accounts, computers, and telephones, watch us on surveillance cameras, and purchase data from firms dedicated to finding out everything they can about each and every one of us. Good data scientists proceed cautiously, respectful of our rights and our privacy. The Golden Rule applies to data science: treat others as you would like to be treated.
Citations: 0
Putting Data Before Theory
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0003
G. Smith, J. Cordes
The traditional statistical analysis of data follows what has come to be known as the scientific method: collecting reliable data to test plausible theories. Data mining goes in the other direction, analyzing data without being motivated or encumbered by theories. The fundamental problem with data mining is simple: We think that data patterns are unusual and therefore meaningful. Patterns are, in fact, inevitable and therefore meaningless. This is why data mining is not usually knowledge discovery, but noise discovery. Finding correlations is easy. Good data scientists are not seduced by discovered patterns because they don’t put data before theory. They do not commit Texas Sharpshooter Fallacies or fall into the Feynman Trap.
Citations: 0
Confusing Correlation with Causation
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0008
G. Smith, J. Cordes
There is a hierarchy of predictive value that can be extracted from data. At the top of the hierarchy are causal relationships that can be confirmed with a randomized and controlled experiment or a natural experiment. Next best is to establish known or hypothesized relationships ahead of time and then test them and estimate their relative importance. One notch lower are associations found in historical data that are tested on fresh data after considering whether or not they make sense. At the bottom of the hierarchy, with little or no value, are associations found in historical data that are not confirmed by expert opinion or tested with fresh data. Data scientists who use a “correlations are enough” approach should remember that the more data and the more searches, the more likely it is that a discovered statistical relationship is coincidental and useless.
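The closing warning, that more searches make a coincidental relationship more likely, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the sample size of 20 observations, and the use of independent Gaussian noise are illustrative choices, not anything taken from the chapter.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_spurious_correlation(n_candidates, n_obs=20, seed=0):
    """Search n_candidates pure-noise series for the one that best 'explains' a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is independent of the target by construction.
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, round(best_spurious_correlation(k), 3))
```

Because every series is noise, any correlation the search reports is coincidental, yet the strongest one found can only grow as the number of candidates searched increases.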
Citations: 2
Being Surprised by Regression Toward the Mean
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0009
G. Smith, J. Cordes
We are predisposed to discount the role of luck in our lives—to believe that successes are earned and failures deserved. We misinterpret the temporary as permanent and invent theories to explain noise. We overreact when the unexpected happens, and are too quick to make the unexpected the new expected. The key to understanding regression toward the mean is to look behind the data—to recognize that when we see something remarkable, luck was most likely involved and, so, the underlying phenomenon is not as remarkable as it seems. Not to be confused with the gambler’s fallacy where good luck is followed by bad luck, regression toward the mean states that extremely good luck is generally followed by less extreme luck. The Sports Illustrated jinx is nothing more than this. Whenever there is uncertainty, people often make flawed decisions due to an insufficient appreciation of regression toward the mean.
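A minimal simulation can make the distinction from the gambler's fallacy concrete. The model below is an assumption made for illustration (each observed score is a fixed skill plus independent luck, both standard normal), and the function name and parameters are hypothetical.

```python
import random
from statistics import mean


def top_group_scores(n_players=10_000, top_fraction=0.05, seed=1):
    """Return the average first-period and second-period scores of the first-period top group."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_players)]
    # Observed performance = persistent skill + fresh luck in each period.
    score1 = [s + rng.gauss(0, 1) for s in skill]
    score2 = [s + rng.gauss(0, 1) for s in skill]
    n_top = int(n_players * top_fraction)
    top = sorted(range(n_players), key=lambda i: score1[i], reverse=True)[:n_top]
    return mean(score1[i] for i in top), mean(score2[i] for i in top)


if __name__ == "__main__":
    first, second = top_group_scores()
    print(f"top group, period 1: {first:.2f}")
    print(f"same group, period 2: {second:.2f}")
```

Under these assumptions the first-period stars still score above the population average of zero in the second period, just less far above it: their luck does not turn bad, it simply stops being extreme.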
Citations: 0
Case Study
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0011
G. Smith, J. Cordes
In the 1970s banks began selling mortgages to public and private mortgage funds that sell shares to investors. In the late 1990s and early 2000s, many mortgages to “subprime” borrowers with low credit ratings and modest income were approved because banks and mortgage brokers made money by making loans and then selling them, and didn’t care if borrowers defaulted. Matters were complicated by financial engineering and compliant rating agencies. The Great Recession resulted from many people falling into several of the pitfalls of data science. They fooled themselves, they worshipped mathematics, they used bad data, they tortured data, and they did harm.
Citations: 0
Fooling Yourself
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0007
Gary Smith, Jay Cordes
Clowns fool themselves. Scientists don’t. Often, the easiest way to differentiate a data clown from a data scientist is to track the successes and failures of their predictions. Clowns avoid experimentation out of fear that they’re wrong, or wait until after seeing the data before revealing what they expected to find. Scientists share their theories, question their assumptions, and seek opportunities to run experiments that will verify or contradict themselves. Most new theories are not correct and will not be supported by experiments (randomized controlled trials). Scientists are comfortable with that reality and don’t try to ram a square peg in a round hole by torturing data or mangling theories. They know that science works, but only if it’s done right.
Citations: 0
Torturing Data
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0006
Gary Smith, Jay Cordes
Researchers seeking fame and funding may be tempted to go on fishing expeditions (p-hacking) or to torture the data to find novel, provocative results that will be picked up by the popular media. Provocative findings are provocative because they are novel and unexpected, and they are often novel and unexpected because they are simply not true. The publication effect (or the file drawer effect) keeps the failures hidden and has created a replication crisis. Research that gets reported in the popular media is often wrong—which fools people and undermines the credibility of scientific research.
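How easily a fishing expedition produces a "finding" can be sketched with a simulation. Everything here is an illustrative assumption rather than the authors' method: each hypothetical project runs 20 independent two-group comparisons on pure noise, uses a z-test with known unit variance, and reports success if any comparison reaches p < 0.05.

```python
import random
from statistics import NormalDist, mean


def p_value_two_groups(rng, n):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1) distribution."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # The variance is known to be 1, so the difference in means has variance 2 / n.
    z = (mean(a) - mean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_projects_with_a_finding(n_projects=300, tests_per_project=20,
                                     n=30, alpha=0.05, seed=3):
    """Fraction of noise-only projects in which at least one test looks 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_projects):
        p_values = [p_value_two_groups(rng, n) for _ in range(tests_per_project)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_projects


if __name__ == "__main__":
    print(share_of_projects_with_a_finding())
```

With 20 independent tests at the 5% level, the chance of at least one false positive is 1 - 0.95^20, or about 0.64, so a majority of these noise-only projects would have something publishable to report if the failed tests stayed in the file drawer.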
Citations: 0
Worshiping Math
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0004
Gary Smith, Jay Cordes
Data-mining tools, in general, tend to be mathematically sophisticated, yet often make implausible assumptions. For example, analysts often assume a normal distribution and disregard the fat tails that warn of “black swans.” Too often, the assumptions are hidden in the math and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use explanatory variables that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise.
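The cost of assuming normality when the data have fat tails can be sketched numerically. As an assumption for illustration only, the data below come from a Student t distribution with 3 degrees of freedom (a common fat-tailed stand-in, not a distribution named in the chapter), and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def student_t(rng, df):
    """Draw one Student t variate as a standard normal over a scaled chi-square root."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
    return z / (chi2 / df) ** 0.5


def tail_comparison(n=50_000, df=3, k=5.0, seed=2):
    """Compare the observed share of k-sigma events with what a normal model predicts."""
    rng = random.Random(seed)
    data = [student_t(rng, df) for _ in range(n)]
    mu, sigma = mean(data), stdev(data)
    observed = sum(abs(x - mu) > k * sigma for x in data) / n
    # Under a normal distribution, the two-sided probability of a move beyond k sigma.
    predicted = 2 * (1 - NormalDist().cdf(k))
    return observed, predicted


if __name__ == "__main__":
    observed, predicted = tail_comparison()
    print(f"observed share beyond 5 sigma: {observed:.6f}")
    print(f"normal-model prediction:       {predicted:.8f}")
```

An analyst who fits a normal distribution to such data would treat five-sigma moves as practically impossible, while the simulated data contain them far more often than the normal model predicts.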
Citations: 0
Using Bad Data
Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0002
G. Smith, J. Cordes
Good data scientists consider the reliability of the data, while data clowns don’t. Reported data sometimes systematically misrepresent the phenomena being recorded. Data can be deformed by extremely unusual data—outliers—which can be clerical errors, measurement errors, or flukes that can mislead us if not corrected. Other times, outliers are valuable data. We should always consider if data are skewed by unusual events or distorted by unreported “silent data.” If something is surprising about top-ranked groups, look at the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make clowns out of anyone.
Citations: 1