From Association Analysis to Causal Discovery

MLSDA '13 | Pub Date: 2013-12-02 | DOI: 10.1145/2542652.2542659
Jiuyong Li
Citations: 0

Abstract

From Association Analysis to Causal Discovery
Association analysis is an important technique in data mining and has been widely used in many application areas [6]. However, associations found in data can be spurious and may not reflect the ‘true’ relationships between the variables under consideration. For example, hundreds or thousands of association rules can easily be generated even from a small data set, yet most of them could be spurious and have no practical meaning [11, 21, 22]. This has hindered the application of association analysis to real-world problems. While the development of efficient techniques for finding association patterns in data, especially in large data sets, is well underway, the problem of identifying non-spurious associations has become prominent.

Causal relationships reveal the real data-generating mechanisms and how the outcome would change when the cause is changed, so finding them has been the ultimate goal of many scientific explorations and social studies [18]. The gold standard for causal discovery is the randomised controlled trial (RCT) [4, 16]. However, an RCT is infeasible in many real-world applications, particularly in high-dimensional problems with a large number of potential causes. As part of the effort on causal discovery, statisticians have studied various methods for testing a hypothesised causal relationship using observational data [16]. However, these methods are designed to validate a known candidate causal relationship, and they too cannot deal with a large number of potential causes.

Although an association between two variables does not always imply causation, it is well known that associations are indicators of causal relationships [7]. A practical approach to causal discovery in large data sets could therefore start with association analysis of the data. The question is then whether we can filter out associations that have no causal indication. Note that this objective differs from mining interesting associations [9, 20] or discovering statistically sound associations [5, 21], because interestingness criteria do not measure causality and a test of statistical significance only determines whether an association is due to random chance. We have integrated two statis-
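
The point that a significant association need not be causal can be illustrated with a small sketch. The counts below are hypothetical and chosen only for illustration; the confounder Z, the table layout, and the helper names are assumptions, and this is not the method the abstract goes on to describe. Pooled over Z, the rule X -> Y looks strong and passes a chi-square test, yet within each stratum of Z the association disappears.

```python
# Hypothetical counts for illustration only (not taken from the paper).
# Each table is (n11, n10, n01, n00): counts of
# (X=1,Y=1), (X=1,Y=0), (X=0,Y=1), (X=0,Y=0) within one stratum of Z.
STRATA = {
    "Z=1": (64, 16, 16, 4),
    "Z=0": (4, 16, 16, 64),
}

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL = 3.841


def chi_square_2x2(n11, n10, n01, n00):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def confidence(n11, n10, n01, n00):
    """Confidence of the rule X -> Y, i.e. the estimate of P(Y=1 | X=1)."""
    return n11 / (n11 + n10)


# Pool the strata, which is what a miner sees when Z is ignored.
pooled = tuple(sum(cells) for cells in zip(*STRATA.values()))

for name, table in [("pooled", pooled), *STRATA.items()]:
    chi2 = chi_square_2x2(*table)
    verdict = "significant" if chi2 > CHI2_CRITICAL else "not significant"
    print(f"{name}: confidence={confidence(*table):.2f} chi2={chi2:.2f} ({verdict})")
```

In this constructed data set the pooled rule has confidence 0.68 against a base rate of 0.50 for Y, so it would pass both an interestingness filter and a significance test, while stratifying on Z shows that X carries no information about Y once Z is known.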