Towards exploratory hypothesis testing and analysis

2011 IEEE 27th International Conference on Data Engineering. Pub Date: 2011-04-11. DOI: 10.1109/ICDE.2011.5767907

Guimei Liu, Mengling Feng, Yue Wang, L. Wong, See-Kiong Ng, Tzia Liang Mah, E. Lee


Citations: 19

Abstract

Hypothesis testing is a well-established tool for scientific discovery. Conventional hypothesis testing is carried out in a hypothesis-driven manner. A scientist must first formulate a hypothesis based on his/her knowledge and experience, and then devise a variety of experiments to test it. Given the rapid growth of data, it has become virtually impossible for a person to manually inspect all the data to find all the interesting hypotheses for testing. In this paper, we propose and develop a data-driven system for automatic hypothesis testing and analysis. We define a hypothesis as a comparison between two or more sub-populations. We find sub-populations for comparison using frequent pattern mining techniques and then pair them up for statistical testing. We also generate additional information for further analysis of the hypotheses that are deemed significant. We conducted a set of experiments to show the efficiency of the proposed algorithms, and the usefulness of the generated hypotheses. The results show that our system can help users (1) identify significant hypotheses; (2) isolate the reasons behind significant hypotheses; and (3) find confounding factors that form Simpson's Paradoxes with discovered significant hypotheses.
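The pipeline described in the abstract (find sub-populations, pair them up, test each pair) can be illustrated with a minimal sketch. This is not the authors' algorithm: the brute-force pattern enumeration, the pooled two-proportion z-test, and every name below (`frequent_patterns`, `two_proportion_p_value`, `generate_hypotheses`, the toy `rows` data) are assumptions chosen for illustration. The sketch also omits any correction for multiple testing, which a real system that generates many hypotheses would need.

```python
from itertools import combinations
from math import erfc, sqrt


def frequent_patterns(rows, attrs, min_support, max_len=2):
    """Map each attribute-value conjunction with enough support to its row indices.

    Brute-force enumeration stands in for a real frequent pattern miner.
    """
    patterns = {}
    for k in range(1, max_len + 1):
        for combo in combinations(attrs, k):
            groups = {}
            for i, row in enumerate(rows):
                key = tuple((a, row[a]) for a in combo)
                groups.setdefault(key, []).append(i)
            for key, idx in groups.items():
                if len(idx) >= min_support:
                    patterns[key] = idx
    return patterns


def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (assumed test choice)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both groups are all-positive or all-negative
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))


def generate_hypotheses(rows, attrs, target, positive, min_support=5, alpha=0.05):
    """Pair sub-populations that differ in exactly one attribute value and test them."""
    patterns = frequent_patterns(rows, attrs, min_support)
    found = []
    for p1, p2 in combinations(patterns, 2):
        if len(p1) != len(p2):
            continue
        diff = [(a, b) for a, b in zip(p1, p2) if a != b]
        # Comparable only if the same attribute takes two different values.
        if len(diff) != 1 or diff[0][0][0] != diff[0][1][0]:
            continue
        n1, n2 = len(patterns[p1]), len(patterns[p2])
        x1 = sum(rows[i][target] == positive for i in patterns[p1])
        x2 = sum(rows[i][target] == positive for i in patterns[p2])
        p = two_proportion_p_value(x1, n1, x2, n2)
        if p < alpha:
            found.append((p1, p2, x1 / n1, x2 / n2, p))
    return found


# Toy data (made up): treatment A succeeds in 30 of 40 cases, B in 10 of 40.
rows = (
    [{"treatment": "A", "outcome": "success"}] * 30
    + [{"treatment": "A", "outcome": "failure"}] * 10
    + [{"treatment": "B", "outcome": "success"}] * 10
    + [{"treatment": "B", "outcome": "failure"}] * 30
)
found = generate_hypotheses(rows, ["treatment"], "outcome", "success")
```

On the toy data, `found` holds the single comparison between `treatment = A` and `treatment = B`, with the two success rates and the p-value. Restricting pairs to sub-populations that differ in exactly one attribute value keeps each comparison interpretable; repeating the same test within the strata of a further attribute is one way to look for the Simpson's Paradox cases the abstract mentions.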

