OASIS: An interpretable, finite-sample valid alternative to Pearson's $X^2$ for scientific discovery.
Authors: Tavor Z Baharav, David Tse, Julia Salzman
DOI: 10.1101/2023.03.16.533008 (https://doi.org/10.1101/2023.03.16.533008)
Published: 2023-11-03
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634974/pdf/
Abstract
Contingency tables are ubiquitous in quantitative research and data-science applications. Existing statistical tests, however, fall short: none provides robust, computationally efficient inference together with control of type I error. In this work, motivated by recent advances in reference-free inference for genomics, we propose a family of tests for contingency tables called OASIS. OASIS uses a linear test statistic, which allows closed-form p-value bounds to be computed and yields a standard asymptotic normality result. For rejected hypotheses, OASIS provides a partition of the table, lending interpretability to its rejection of the null. In genomic applications, OASIS performs reference-free and metadata-free variant detection in SARS-CoV-2 and Mycobacterium tuberculosis, and shows strong performance on single-cell RNA sequencing, all tasks for which no existing solution is available.
OASIS: An interpretable, finite-sample valid alternative to Pearson's $X^2$ for scientific discovery.
Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none is simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference (1), we develop OASIS (Optimized Adaptive Statistic for Inferring Structure), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, which yields closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. The same method, based on OASIS significance calls, detects SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which cannot be achieved with current approaches. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature of genomic data such as single-cell RNA sequencing: under accepted noise models, OASIS still provides good control of the false discovery rate, while Pearson's $X^2$ test consistently rejects the null. Additionally, we show on synthetic data that OASIS is more powerful than Pearson's $X^2$ test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
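To make the idea of a linear test statistic with a closed-form p-value bound concrete, the following is a minimal sketch and not the authors' implementation. It assumes a row embedding `f` with values in [0, 1] and column weights `c`, both fixed independently of the table being tested. It also assumes the null hypothesis that every observation, in every column, is drawn independently from the same row distribution. The function name, the argument names, and the exact constants are illustrative assumptions. The bound is obtained by applying Hoeffding's inequality directly to the statistic, and the paper's statistic and constants may differ.

```python
import math


def linear_stat_pvalue_bound(table, f, c):
    """Hypothetical helper: linear statistic S and a Hoeffding p-value bound.

    table: list of I rows, each with J non-negative counts.
    f:     I values in [0, 1], one per row (assumed fixed before seeing the table).
    c:     J real weights, one per column (assumed fixed before seeing the table).
    """
    if any(not 0.0 <= fi <= 1.0 for fi in f):
        raise ValueError("f must take values in [0, 1] for the Hoeffding step")
    n_rows, n_cols = len(table), len(table[0])

    # Column totals n_j and grand total M.
    n = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    if min(n) <= 0:
        raise ValueError("every column needs at least one observation")
    M = sum(n)

    # Mean of f over the observations in each column, and over the pooled table.
    col_mean = [
        sum(f[i] * table[i][j] for i in range(n_rows)) / n[j]
        for j in range(n_cols)
    ]
    pooled_mean = sum(f[i] * sum(table[i]) for i in range(n_rows)) / M

    # S = sum_j c_j * sqrt(n_j) * (col_mean_j - pooled_mean): linear in the counts.
    S = sum(
        c[j] * math.sqrt(n[j]) * (col_mean[j] - pooled_mean)
        for j in range(n_cols)
    )

    # Written as a weighted sum over individual observations, the squared
    # weights add up to ||c||^2 - <c, sqrt(n)>^2 / M.
    c_dot_sqrt_n = sum(c[j] * math.sqrt(n[j]) for j in range(n_cols))
    denom = sum(cj * cj for cj in c) - c_dot_sqrt_n ** 2 / M
    if denom <= 1e-12:
        # c is proportional to sqrt(n), so S is identically zero.
        return S, 1.0

    # Two-sided Hoeffding bound on P(|S| >= observed |S|) under the null.
    return S, min(1.0, 2.0 * math.exp(-2.0 * S * S / denom))


if __name__ == "__main__":
    # Two columns with opposite row usage, scored by the first row only.
    print(linear_stat_pvalue_bound([[50, 0], [0, 50]], [1.0, 0.0], [1.0, -1.0]))
```

For the perfectly separated table in the example, the statistic equals the square root of 50 and the bound is roughly 3.9e-22, whereas a table with identical columns gives a statistic of zero and a bound of 1. The bound is only valid when `f` and `c` are not tuned on the same counts they are used to test.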