
Latest publications in arXiv: Methodology

Analysis and Simulation of Extremes and Rare Events in Complex Systems
Pub Date : 2020-05-11 DOI: 10.1007/978-3-030-51264-4_7
Meagan Carney, H. Kantz, M. Nicol
{"title":"Analysis and Simulation of Extremes and Rare Events in Complex Systems","authors":"Meagan Carney, H. Kantz, M. Nicol","doi":"10.1007/978-3-030-51264-4_7","DOIUrl":"https://doi.org/10.1007/978-3-030-51264-4_7","url":null,"abstract":"","PeriodicalId":186390,"journal":{"name":"arXiv: Methodology","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116815409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Nonparametric sequential change-point detection for multivariate time series based on empirical distribution functions
Pub Date : 2020-04-26 DOI: 10.1214/21-EJS1798
I. Kojadinovic, Ghislain Verdier
The aim of sequential change-point detection is to issue an alarm when it is thought that certain probabilistic properties of the monitored observations have changed. This work is concerned with nonparametric, closed-end testing procedures based on differences of empirical distribution functions that are designed to be particularly sensitive to changes in the contemporary distribution of multivariate time series. The proposed detectors are adaptations of statistics used in a posteriori (offline) change-point testing and involve a weighting that gives more importance to recent observations. The resulting sequential change-point detection procedures are carried out by comparing the detectors to threshold functions estimated through resampling such that the probability of false alarm remains approximately constant over the monitoring period. A generic result on the asymptotic validity of such a way of estimating a threshold function is stated. As a corollary, the asymptotic validity of the studied sequential tests based on empirical distribution functions is proven when these are carried out using a dependent multiplier bootstrap for multivariate time series. Large-scale Monte Carlo experiments demonstrate the good finite-sample properties of the resulting procedures. The application of the derived sequential tests is illustrated on financial data.
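To make the resampling-calibrated monitoring idea concrete, here is a minimal, hypothetical Python sketch: it uses a univariate stream, a plain ECDF-difference detector without the paper's weighting, and a constant threshold estimated by i.i.d. resampling of the learning sample rather than the authors' time-varying threshold functions and dependent multiplier bootstrap.

```python
import numpy as np

def ecdf_detector(learning, new_obs):
    """Largest absolute difference between the ECDF of the learning sample and the
    ECDF of the observations collected so far, evaluated on the pooled sample."""
    grid = np.concatenate([learning, new_obs])
    f_learn = np.mean(learning[:, None] <= grid[None, :], axis=0)
    f_new = np.mean(new_obs[:, None] <= grid[None, :], axis=0)
    return np.max(np.abs(f_new - f_learn))

def resampling_threshold(learning, n_monitor, alpha=0.05, n_rep=200, rng=None):
    """Constant threshold such that the probability of a false alarm anywhere in the
    closed-end monitoring period is roughly alpha, estimated by i.i.d. resampling of
    the learning sample under the no-change hypothesis."""
    rng = np.random.default_rng(rng)
    maxima = np.empty(n_rep)
    for b in range(n_rep):
        pseudo = rng.choice(learning, size=len(learning) + n_monitor, replace=True)
        learn_b, monitor_b = pseudo[:len(learning)], pseudo[len(learning):]
        maxima[b] = max(ecdf_detector(learn_b, monitor_b[:k + 1]) for k in range(n_monitor))
    return np.quantile(maxima, 1 - alpha)

# toy monitoring run: the mean of the stream shifts halfway through
rng = np.random.default_rng(0)
learning = rng.normal(size=200)
stream = np.concatenate([rng.normal(size=30), rng.normal(loc=1.0, size=30)])
threshold = resampling_threshold(learning, n_monitor=len(stream), rng=1)
for k in range(1, len(stream) + 1):
    if ecdf_detector(learning, stream[:k]) > threshold:
        print("alarm at monitoring time", k)
        break
```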
Citations: 4
Modeling Nonstationary and Asymmetric Multivariate Spatial Covariances via Deformations
Pub Date : 2020-04-18 DOI: 10.5705/ss.202020.0156
Quan Vu, A. Zammit‐Mangion, N. Cressie
Multivariate spatial-statistical models are useful for modeling environmental and socio-demographic processes. The most commonly used models for multivariate spatial covariances assume both stationarity and symmetry for the cross-covariances, but these assumptions are rarely tenable in practice. In this article we introduce a new and highly flexible class of nonstationary and asymmetric multivariate spatial covariance models that are constructed by modeling the simpler and more familiar stationary and symmetric multivariate covariances on a warped domain. Inspired by recent developments in the univariate case, we propose modeling the warping function as a composition of a number of simple injective warping functions in a deep-learning framework. Importantly, covariance-model validity is guaranteed by construction. We establish the types of warpings that allow for symmetry and asymmetry, and we use likelihood-based methods for inference that are computationally efficient. The utility of this new class of models is shown through various data illustrations, including a simulation study on nonstationary data and an application on ocean temperatures at two different depths.
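The central construction (a stationary, symmetric covariance evaluated on an injectively warped domain, so validity holds by construction) can be sketched as follows. The warp below is a toy composition of coordinate-wise monotone maps, and the process-specific shifts used to induce asymmetric cross-covariances are an illustrative choice, not the deep compositional warping model fitted in the paper.

```python
import numpy as np

def monotone_warp_1d(x, b=0.6, c=2.0, x0=0.0):
    """Simple injective 1-d warp: identity plus a bounded monotone bump
    (derivative 1 + b*c*sech^2 > 0, so the map is strictly increasing)."""
    return x + b * np.tanh(c * (x - x0))

def warp(s):
    """Composition of coordinate-wise injective warps on R^2 (a stand-in for the
    composition of simple warping units described in the abstract)."""
    return np.column_stack([monotone_warp_1d(s[:, 0]),
                            monotone_warp_1d(s[:, 1], b=0.3)])

def multivariate_warped_cov(s, shifts, range_par=1.0):
    """Joint covariance of p processes at n locations: each (location, process) pair
    is mapped to warp(s) + shift_p and a stationary exponential kernel is evaluated
    there, so the joint matrix is positive semi-definite by construction. The warp
    makes the model nonstationary on the original domain; distinct shifts make the
    cross-covariances asymmetric."""
    pts = np.vstack([warp(s) + d for d in shifts])            # (p*n, 2)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return np.exp(-dists / range_par)                          # (p*n, p*n)

# toy check: the joint covariance matrix is valid (PSD up to rounding)
rng = np.random.default_rng(0)
s = rng.uniform(-2, 2, size=(50, 2))
C = multivariate_warped_cov(s, shifts=[np.zeros(2), np.array([0.4, 0.0])])
print(np.min(np.linalg.eigvalsh(C)) > -1e-8)
```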
Citations: 9
Stratification and Optimal Resampling for Sequential Monte Carlo
Pub Date : 2020-04-04 DOI: 10.1093/BIOMET/ASAB004
Yichao Li, Wenshuo Wang, Ke Deng, Jun S. Liu
Sequential Monte Carlo (SMC), also known as particle filters, has been widely accepted as a powerful computational tool for making inference with dynamical systems. A key step in SMC is resampling, which plays the role of steering the algorithm towards the future dynamics. Several strategies have been proposed and used in practice, including multinomial resampling, residual resampling (Liu and Chen 1998), optimal resampling (Fearnhead and Clifford 2003), stratified resampling (Kitagawa 1996), and optimal transport resampling (Reich 2013). We show that, in the one-dimensional case, optimal transport resampling is equivalent to stratified resampling on the sorted particles, and they both minimize the resampling variance as well as the expected squared energy distance between the original and resampled empirical distributions; in the multidimensional case, the variance of stratified resampling after sorting particles using the Hilbert curve (Gerber et al. 2019) in $\mathbb{R}^d$ is $O(m^{-(1+2/d)})$, an improved rate compared to the original $O(m^{-(1+1/d)})$, where $m$ is the number of particles. This improved rate is the lowest for ordered stratified resampling schemes, as conjectured in Gerber et al. (2019). We also present an almost sure bound on the Wasserstein distance between the original and Hilbert-curve-resampled empirical distributions. In light of these theoretical results, we propose the stratified multiple-descendant growth (SMG) algorithm, which allows us to explore the sample space more efficiently compared to the standard i.i.d. multiple-descendant sampling-resampling approach as measured by the Wasserstein metric. Numerical evidence is provided to demonstrate the effectiveness of our proposed method.
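A minimal sketch of the one-dimensional case discussed above: stratified resampling applied to particles sorted by value, which the paper shows coincides with optimal transport resampling in one dimension. The weighting step wrapped around it is only an illustrative toy example.

```python
import numpy as np

def stratified_resample_sorted(particles, weights, rng=None):
    """Stratified resampling on sorted particles (1-d case): one uniform draw is
    made in each of the m equal-probability strata and mapped through the weighted
    empirical CDF of the sorted particles."""
    rng = np.random.default_rng(rng)
    order = np.argsort(particles)
    particles, weights = particles[order], weights[order]
    m = len(particles)
    u = (np.arange(m) + rng.uniform(size=m)) / m   # one point per stratum
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                                   # guard against rounding error
    return particles[np.searchsorted(cdf, u)]

# toy usage inside one SMC weighting/resampling step
rng = np.random.default_rng(1)
x = rng.normal(size=1000)                           # particles
logw = -0.5 * (x - 1.0) ** 2                        # unnormalised log-weights
w = np.exp(logw - logw.max()); w /= w.sum()
x_new = stratified_resample_sorted(x, w, rng=2)
print(x_new.mean(), np.sum(w * x))                  # resampled mean ~ weighted mean
```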
Citations: 4
A novel change-point approach for the detection of gas emission sources using remotely contained concentration data
Pub Date : 2020-04-02 DOI: 10.1214/20-aoas1345
I. Eckley, C. Kirch, S. Weber
Motivated by an example from remote sensing of gas emission sources, we derive two novel change-point procedures for multivariate time series where, in contrast to the classical change-point literature, the changes are not required to be aligned in the different components of the time series. Instead, the change points are described by a functional relationship where the precise shape depends on unknown parameters of interest, such as the source of the gas emission in the above example. Two different types of tests and the corresponding estimators for the unknown parameters describing the change locations are proposed. We derive the null asymptotics for both tests under weak assumptions on the error time series and show asymptotic consistency under alternatives. Furthermore, we prove consistency for the corresponding estimators of the parameters of interest. The small-sample behavior of the methodology is assessed by means of a simulation study, and the above remote sensing example is analyzed in detail.
Citations: 0
Bootstrap p-values reduce type 1 error of the robust rank-order test of difference in medians
Pub Date : 2020-03-09 DOI: 10.17632/397FM8XDZ2.1
Nirvik Sinha
The robust rank-order test (Fligner and Policello, 1981) was designed as an improvement of the non-parametric Wilcoxon-Mann-Whitney U-test to be more appropriate when the samples being compared have unequal variance. However, it tends to be excessively liberal when the samples are asymmetric. This is likely because the test statistic is assumed to have a standard normal distribution for sample sizes > 12. This work proposes an on-the-fly method to obtain the distribution of the test statistic from which the critical/p-value may be computed directly. The method of likelihood maximization is used to estimate the parameters of the parent distributions of the samples being compared. Using these estimated populations, the null distribution of the test statistic is obtained by the Monte-Carlo method. Simulations are performed to compare the proposed method with that of standard normal approximation of the test statistic. For small sample sizes (<= 20), the Monte-Carlo method outperforms the normal approximation method. This is especially true for low values of significance levels (< 5%). Additionally, when the smaller sample has the larger standard deviation, the Monte-Carlo method outperforms the normal approximation method even for large sample sizes (= 40/60). The two methods do not differ in power. Finally, a Monte-Carlo sample size of 10^4 is found to be sufficient to obtain the aforementioned relative improvements in performance. Thus, the results of this study pave the way for development of a toolbox to perform the robust rank-order test in a distribution-free manner.
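The proposed procedure can be sketched as follows: compute the Fligner-Policello statistic, estimate the parent distributions by maximum likelihood, and obtain the null distribution of the statistic by Monte Carlo simulation from the fitted parents. Normal parent distributions and re-centring both fitted parents at a common location to impose the equal-medians null are illustrative assumptions here, not necessarily the choices made in the paper.

```python
import numpy as np

def robust_rank_order_stat(x, y):
    """Fligner-Policello robust rank-order statistic U."""
    p = np.array([np.sum(y < xi) for xi in x], dtype=float)  # placements of x among y
    q = np.array([np.sum(x < yj) for yj in y], dtype=float)  # placements of y among x
    pbar, qbar = p.mean(), q.mean()
    v1, v2 = np.sum((p - pbar) ** 2), np.sum((q - qbar) ** 2)
    return (p.sum() - q.sum()) / (2.0 * np.sqrt(v1 + v2 + pbar * qbar))

def monte_carlo_p_value(x, y, n_rep=10_000, rng=None):
    """Two-sided p-value from a Monte Carlo null distribution of U, with parent
    parameters estimated by maximum likelihood (normal parents assumed here purely
    for illustration; the common location imposes the equal-medians null)."""
    rng = np.random.default_rng(rng)
    u_obs = robust_rank_order_stat(x, y)
    sd_x, sd_y = np.std(x), np.std(y)            # normal MLEs of the scales
    mu0 = np.mean(np.concatenate([x, y]))        # common location under the null
    u_null = np.empty(n_rep)
    for b in range(n_rep):
        xb = rng.normal(mu0, sd_x, size=len(x))
        yb = rng.normal(mu0, sd_y, size=len(y))
        u_null[b] = robust_rank_order_stat(xb, yb)
    return np.mean(np.abs(u_null) >= abs(u_obs))

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 15)
y = rng.normal(0, 3, 15)     # unequal variances, equal medians
print(monte_carlo_p_value(x, y, n_rep=2000, rng=1))
```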
Citations: 1
A nearest-neighbor based nonparametric test for viral remodeling in heterogeneous single-cell proteomic data
Pub Date : 2020-03-05 DOI: 10.1214/20-aoas1362
Trambak Banerjee, B. Bhattacharya, Gourab Mukherjee
An important problem in contemporary immunology studies based on single-cell protein expression data is to determine whether cellular expressions are remodeled post infection by a pathogen. One natural approach for detecting such changes is to use non-parametric two-sample statistical tests. However, in single-cell studies, direct application of these tests is often inadequate because single-cell level expression data from uninfected populations often contains attributes of several latent sub-populations with highly heterogeneous characteristics. As a result, viruses often infect these different sub-populations at different rates in which case the traditional nonparametric two-sample tests for checking similarity in distributions are no longer conservative. We propose a new nonparametric method for Testing Remodeling Under Heterogeneity (TRUH) that can accurately detect changes in the infected samples compared to possibly heterogeneous uninfected samples. Our testing framework is based on composite nulls and is designed to allow the null model to encompass the possibility that the infected samples, though unaltered by the virus, might be dominantly arising from under-represented sub-populations in the baseline data. The TRUH statistic, which uses nearest neighbor projections of the infected samples into the baseline uninfected population, is calibrated using a novel bootstrap algorithm. We demonstrate the non-asymptotic performance of the test via simulation experiments and derive the large sample limit of the test statistic, which provides theoretical support towards consistent asymptotic calibration of the test. We use the TRUH statistic for studying remodeling in tonsillar T cells under different types of HIV infection and find that unlike traditional tests, TRUH based statistical inference conforms to the biologically validated immunological theories on HIV infection.
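A heavily simplified sketch of the idea (nearest-neighbor projections of infected cells into the uninfected baseline, calibrated by a bootstrap over the heterogeneous baseline): the statistic and resampling scheme below are illustrative stand-ins, not the TRUH statistic or the authors' bootstrap algorithm.

```python
import numpy as np
from scipy.spatial import cKDTree

def nn_projection_stat(infected, baseline):
    """Mean distance from each infected cell to its nearest neighbour in the
    uninfected baseline (a simplified stand-in for the TRUH statistic)."""
    d, _ = cKDTree(baseline).query(infected, k=1)
    return d.mean()

def bootstrap_p_value(infected, baseline, n_rep=1000, rng=None):
    """Calibrate by drawing pseudo-'infected' sets from the heterogeneous baseline
    itself, so the null allows the infected cells to arise from any mix of baseline
    sub-populations (a simpler scheme than the paper's)."""
    rng = np.random.default_rng(rng)
    t_obs = nn_projection_stat(infected, baseline)
    t_null = np.empty(n_rep)
    for b in range(n_rep):
        idx = rng.choice(len(baseline), size=len(infected), replace=False)
        pseudo, rest = baseline[idx], np.delete(baseline, idx, axis=0)
        t_null[b] = nn_projection_stat(pseudo, rest)
    return np.mean(t_null >= t_obs)

rng = np.random.default_rng(0)
baseline = np.vstack([rng.normal(0, 1, (300, 5)),     # two baseline sub-populations
                      rng.normal(3, 1, (100, 5))])
infected = rng.normal(3, 1, (80, 5)) + 0.5            # remodelled expressions
print(bootstrap_p_value(infected, baseline, n_rep=500, rng=1))
```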
Citations: 3
Finite space Kantorovich problem with an MCMC of table moves
Pub Date : 2020-02-24 DOI: 10.1214/21-EJS1804
Giovanni Pistone, Fabio Rapallo, M. Rogantin
In Optimal Transport (OT) on a finite metric space, one defines a distance on the probability simplex that extends the distance on the ground space. The distance is the value of a Linear Programming (LP) problem on the set of nonnegative-valued 2-way tables with assigned probability functions as margins. We apply to this case the methodology of moves from Algebraic Statistics (AS) and use it to derive a Markov chain Monte Carlo (MCMC) solution algorithm.
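The LP definition of the distance is easy to state in code. The sketch below solves the finite-space Kantorovich problem directly with a generic LP solver; it illustrates the problem being solved, not the algebraic-statistics, move-based MCMC algorithm proposed in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_distance(cost, p, q):
    """OT distance between probability vectors p and q on a finite space, as the LP
    over nonnegative 2-way tables t with row margins p and column margins q."""
    n, m = cost.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j t[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i t[i, j] = q[j]
    b_eq = np.concatenate([p, q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

# ground space {0, 1, 2} with the usual metric as cost
cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
dist, plan = kantorovich_distance(cost, p, q)
print(dist)   # 1.0: all mass moves one step to the right
```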
Citations: 3
Thou Shalt Not Reject the P-value
Pub Date : 2020-02-17 DOI: 10.13140/RG.2.2.18014.59206/1
Oliver Y. Ch'en, Raúl G. Saraiva, G. Nagels, Huy P Phan, Tom Schwantje, H. Cao, Jiangtao Gou, Jenna M. Reinen, Bin Xiong, M. Vos
Since its debut in the 18th century, the P-value has been an important part of hypothesis testing-based scientific discoveries. As the statistical engine accelerates, questions are beginning to be raised, asking to what extent scientific discoveries based on a P-value are reliable and reproducible, and the voice calling for adjusting the significance level or banning the P-value has been increasingly heard. Inspired by these questions and discussions, here we enquire into the useful roles and misuses of the P-value in scientific studies. For common misuses and misinterpretations, we provide modest recommendations for practitioners. Additionally, we compare statistical significance with clinical relevance. In parallel, we review the Bayesian alternatives for seeking evidence. Finally, we discuss the promises and risks of using meta-analysis to pool P-values from multiple studies to aggregate evidence. Taken together, the P-value underpins a useful probabilistic decision-making system and provides evidence at a continuous scale. But its interpretation must be contextual, considering the scientific question, experimental design (including model specification, sample size, and significance level), statistical power, effect size, and reproducibility.
Citations: 1
Computationally efficient univariate filtering for massive data.
Pub Date : 2020-02-11 DOI: 10.1285/I20705948V13N2P390
M. Tsagris, A. Alenazi, S. Fafalios
The vast availability of large-scale, massive and big data has increased the computational cost of data analysis. One such case is the computational cost of univariate filtering, which typically involves fitting many univariate regression models and is essential for numerous variable selection algorithms to reduce the number of predictor variables. The paper demonstrates how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 - 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence this paper strongly recommends substituting the log-likelihood ratio test with the score test when coping with large-scale data, massive data, big data, or even data whose sample size is in the order of a few tens of thousands or higher.
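A sketch of the cheap end of this trade-off: for a Gaussian linear model, the score statistic for a single predictor is a monotone function of the squared Pearson correlation, so univariate filtering can be done with one matrix product over all predictors instead of fitting one regression per column. The code below is an illustrative implementation of that shortcut, not the authors' software.

```python
import numpy as np

def correlation_filter(X, y, k=10):
    """Rank all predictors by absolute Pearson correlation with y, computed in one
    matrix product instead of fitting a univariate regression model per column."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:k], r

# toy run: the informative columns should be ranked first
rng = np.random.default_rng(0)
n, p = 5_000, 500
X = rng.normal(size=(n, p))
y = X[:, 3] - 2 * X[:, 17] + rng.normal(size=n)
top, r = correlation_filter(X, y, k=5)
print(top)               # columns 17 and 3 should appear at the top
print(n * r[top] ** 2)   # for a Gaussian linear model, the score statistic is n * r**2
```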
Citations: 0