arXiv - STAT - Methodology: Latest Articles

Dynamic Bayesian Networks with Conditional Dynamics in Edge Addition and Deletion

arXiv - STAT - Methodology

Pub Date : 2024-09-13 DOI: arxiv-2409.08965

Lupe S. H. Chan, Amanda M. Y. Chu, Mike K. P. So

This study presents a dynamic Bayesian network framework that facilitates intuitive gradual edge changes. We use two conditional dynamics to model the edge addition and deletion, and edge selection separately. Unlike previous research that uses a mixture network approach, which restricts the number of possible edge changes, or structural priors to induce gradual changes, which can lead to unclear network evolution, our model induces more frequent and intuitive edge change dynamics. We employ Markov chain Monte Carlo (MCMC) sampling to estimate the model structures and parameters and demonstrate the model's effectiveness in a portfolio selection application.

Citations: 0

Fused $L_{1/2}$ prior for large scale linear inverse problem with Gibbs bouncy particle sampler

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07874

Xiongwen Ke, Yanan Fan, Qingping Zhou

In this paper, we study a Bayesian approach for solving large-scale linear inverse problems arising in various scientific and engineering fields. We propose a fused $L_{1/2}$ prior with edge-preserving and sparsity-promoting properties and show that it can be formulated as a Gaussian mixture Markov random field. Since the density function of this family of priors is neither log-concave nor Lipschitz, gradient-based Markov chain Monte Carlo methods cannot be applied to sample the posterior. Thus, we present a Gibbs sampler in which all the conditional posteriors involved have closed-form expressions. The Gibbs sampler works well for small problems but is computationally intractable for large-scale problems because it requires sampling from a high-dimensional Gaussian distribution. To reduce the computational burden, we construct a Gibbs bouncy particle sampler (Gibbs-BPS) based on a piecewise deterministic Markov process. This new sampler combines elements of the Gibbs sampler with the bouncy particle sampler, and its computational complexity is an order of magnitude smaller. We show that the new sampler converges to the target distribution. With computed tomography examples, we demonstrate that the proposed method shows competitive performance with existing popular Bayesian methods and is highly efficient in large-scale problems.

Citations: 0

Review of Recent Advances in Gaussian Process Regression Methods

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.08112

Chenyi Lyu, Xingchi Liu, Lyudmila Mihaylova

Gaussian process (GP) methods have been widely studied recently, especially for large-scale systems with big data and even more extreme cases when data is sparse. The key advantages of these methods are that they 1) provide inherent ways to assess the impact of uncertainties (especially in the data and environment) on the solutions, 2) have efficient factorisation-based implementations, and 3) can be implemented easily in a distributed manner and hence provide scalable solutions. This paper reviews recently developed key factorised GP methods such as the hierarchical off-diagonal low-rank approximation methods and GP with Kronecker structures. An example illustrates the performance of these methods with respect to accuracy and computational complexity.
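
To make the Kronecker idea concrete, here is a minimal sketch of the core linear solve that GP regression needs when inputs lie on a two-dimensional grid and the kernel factorises as K1 ⊗ K2. It is not taken from the review itself: the function names `rbf_kernel` and `kron_gp_solve`, the squared-exponential kernel, the grid sizes, and the noise variance are all illustrative assumptions. The point is that two small eigendecompositions replace one solve against the full (n1·n2) × (n1·n2) matrix.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs x (illustrative choice)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def kron_gp_solve(K1, K2, Y, noise_var):
    """Solve (K1 kron K2 + noise_var * I) alpha = vec(Y) without forming the Kronecker product.

    Y has shape (n1, n2); vec() is row-major, matching NumPy's default ravel order.
    """
    lam1, Q1 = np.linalg.eigh(K1)
    lam2, Q2 = np.linalg.eigh(K2)
    # Rotate the targets into the joint eigenbasis: vec(Q1^T Y Q2).
    Z = Q1.T @ Y @ Q2
    # The eigenvalues of K1 kron K2 are all pairwise products lam1[i] * lam2[j].
    Z = Z / (np.outer(lam1, lam2) + noise_var)
    # Rotate back: vec(Q1 Z Q2^T).
    return Q1 @ Z @ Q2.T


# Hypothetical 6 x 5 grid with random targets.
rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 1.0, 6)
x2 = np.linspace(0.0, 2.0, 5)
K1, K2 = rbf_kernel(x1, 0.3), rbf_kernel(x2, 0.5)
Y = rng.normal(size=(6, 5))
alpha = kron_gp_solve(K1, K2, Y, noise_var=0.1)
```

The result can be compared against a dense solve with `np.kron(K1, K2)` on small grids; on large grids only the factorised version stays affordable, since its cost is driven by the two small eigendecompositions.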

Citations: 0

Community detection in multi-layer networks by regularized debiased spectral clustering

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07956

Huan Qing

Community detection is a crucial problem in the analysis of multi-layer networks. In this work, we introduce a new method, called regularized debiased sum of squared adjacency matrices (RDSoS), to detect latent communities in multi-layer networks. RDSoS is developed based on a novel regularized Laplacian matrix that regularizes the debiased sum of squared adjacency matrices. In contrast, the classical regularized Laplacian matrix typically regularizes the adjacency matrix of a single-layer network. Therefore, at a high level, our regularized Laplacian matrix extends the classical regularized Laplacian matrix to multi-layer networks. We establish the consistency property of RDSoS under the multi-layer stochastic block model (MLSBM) and further extend RDSoS and its theoretical results to the degree-corrected version of the MLSBM model. The effectiveness of the proposed methods is evaluated and demonstrated through synthetic and real datasets.
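
The abstract gives no formulas, so the following is only a rough sketch of what a regularized Laplacian built on a debiased sum of squared adjacency matrices could look like. Two assumptions are made and should be checked against the paper: the debiasing step subtracts each layer's diagonal degree matrix from its squared adjacency matrix, and the regularization uses the familiar D_τ^{-1/2} S D_τ^{-1/2} form with D_τ = D + τI. The helper names, the simulated two-block network, and the choice of τ as the average row sum are likewise hypothetical.

```python
import numpy as np


def debiased_sum_of_squares(layers):
    """Sum over layers of A_l @ A_l minus the diagonal degree matrix of A_l (assumed form)."""
    n = layers[0].shape[0]
    S = np.zeros((n, n))
    for A in layers:
        S += A @ A - np.diag(A.sum(axis=1))
    return S


def regularized_laplacian(S, tau):
    """D_tau^{-1/2} S D_tau^{-1/2} with D_tau = diag(row sums of S) + tau * I (assumed form)."""
    d = S.sum(axis=1) + tau
    inv_sqrt = 1.0 / np.sqrt(d)
    return S * inv_sqrt[:, None] * inv_sqrt[None, :]


def leading_eigenvectors(L, k):
    """Eigenvectors belonging to the k eigenvalues of largest magnitude."""
    vals, vecs = np.linalg.eigh(L)
    order = np.argsort(-np.abs(vals))[:k]
    return vecs[:, order]


def sample_layer(rng, labels, p_in, p_out):
    """One undirected, loop-free layer from a simple stochastic block model."""
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)


# Hypothetical example: 40 nodes, 2 communities, 3 layers.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
layers = [sample_layer(rng, labels, 0.5, 0.1) for _ in range(3)]
S = debiased_sum_of_squares(layers)
L = regularized_laplacian(S, tau=S.sum(axis=1).mean())
U = leading_eigenvectors(L, k=2)
```

In a full spectral clustering pipeline the rows of `U` would then be passed to a clustering routine such as k-means; that step is omitted here to keep the sketch dependency-free.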

Citations: 0

Multiple tests for restricted mean time lost with competing risks data

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07917

Merle Munko, Dennis Dobler, Marc Ditzhaus

Easy-to-interpret effect estimands are highly desirable in survival analysis. In the competing risks framework, one good candidate is the restricted mean time lost (RMTL). It is defined as the area under the cumulative incidence function up to a prespecified time point and, thus, it summarizes the cumulative incidence function into a meaningful estimand. While existing RMTL-based tests are limited to two-sample comparisons and mostly to two event types, we aim to develop general contrast tests for factorial designs and an arbitrary number of event types based on a Wald-type test statistic. Furthermore, we avoid the often-made, rather restrictive continuity assumption on the event time distribution. This allows for ties in the data, which often occur in practical applications, e.g., when event times are measured in whole days. In addition, we develop more reliable tests for RMTL comparisons that are based on a permutation approach to improve the small sample performance. In a second step, multiple tests for RMTL comparisons are developed to test several null hypotheses simultaneously. Here, we incorporate the asymptotically exact dependence structure between the local test statistics to gain more power. The small sample performance of the proposed testing procedures is analyzed in simulations and finally illustrated by analyzing a real data example about leukemia patients who underwent bone marrow transplantation.
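
The definition above, RMTL(τ) = ∫₀^τ F(t) dt for a cumulative incidence function F, is easy to compute once F has been estimated as a step function. The sketch below assumes a right-continuous step estimate described by its jump times and the values it takes at those jumps; the function name `rmtl` and the example numbers are made up for illustration and are not from the paper.

```python
import numpy as np


def rmtl(jump_times, cif_values, tau):
    """Area under a right-continuous step cumulative incidence function on [0, tau].

    jump_times: increasing times at which the estimated CIF jumps.
    cif_values: CIF value attained at (and held after) each jump time.
    """
    t = np.asarray(jump_times, dtype=float)
    f = np.asarray(cif_values, dtype=float)
    keep = t < tau
    t, f = t[keep], f[keep]
    # Each value is held from its jump time until the next jump (or tau).
    widths = np.diff(np.append(t, tau))
    return float(np.sum(f * widths))


# Hypothetical CIF for one event type, with jumps at days 2, 5 and 9.
area = rmtl([2, 5, 9], [0.1, 0.3, 0.4], tau=10)
# 0.1 * 3 + 0.3 * 4 + 0.4 * 1 = 1.9 days lost on average within the first 10 days
```

Because the integral only uses the heights and widths of the steps, tied event times need no special handling here: ties simply produce larger jumps at the same time point.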

Citations: 0

Causal inference and racial bias in policing: New estimands and the importance of mobility data

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.08059

Zhuochao Huang, Brenden Beck, Joseph Antonelli

Studying racial bias in policing is a critically important problem, but one that comes with a number of inherent difficulties due to the nature of the available data. In this manuscript we tackle multiple key issues in the causal analysis of racial bias in policing. First, we formalize race and place policing, the idea that individuals of one race are policed differently when they are in neighborhoods primarily made up of individuals of other races. We develop an estimand to study this question rigorously, show the assumptions necessary for causal identification, and develop sensitivity analyses to assess robustness to violations of key assumptions. Additionally, we investigate difficulties with existing estimands targeting racial bias in policing. We show for these estimands, and the estimands developed in this manuscript, that estimation can benefit from incorporating mobility data into analyses. We apply these ideas to a study in New York City, where we find a large amount of racial bias, as well as race and place policing, and that these findings are robust to large violations of untestable assumptions. We additionally show that mobility data can make substantial impacts on the resulting estimates, suggesting it should be used whenever possible in subsequent studies.

Citations: 0

Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07879

Donato Riccio, Fabrizio Maturo, Elvira Romano

Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis.

Citations: 0

Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.08201

Petr Philonenko, Sergey Postovalov

The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the distribution of test statistics for the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. All results from numerical experiments were obtained from a synthetic dataset generated using the Smirnov transform (Inverse Transform Sampling) and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods. All necessary materials (source code, example scripts, dataset, and samples) are available on GitHub and Hugging Face.
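
Inverse transform sampling draws a uniform variate and pushes it through the quantile function of the target distribution. As a small illustration of how a right-censored two-sample dataset could be simulated this way, the sketch below uses exponential event and censoring times; the distributions, rates, sample sizes, and function names are assumptions chosen for simplicity and do not reproduce the authors' published dataset or code.

```python
import numpy as np


def sample_exponential(rng, rate, size):
    """Inverse transform sampling: apply the exponential quantile function to uniforms."""
    u = rng.random(size)
    return -np.log1p(-u) / rate


def right_censored_sample(rng, event_rate, censor_rate, size):
    """Return observed times min(T, C) and event indicators 1{T <= C}."""
    t = sample_exponential(rng, event_rate, size)
    c = sample_exponential(rng, censor_rate, size)
    return np.minimum(t, c), (t <= c).astype(int)


# Two hypothetical groups with different event rates and a common censoring rate.
rng = np.random.default_rng(42)
times_a, events_a = right_censored_sample(rng, event_rate=1.0, censor_rate=0.5, size=200)
times_b, events_b = right_censored_sample(rng, event_rate=1.5, censor_rate=0.5, size=200)
```

Repeating this generation many times with fresh random draws, and recording how often a test rejects, gives a Monte Carlo estimate of its size (equal rates) or power (different rates).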

Citations: 0

Cellwise outlier detection in heterogeneous populations

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07881

Giorgia Zaccaria, Luis A. García-Escudero, Francesca Greselin, Agustín Mayo-Íscar

Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations can encompass valuable information in some observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The proposal is estimated via an Expectation-Maximization (EM) algorithm with an additional step for flagging the contaminated cells of a data matrix and then imputing -- instead of discarding -- them before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real data sets.
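
Treating flagged cells as missing values means they can be filled in with their conditional expectation given the clean cells of the same row. For a single Gaussian component this is the standard formula E[x_m | x_o] = μ_m + Σ_mo Σ_oo⁻¹ (x_o − μ_o). The sketch below implements only that one step, under the assumption that the flags, mean, and covariance are already given; the function name and the toy numbers are illustrative, and the paper's actual flagging rule and mixture-weighted update are not reproduced here.

```python
import numpy as np


def impute_flagged_cells(x, flagged, mean, cov):
    """Replace flagged cells of one observation by their Gaussian conditional mean.

    x: 1-D observation; flagged: boolean mask of cells treated as missing.
    """
    x = np.asarray(x, dtype=float).copy()
    m = np.asarray(flagged, dtype=bool)
    o = ~m
    if not m.any():
        return x
    if not o.any():
        # Nothing reliable to condition on: fall back to the marginal mean.
        x[m] = mean[m]
        return x
    # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    coef = np.linalg.solve(cov[np.ix_(o, o)], x[o] - mean[o])
    x[m] = mean[m] + cov[np.ix_(m, o)] @ coef
    return x


# Toy example: the second cell (50.0) is flagged as contaminated.
mean = np.zeros(2)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
imputed = impute_flagged_cells([2.0, 50.0], [False, True], mean, cov)
# imputed is [2.0, 1.0]: the flagged cell is replaced by 0.5 * 2.0
```

In a mixture setting one would presumably compute such an imputation per component and combine them using the posterior component probabilities, but that weighting is an assumption about the method rather than something the abstract states.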

Citations: 0

Robust and efficient estimation in the presence of a randomly censored covariate

arXiv - STAT - Methodology

Pub Date : 2024-09-12 DOI: arxiv-2409.07795

Seong-ho Lee, Brian D. Richardson, Yanyuan Ma, Karen S. Marder, Tanya P. Garcia

In Huntington's disease research, a current goal is to understand how symptoms change prior to a clinical diagnosis. Statistically, this entails modeling symptom severity as a function of the covariate 'time until diagnosis', which is often heavily right-censored in observational studies. Existing estimators that handle right-censored covariates have varying statistical efficiency and robustness to misspecified models for nuisance distributions (those of the censored covariate and censoring variable). On one extreme, complete case estimation, which utilizes uncensored data only, is free of nuisance distribution models but discards informative censored observations. On the other extreme, maximum likelihood estimation is maximally efficient but inconsistent when the covariate's distribution is misspecified. We propose a semiparametric estimator that is robust and efficient. When the nuisance distributions are modeled parametrically, the estimator is doubly robust, i.e., consistent if at least one distribution is correctly specified, and semiparametric efficient if both models are correctly specified. When the nuisance distributions are estimated via nonparametric or machine learning methods, the estimator is consistent and semiparametric efficient. We show empirically that the proposed estimator, implemented in the R package sparcc, has its claimed properties, and we apply it to study Huntington's disease symptom trajectories using data from the Enroll-HD study.

Citations: 0
