
Statistical Analysis and Data Mining: Latest Publications

Bayesian shrinkage models for integration and analysis of multiplatform high-dimensional genomics data
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-04-06 | DOI: 10.1002/sam.11682
Hao Xue, Sounak Chakraborty, Tanujit Dey
With the increasing availability in clinical research of biomedical data from multiple platforms for the same patients, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two-stage hierarchical Bayesian model that integrates high-dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use an Expectation-Maximization-based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage. Then, we apply a group-wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model-based data integration method yields fewer false positives in selecting predictive variables than an existing method. Moreover, real data analysis of a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method. Furthermore, most of the selected biomarkers are crucial in GBM prognosis, as confirmed by the existing literature.
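To make the two-stage idea concrete, here is a minimal sketch (not the paper's Bayesian shrinkage model): stage one fits per-gene methylation-to-expression regressions and clusters the slopes with an EM-fitted Gaussian mixture to form regulation groups; stage two runs a penalized regression of the outcome within each group, with ordinary Lasso standing in for the group-wise shrinkage prior. All data, dimensions, and the penalty level are synthetic assumptions.

```python
# A minimal two-stage sketch: EM-based grouping of regulation slopes, then
# group-wise penalized selection of outcome-associated genes.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 120, 50                        # patients, genes (toy sizes)
methyl = rng.normal(size=(n, p))      # methylation per gene
expr = -0.8 * methyl + rng.normal(scale=0.5, size=(n, p))   # gene expression
clinical = rng.normal(size=(n, 2))    # clinical covariates
y = expr[:, :5].sum(axis=1) + clinical @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Stage 1: learn regulation strength per gene, then group genes by EM clustering.
slopes = np.array([
    LinearRegression().fit(methyl[:, [j]], expr[:, j]).coef_[0] for j in range(p)
])
groups = GaussianMixture(n_components=3, random_state=0).fit_predict(slopes.reshape(-1, 1))

# Stage 2: within each group, penalized selection of genes predictive of the
# outcome, always adjusting for the clinical features.
selected = []
for g in np.unique(groups):
    idx = np.where(groups == g)[0]
    X = np.hstack([expr[:, idx], clinical])
    coef = Lasso(alpha=0.1).fit(X, y).coef_[: len(idx)]
    selected.extend(idx[np.abs(coef) > 1e-8])
print("selected genes:", sorted(selected))
```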
Citations: 0
Expert-in-the-loop design of integral nuclear data experiments
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-04-02 | DOI: 10.1002/sam.11677
Isaac Michaud, Michael Grosskopf, Jesson Hutchinson, Scott Vander Wiel
Nuclear data are fundamental inputs to radiation transport codes used for reactor design and criticality safety. The design of experiments to reduce nuclear data uncertainty has been a challenge for many years, but advances in the sensitivity calculations of radiation transport codes within the last two decades have made optimal experimental design possible. The design of integral nuclear experiments poses numerous challenges not emphasized in classical optimal design, in particular constrained design spaces (in both a statistical and an engineering sense), severely under-determined systems, and optimality uncertainty. We present a design pipeline for optimizing critical experiments that uses constrained Bayesian optimization within an iterative expert-in-the-loop framework. We show a successfully completed experimental campaign designed with this framework, involving two critical configurations and multiple measurements that targeted compensating errors in 239Pu nuclear data.
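A rough sketch of constrained Bayesian optimization with an expert-in-the-loop step, on an assumed toy objective and constraint (the actual criticality-design criterion and sensitivity calculations are not reproduced): a Gaussian process models the design objective, expected improvement proposes candidates, infeasible candidates are zeroed out, and a stand-in "expert" function can veto proposals before they are evaluated.

```python
# Constrained Bayesian optimization with a simulated expert veto step.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                 # hypothetical information-gain surrogate
    return np.sin(3 * x) + 0.5 * x

def feasible(x):                  # hypothetical engineering constraint
    return x <= 1.6

def expert_approves(x):           # stand-in for the human review step
    return x >= 0.1               # e.g., expert rejects impractically small designs

rng = np.random.default_rng(1)
X = rng.uniform(0, 2, size=(4, 1))
y = objective(X).ravel()
grid = np.linspace(0, 2, 400).reshape(-1, 1)

for it in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sd, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)     # expected improvement
    ei[~feasible(grid.ravel())] = 0.0                     # zero out infeasible designs
    order = np.argsort(-ei)
    x_next = next(grid[i, 0] for i in order if expert_approves(grid[i, 0]))
    X = np.vstack([X, [[x_next]]])
    y = np.append(y, objective(x_next))

print("best design found:", X[np.argmax(y), 0], "objective:", y.max())
```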
Citations: 0
Hub-aware random walk graph embedding methods for classification
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-04-01 | DOI: 10.1002/sam.11676
Aleksandar Tomčić, Miloš Savić, Miloš Radovanović
Over the last two decades, we have witnessed a huge increase in valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data, it is necessary to transform graphs into vector-based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general-purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualization, and link prediction. In this article, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. The random walk sampling strategies of the proposed algorithms have been designed to pay special attention to hubs, the high-degree nodes that play the most critical role in the overall connectedness of large-scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real-world networks. The obtained results indicate that our methods considerably improve the predictive power of the examined classifiers compared with the currently most popular random walk method for generating general-purpose graph embeddings (node2vec).
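As an illustration of the hub-aware idea (the exact sampling strategies of the two proposed algorithms are not specified in the abstract), the following sketch biases random-walk transitions toward high-degree neighbors via an assumed degree exponent and then trains a skip-gram model on the walks to obtain node embeddings.

```python
# Hub-biased random walks plus skip-gram embeddings; the bias exponent is an
# assumption, not the authors' exact weighting scheme.
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

def hub_biased_walks(G, num_walks=10, walk_len=20, bias=1.0, seed=0):
    rng = np.random.default_rng(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                w = np.array([G.degree(v) for v in nbrs], dtype=float) ** bias
                walk.append(rng.choice(nbrs, p=w / w.sum()))
            walks.append([str(v) for v in walk])
    return walks

G = nx.karate_club_graph()
walks = hub_biased_walks(G, bias=1.5)          # bias > 1 favors high-degree hubs
model = Word2Vec(sentences=walks, vector_size=32, window=5, min_count=0, sg=1, epochs=5)
embedding = {v: model.wv[str(v)] for v in G.nodes()}
print(len(embedding), "node embeddings of dimension", len(embedding[0]))
```

The resulting vectors would then be fed to any downstream classifier (e.g., logistic regression over the node labels), which is how the abstract's classification evaluation is framed.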
Citations: 0
The finite mixture model for the tails of distribution: Monte Carlo experiment and empirical applications
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-28 | DOI: 10.1002/sam.11671
Marilena Furno, Francesco Caracciolo
The finite mixture model estimates regression coefficients that are distinct in each of the groups of the dataset endogenously determined by this estimator. In what follows, the analysis is extended beyond the mean by estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces overall heterogeneity, since the model is estimated on groups of similar observations, the analysis in the tails uncovers within-group heterogeneity and/or skewness. Integrating the endogenously determined clustering with quantile regression analysis within each group enhances the finite mixture model and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well-known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within- and between-group heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. Extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within-group heterogeneity and asymmetry that would otherwise go undetected. It improves efficiency and explanatory power with respect to the standard OLS-based FMM.
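A compact illustration of the approach on synthetic two-group data (a stand-in for the finite mixture estimator, not the authors' implementation): observations are clustered with an EM-fitted Gaussian mixture and quantile regressions at q = 0.1, 0.5, and 0.9 are then estimated within each cluster, exposing within-group tail behavior.

```python
# Cluster first, then fit quantile regressions in the tails within each cluster.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0, 2, n)
group_true = rng.integers(0, 2, n)
# two latent groups with different slopes and different error spread
y = np.where(group_true == 0, 1 + 2 * x, 4 - 1 * x) + rng.normal(0, 0.3 + 0.7 * group_true, n)

labels = GaussianMixture(n_components=2, random_state=0).fit_predict(np.column_stack([x, y]))

for g in np.unique(labels):
    xg, yg = x[labels == g], y[labels == g]
    Xg = sm.add_constant(xg)
    for q in (0.1, 0.5, 0.9):
        params = QuantReg(yg, Xg).fit(q=q).params
        print(f"group {g}, q={q}: intercept={params[0]:.2f}, slope={params[1]:.2f}")
```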
Citations: 0
Smart data augmentation: One equation is all you need
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-28 | DOI: 10.1002/sam.11672
Yuhao Zhang, Lu Tang, Yuxiao Huang, Yan Ma
Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach for handling class imbalance that can be uniformly applied. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is a single equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi-level class imbalance and allowing easy fine-tuning. This framework allows SDA to be seen as a generalization of traditional methods, which in turn can be viewed as specific cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of the most popular classifiers, such as random forest, multi-layer perceptron, and histogram-based gradient boosting.
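The paper's single-equation formulation is not reproduced here, but the following hedged sketch conveys the flavor: minority samples are synthesized by interpolating between same-class points, and a single knob lam_max (a hypothetical parameter) moves between plain duplication (lam_max = 0) and SMOTE-style interpolation (lam_max = 1), tuned by downstream classification accuracy.

```python
# One tunable interpolation parameter spans duplication-style and SMOTE-style
# oversampling; downstream accuracy guides the choice.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def augment_minority(X, y, lam_max=1.0, n_new=200, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.bincount(y).argmin()
    Xm = X[y == minority]
    i = rng.integers(0, len(Xm), n_new)
    j = rng.integers(0, len(Xm), n_new)
    lam = rng.uniform(0, lam_max, n_new)[:, None]
    X_new = Xm[i] + lam * (Xm[j] - Xm[i])        # convex combinations of minority points
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])

X, y = make_classification(n_samples=600, weights=[0.93, 0.07], random_state=0)
for lam_max in (0.0, 0.5, 1.0):                  # tune the single augmentation knob
    Xa, ya = augment_minority(X, y, lam_max=lam_max)
    # note: for a rigorous comparison, augmentation should be applied inside each
    # training fold only; this quick CV is just for illustration
    score = cross_val_score(RandomForestClassifier(random_state=0), Xa, ya,
                            scoring="balanced_accuracy", cv=5).mean()
    print(f"lam_max={lam_max}: balanced accuracy={score:.3f}")
```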
Citations: 0
Compositional variable selection in quantile regression for microbiome data with false discovery rate control
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-28 | DOI: 10.1002/sam.11674
Runze Li, Jin Mu, Songshan Yang, Cong Ye, Xiang Zhan
Advances in high-throughput sequencing technologies have stimulated intense research interest in identifying specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both from the perspective of understanding biology and from the biomedical perspective of therapeutic development, as the microbiome is inherently modifiable. Despite the availability of massive data, the analysis of microbiome compositional data remains difficult. The fact that the relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high-dimensional settings, where a common research theme is to select a small fraction of signals from amid many noisy features. Motivated by studies examining the role of the microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expression. Besides accommodating the compositional nature of microbiome data, our method both achieves FDR-controlled variable selection and captures heterogeneity due to either heteroscedastic variance or non-location-scale covariate effects displayed in the motivating dataset. We demonstrate the superior performance of our method using extensive numerical simulation studies and then apply it to real-world microbiome data analysis to gain novel biological insights that are missed by traditional mean-based linear regression analysis.
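A simplified stand-in for the workflow (not the paper's joint FDR-controlled selector): compositions are mapped through the centered log-ratio transform, each taxon is screened with a median quantile regression, and Benjamini-Hochberg is applied to the marginal p-values.

```python
# clr transform + per-taxon quantile regression + BH correction: a crude baseline
# illustrating the ingredients (compositionality, quantiles, FDR), not the method.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n, p = 200, 30
counts = rng.gamma(shape=2.0, scale=1.0, size=(n, p))
comp = counts / counts.sum(axis=1, keepdims=True)               # compositions sum to one
clr = np.log(comp) - np.log(comp).mean(axis=1, keepdims=True)   # centered log-ratio
y = 1.5 * clr[:, 0] - 1.0 * clr[:, 1] + rng.standard_t(df=3, size=n)  # heavy-tailed noise

pvals = []
for j in range(p):
    res = QuantReg(y, sm.add_constant(clr[:, j])).fit(q=0.5)
    pvals.append(res.pvalues[1])                                # p-value of the taxon effect

reject, p_adj, _, _ = multipletests(pvals, alpha=0.1, method="fdr_bh")
print("taxa passing FDR:", np.where(reject)[0])
```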
Citations: 0
Non-uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-27 | DOI: 10.1002/sam.11675
Kevin R. Quinlan, Jagadeesh Movva, Brad Perfect
The ability to non-uniformly weight the input space is desirable for many applications and has been explored for space-filling approaches. Increased interest in linking models, such as in a digital twinning framework, increases the need to sample emulators where they are most likely to be evaluated. In particular, we apply non-uniform sampling methods to the construction of aerodynamic databases. This paper combines non-uniform weighting with active learning for Gaussian processes (GPs) to develop a closed-form solution to a non-uniform active learning criterion. We accomplish this by utilizing a kernel density estimator as the weight function. We demonstrate the need for and efficacy of this approach with an atmospheric entry example that accounts for both model uncertainty and the practical state space of the vehicle, as determined by forward modeling within the active learning loop.
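A small sketch of the density-weighted criterion on an assumed one-dimensional toy emulator: a kernel density estimate of the states visited by the trajectory multiplies the GP predictive standard deviation, so new simulator runs are requested where the emulator is both uncertain and likely to be evaluated. The weighting and acquisition rule here are simplifications, not the paper's closed-form criterion.

```python
# Density-weighted active learning for a GP emulator on a toy 1-D problem.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neighbors import KernelDensity

def simulator(x):                     # stand-in for the expensive physics code
    return np.sin(5 * x) * np.exp(-x)

rng = np.random.default_rng(0)
trajectory = rng.normal(loc=0.7, scale=0.15, size=(300, 1))   # states the vehicle visits
kde = KernelDensity(bandwidth=0.1).fit(trajectory)

grid = np.linspace(0, 2, 500).reshape(-1, 1)
weight = np.exp(kde.score_samples(grid))                      # trajectory density w(x)

X = np.array([[0.1], [1.0], [1.9]])
y = simulator(X).ravel()
for it in range(8):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), normalize_y=True).fit(X, y)
    _, sd = gp.predict(grid, return_std=True)
    acq = weight * sd                                         # non-uniform uncertainty criterion
    x_next = grid[np.argmax(acq)]
    X = np.vstack([X, [x_next]])
    y = np.append(y, simulator(x_next[0]))

print("design points concentrate near the trajectory:", np.round(np.sort(X.ravel()), 2))
```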
Citations: 0
eRPCA: Robust Principal Component Analysis for Exponential Family Distributions
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-27 | DOI: 10.1002/sam.11670
Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie
Robust principal component analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes of anomalies, and the joint identification of such corruptions with the low-rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution of the data matrices, which in many applications is known and can be highly non-Gaussian. We thus propose a new method called RPCA for exponential family distributions (eRPCA), which can perform the desired decomposition into low-rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multipliers optimization algorithm for efficient decomposition, under either its natural or canonical parametrization. The effectiveness of eRPCA is then demonstrated in two applications: the first for steel sheet defect detection and the second for crime activity monitoring in the Atlanta metropolitan area.
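For orientation, here is classic robust PCA by principal component pursuit solved with a simple ADMM-style alternation; this is the Gaussian special case that eRPCA generalizes by replacing the quadratic data fit with an exponential-family likelihood, so it is a baseline sketch rather than the proposed algorithm.

```python
# Principal component pursuit: M ~ L + S via singular-value and entrywise
# soft-thresholding, with standard default penalty choices.
import numpy as np

def soft(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca(M, lam=None, mu=None, iters=200):
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / np.abs(M).sum()
    S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft(sig, 1.0 / mu)) @ Vt          # singular-value thresholding
        S = soft(M - L + Y / mu, lam / mu)          # entrywise shrinkage of outliers
        Y = Y + mu * (M - L - S)                    # dual update
    return L, S

rng = np.random.default_rng(0)
L_true = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 80))   # rank-5 background
S_true = np.zeros((60, 80))
mask = rng.random((60, 80)) < 0.05
S_true[mask] = rng.normal(scale=10, size=mask.sum())           # sparse outliers
L_hat, S_hat = rpca(L_true + S_true)
print("low-rank recovery error:", np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true))
```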
Citations: 0
Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-26 | DOI: 10.1002/sam.11673
Raydonal Ospina, Ranah Duarte Costa, Leandro Chaves Rêgo, Fernando Marmolejo‐Ramos
This work explores the use of nonparametric quantifiers in the verification of handwritten signatures. We used the MCYT-100 (MCYT Fingerprint subcorpus) database, widely used in signature verification problems. The discrete-time sequence positions on the x-axis and y-axis provided in the database are preprocessed, and time-causal information based on nonparametric quantifiers such as entropy, complexity, Fisher information, and trend is employed. The study also proposes to evaluate these quantifiers on the time series obtained by applying the first and second derivatives of each sequence position, assessing the dynamic behavior through the velocity and acceleration regimes, respectively. The signatures in the MCYT-100 database are classified via Logistic Regression, Support Vector Machines (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost). The quantifiers were used as input features to train the classifiers. To assess the ability and impact of nonparametric quantifiers in distinguishing forged from genuine signatures, we used variable selection criteria such as information gain, analysis of variance, and the variance inflation factor. The performance of the classifiers was evaluated using measures of classification error such as specificity and area under the curve. The results show that the SVM and XGBoost classifiers present the best performance.
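As an example of one such nonparametric quantifier, the sketch below computes Bandt-Pompe permutation entropy on a signal and on its first and second differences (the velocity and acceleration regimes) and feeds the features to an SVM. The signals are synthetic stand-ins, not MCYT-100 data, and the complexity, Fisher information, and trend quantifiers are omitted.

```python
# Permutation entropy of position/velocity/acceleration sequences as features
# for a toy genuine-vs-forged classifier.
import math
import numpy as np
from itertools import permutations
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def permutation_entropy(x, order=3):
    patterns = {p: 0 for p in permutations(range(order))}
    for i in range(len(x) - order + 1):
        patterns[tuple(np.argsort(x[i:i + order]))] += 1     # ordinal pattern counts
    counts = np.array([c for c in patterns.values() if c > 0], dtype=float)
    probs = counts / counts.sum()
    return -(probs * np.log(probs)).sum() / math.log(math.factorial(order))

def features(sig):
    # position, velocity (first difference), acceleration (second difference)
    return [permutation_entropy(s) for s in (sig, np.diff(sig), np.diff(sig, 2))]

rng = np.random.default_rng(0)
# hypothetical stand-ins: genuine signatures are smoother, forgeries noisier
genuine = [np.cumsum(rng.normal(size=300)) for _ in range(40)]
forged = [np.cumsum(rng.normal(size=300)) + rng.normal(scale=0.8, size=300) for _ in range(40)]
X = np.array([features(s) for s in genuine + forged])
y = np.array([0] * 40 + [1] * 40)
print("CV accuracy:", cross_val_score(SVC(), X, y, cv=5).mean())
```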
Citations: 0
Online learning for streaming data classification in nonstationary environments
IF 1.3 | CAS Tier 4 (Mathematics) | Q2 Mathematics | Pub Date: 2024-03-09 | DOI: 10.1002/sam.11669
Yujie Gai, Kang Meng, Xiaodi Wang
In this article, we address the classification of nonstationary streaming data. Because full data cannot be obtained in a streaming setting, we adopt a strategy based on clustering structure for data classification. Specifically, this strategy dynamically maintains clustering structures to update the model, thereby updating the objective function for classification. Simultaneously, incoming samples are monitored in real time to identify the emergence of new classes or the presence of outliers. Moreover, this strategy can also deal with the concept drift problem, where the distribution of data changes as data flow in. For the handling of novel instances, we introduce a buffer analysis mechanism to delay their processing, which in turn improves the prediction performance of the model. In the process of model updating, we also introduce a novel renewable strategy for the covariance matrix. Numerical simulations and experiments on datasets show that our method has significant advantages.
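A compact sketch of the overall strategy with assumed thresholds and buffer size (the paper's covariance-renewal and drift-handling details are not reproduced): per-class centroids are maintained incrementally, points are classified by the nearest centroid, points far from every centroid are buffered rather than labeled immediately, and a full buffer spawns a new class.

```python
# Incremental nearest-centroid classification with a novelty buffer; radius and
# buffer size are illustrative assumptions.
import numpy as np

class StreamingCentroidClassifier:
    def __init__(self, radius=3.0, buffer_size=20):
        self.centroids = {}           # label -> (running mean, count)
        self.radius = radius
        self.buffer = []
        self.buffer_size = buffer_size
        self.next_label = 0

    def _nearest(self, x):
        dists = {c: np.linalg.norm(x - m) for c, (m, _) in self.centroids.items()}
        label = min(dists, key=dists.get)
        return label, dists[label]

    def partial_fit(self, x, label):
        if label not in self.centroids:
            self.centroids[label] = (np.array(x, dtype=float), 1)
            self.next_label = max(self.next_label, label + 1)
        else:
            m, k = self.centroids[label]
            self.centroids[label] = (m + (x - m) / (k + 1), k + 1)   # running mean update

    def predict(self, x):
        if not self.centroids:
            return None
        label, dist = self._nearest(x)
        if dist > self.radius:                        # possible outlier or novel class
            self.buffer.append(np.asarray(x, float))
            if len(self.buffer) >= self.buffer_size:  # enough evidence: spawn a new class
                new = self.next_label
                for b in self.buffer:
                    self.partial_fit(b, new)
                self.buffer.clear()
                return new
            return None                               # defer the decision
        self.partial_fit(x, label)                    # update the matched cluster
        return label

rng = np.random.default_rng(0)
clf = StreamingCentroidClassifier()
for x in rng.normal(loc=0, size=(50, 2)):
    clf.partial_fit(x, 0)
preds = [clf.predict(x) for x in rng.normal(loc=8, size=(30, 2))]   # drifted/new concept
print("novel class assigned after buffering:", preds[-1])
```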
Citations: 0