With the increasing availability of biomedical data from multiple platforms for the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two‐stage hierarchical Bayesian model that integrates high‐dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use an Expectation Maximization‐based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage and then apply a group‐wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model‐based data integration method yields fewer false positives in selecting predictive variables than an existing method. Real data analysis of a glioblastoma (GBM) dataset shows that our method detects genes associated with GBM survival more accurately than the existing method, and most of the selected biomarkers are crucial in GBM prognosis, as confirmed by the existing literature.
{"title":"Bayesian shrinkage models for integration and analysis of multiplatform high‐dimensional genomics data","authors":"Hao Xue, Sounak Chakraborty, Tanujit Dey","doi":"10.1002/sam.11682","DOIUrl":"https://doi.org/10.1002/sam.11682","url":null,"abstract":"With the increasing availability of biomedical data from multiple platforms of the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two‐stage hierarchical Bayesian model that integrates high‐dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use Expectation Maximization‐based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage. Then, we apply a group‐wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model‐based data integration method shows lower false positives in selecting predictive variables compared with existing method. Moreover, real data analysis based on a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method. Moreover, most of the selected biomarkers are crucial in GBM prognosis as confirmed by existing literature.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"8 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140603167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nuclear data are fundamental inputs to the radiation transport codes used for reactor design and criticality safety. Designing experiments to reduce nuclear data uncertainty has been a challenge for many years, but advances in the sensitivity calculations of radiation transport codes over the last two decades have made optimal experimental design possible. The design of integral nuclear experiments poses numerous challenges not emphasized in classical optimal design, in particular constrained design spaces (in both a statistical and an engineering sense), severely under‐determined systems, and optimality uncertainty. We present a design pipeline that optimizes critical experiments using constrained Bayesian optimization within an iterative expert‐in‐the‐loop framework. We show a successfully completed experimental campaign designed with this framework, involving two critical configurations and multiple measurements that targeted compensating errors in 239Pu nuclear data.
{"title":"Expert‐in‐the‐loop design of integral nuclear data experiments","authors":"Isaac Michaud, Michael Grosskopf, Jesson Hutchinson, Scott Vander Wiel","doi":"10.1002/sam.11677","DOIUrl":"https://doi.org/10.1002/sam.11677","url":null,"abstract":"Nuclear data are fundamental inputs to radiation transport codes used for reactor design and criticality safety. The design of experiments to reduce nuclear data uncertainty has been a challenge for many years, but advances in the sensitivity calculations of radiation transport codes within the last two decades have made optimal experimental design possible. The design of integral nuclear experiments poses numerous challenges not emphasized in classical optimal design, in particular, constrained design spaces (in both a statistical and engineering sense), severely under‐determined systems, and optimality uncertainty. We present a design pipeline to optimize critical experiments that uses constrained Bayesian optimization within an iterative expert‐in‐the‐loop framework. We show a successfully completed experiment campaign designed with this framework that involved two critical configurations and multiple measurements that targeted compensating errors in <jats:sup>239</jats:sup>Pu nuclear data.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"46 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140570932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over the last two decades we have witnessed a huge increase in valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data, it is necessary to transform graphs into vector‐based representations that preserve their most essential structural properties. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general‐purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualization, and link prediction. In this article, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. The random walk sampling strategies of the proposed algorithms pay special attention to hubs, the high‐degree nodes that are most critical for the overall connectedness of large‐scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real‐world networks. The obtained results indicate that our methods considerably improve the predictive power of the examined classifiers compared with node2vec, currently the most popular random walk method for generating general‐purpose graph embeddings.
{"title":"Hub‐aware random walk graph embedding methods for classification","authors":"Aleksandar Tomčić, Miloš Savić, Miloš Radovanović","doi":"10.1002/sam.11676","DOIUrl":"https://doi.org/10.1002/sam.11676","url":null,"abstract":"In the last two decades, we are witnessing a huge increase of valuable big data structured in the form of graphs or networks. To apply traditional machine learning and data analytic techniques to such data it is necessary to transform graphs into vector‐based representations that preserve the most essential structural properties of graphs. For this purpose, a large number of graph embedding methods have been proposed in the literature. Most of them produce general‐purpose embeddings suitable for a variety of applications such as node clustering, node classification, graph visualization and link prediction. In this article, we propose two novel graph embedding algorithms based on random walks that are specifically designed for the node classification problem. Random walk sampling strategies of the proposed algorithms have been designed to pay special attention to hubs–high‐degree nodes that have the most critical role for the overall connectedness in large‐scale graphs. The proposed methods are experimentally evaluated by analyzing the classification performance of three classification algorithms trained on embeddings of real‐world networks. The obtained results indicate that our methods considerably improve the predictive power of examined classifiers compared with currently the most popular random walk method for generating general‐purpose graph embeddings (node2vec).","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"60 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140570935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The finite mixture model estimates distinct regression coefficients in each of the groups of the dataset that are endogenously determined by the estimator itself. Here the analysis is extended beyond the mean by estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces the overall heterogeneity, since the model is estimated on groups of similar observations, the analysis in the tails uncovers within‐group heterogeneity and/or skewness. Integrating the endogenously determined clustering with quantile regression within each group enhances the finite mixture model and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well‐known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within‐ and between‐group heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. Extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within‐group heterogeneity and asymmetry otherwise left undetected, and it improves efficiency and explanatory power with respect to the standard OLS‐based FMM.
{"title":"The finite mixture model for the tails of distribution: Monte Carlo experiment and empirical applications","authors":"Marilena Furno, Francesco Caracciolo","doi":"10.1002/sam.11671","DOIUrl":"https://doi.org/10.1002/sam.11671","url":null,"abstract":"The finite mixture model estimates regression coefficients distinct in each of the different groups of the dataset endogenously determined by this estimator. In what follows the analysis is extended beyond the mean, estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces the overall heterogeneity, since the model is estimated for groups of similar observations, the analysis in the tails uncovers within groups heterogeneity and/or skewness. By integrating the endogenously determined clustering with the quantile regression analysis within each group, enhances the finite mixture models and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well‐known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each one with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within and between groups heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. By extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within groups heterogeneity and asymmetry otherwise left undetected. It improves efficiency and explanatory power with respect to the standard OLS‐based FMM.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"234 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Class imbalance is a common and critical challenge in machine learning classification problems, often resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach to handling class imbalance that can be applied uniformly. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way so as to maximize downstream classification accuracy. The key novelty of SDA is an equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi‐level class imbalance while allowing easy fine‐tuning. This framework allows SDA to be seen as a generalization of traditional methods, which in turn can be viewed as special cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of the most popular classifiers, such as random forest, multi‐layer perceptron, and histogram‐based gradient boosting.
{"title":"Smart data augmentation: One equation is all you need","authors":"Yuhao Zhang, Lu Tang, Yuxiao Huang, Yan Ma","doi":"10.1002/sam.11672","DOIUrl":"https://doi.org/10.1002/sam.11672","url":null,"abstract":"Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no one single best approach for handling class imbalance that can be uniformly applied. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is an equation that can bring about an augmentation method that provides a unified representation of existing sampling methods for handling multi‐level class imbalance and allows easy fine‐tuning. This framework allows SDA to be seen as a generalization of traditional methods, which in turn can be viewed as specific cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA could significantly improve the performance of the most popular classifiers such as random forest, multi‐layer perceptron, and histogram‐based gradient boosting.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"234 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advances in high‐throughput sequencing technologies have stimulated intensive research interest in identifying specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both for understanding biology and for the biomedical goal of therapeutic development, as the microbiome is inherently modifiable. Despite the availability of massive data, the analysis of microbiome compositional data remains difficult. The fact that the relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high‐dimensional settings, where a common research theme is to select a small fraction of signals from amid many noisy features. Motivated by studies examining the role of the microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expression. Besides accommodating the compositional nature of microbiome data, our method both achieves FDR‐controlled variable selection and captures heterogeneity due to either heteroscedastic variance or non‐location‐scale covariate effects displayed in the motivating dataset. We demonstrate the superior performance of our method in extensive numerical simulation studies and then apply it to a real‐world microbiome data analysis to gain novel biological insights that are missed by traditional mean‐based linear regression analysis.
{"title":"Compositional variable selection in quantile regression for microbiome data with false discovery rate control","authors":"Runze Li, Jin Mu, Songshan Yang, Cong Ye, Xiang Zhan","doi":"10.1002/sam.11674","DOIUrl":"https://doi.org/10.1002/sam.11674","url":null,"abstract":"Advancement in high‐throughput sequencing technologies has stimulated intensive research interests to identify specific microbial taxa that are associated with disease conditions. Such knowledge is invaluable both from the perspective of understanding biology and from the biomedical perspective of therapeutic development, as the microbiome is inherently modifiable. Despite availability of massive data, analysis of microbiome compositional data remains difficult. The nature that relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high‐dimensional settings, where a common research theme is to select a small fraction of signals from amid many noisy features. Motivated by studies examining the role of microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expressions. Besides accommodating compositional nature of microbiome data, our method both achieves FDR‐controlled variable selection, and captures heterogeneity due to either heteroscedastic variance or non‐location‐scale covariate effects displayed in the motivating dataset. We demonstrate the superior performance of our method using extensive numerical simulation studies and then apply it to real‐world microbiome data analysis to gain novel biological insights that are missed by traditional mean‐based linear regression analysis.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"1 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140322363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ability to non‐uniformly weight the input space is desirable for many applications and has been explored for space‐filling approaches. Increased interest in linking models, such as within a digital twinning framework, heightens the need to sample emulators where they are most likely to be evaluated. In particular, we apply non‐uniform sampling methods to the construction of aerodynamic databases. This paper combines non‐uniform weighting with active learning for Gaussian processes (GPs) to develop a closed‐form solution to a non‐uniform active learning criterion, which we accomplish by utilizing a kernel density estimator as the weight function. We demonstrate the need for and efficacy of this approach with an atmospheric entry example that accounts for both model uncertainty and the practical state space of the vehicle, as determined by forward modeling within the active learning loop.
{"title":"Non‐uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases","authors":"Kevin R. Quinlan, Jagadeesh Movva, Brad Perfect","doi":"10.1002/sam.11675","DOIUrl":"https://doi.org/10.1002/sam.11675","url":null,"abstract":"The ability to non‐uniformly weight the input space is desirable for many applications, and has been explored for space‐filling approaches. Increased interests in linking models, such as in a digital twinning framework, increases the need for sampling emulators where they are most likely to be evaluated. In particular, we apply non‐uniform sampling methods for the construction of aerodynamic databases. This paper combines non‐uniform weighting with active learning for Gaussian Processes (GPs) to develop a closed‐form solution to a non‐uniform active learning criterion. We accomplish this by utilizing a kernel density estimator as the weight function. We demonstrate the need and efficacy of this approach with an atmospheric entry example that accounts for both model uncertainty as well as the practical state space of the vehicle, as determined by forward modeling within the active learning loop.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"16 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140316788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust principal component analysis (RPCA) is a widely used method for recovering low‐rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes of anomalies, and the joint identification of such corruptions with the low‐rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probability distribution of the data matrices, which in many applications is known and can be highly non‐Gaussian. We thus propose a new method called RPCA for exponential family distributions (eRPCA), which can perform the desired decomposition into low‐rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multipliers optimization algorithm for efficient decomposition, under either the natural or the canonical parametrization. The effectiveness of eRPCA is then demonstrated in two applications: the first for steel sheet defect detection and the second for crime activity monitoring in the Atlanta metropolitan area.
{"title":"eRPCA: Robust Principal Component Analysis for Exponential Family Distributions","authors":"Xiaojun Zheng, Simon Mak, Liyan Xie, Yao Xie","doi":"10.1002/sam.11670","DOIUrl":"https://doi.org/10.1002/sam.11670","url":null,"abstract":"Robust principal component analysis (RPCA) is a widely used method for recovering low‐rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes for anomalies, and the joint identification of such corruptions with low‐rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution for the data matrices, which in many applications are known and can be highly non‐Gaussian. We thus propose a new method called RPCA for exponential family distributions (), which can perform the desired decomposition into low‐rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multiplier optimization algorithm for efficient decomposition, under either its natural or canonical parametrization. The effectiveness of is then demonstrated in two applications: the first for steel sheet defect detection and the second for crime activity monitoring in the Atlanta metropolitan area.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"71 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work explores the use of nonparametric quantifiers for the verification of handwritten signatures. We use the MCYT‐100 (MCYT Fingerprint subcorpus) database, which is widely used in signature verification problems. The discrete‐time sequences of x‐axis and y‐axis positions provided in the database are preprocessed, and time‐causal information based on nonparametric quantifiers such as entropy, complexity, Fisher information, and trend is employed. The study also evaluates these quantifiers on the first and second derivatives of each position sequence, capturing the dynamic behavior of the velocity and acceleration regimes, respectively. The signatures in the MCYT‐100 database are classified via logistic regression, support vector machines (SVM), random forest, and extreme gradient boosting (XGBoost), with the quantifiers used as input features to train the classifiers. To assess the ability of the nonparametric quantifiers to distinguish forged from genuine signatures, we use variable selection criteria such as information gain, analysis of variance, and the variance inflation factor. The performance of the classifiers is evaluated by measures of classification error such as specificity and area under the curve. The results show that the SVM and XGBoost classifiers perform best.
{"title":"Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach","authors":"Raydonal Ospina, Ranah Duarte Costa, Leandro Chaves Rêgo, Fernando Marmolejo‐Ramos","doi":"10.1002/sam.11673","DOIUrl":"https://doi.org/10.1002/sam.11673","url":null,"abstract":"This work explores the use of nonparametric quantifiers in the signature verification problem of handwritten signatures. We used the MCYT‐100 (MCYT Fingerprint subcorpus) database, widely used in signature verification problems. The discrete‐time sequence positions in the <jats:italic>x</jats:italic> ‐axis and <jats:italic>y</jats:italic>‐axis provided in the database are preprocessed, and time causal information based on nonparametric quantifiers such as entropy, complexity, Fisher information, and trend are employed. The study also proposes to evaluate these quantifiers with the time series obtained, applying the first and second derivatives of each sequence position to evaluate the dynamic behavior by looking at their velocity and acceleration regimes, respectively. The signatures in the MCYT‐100 database are classified via Logistic Regression, Support Vector Machines (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost). The quantifiers were used as input features to train the classifiers. To assess the ability and impact of nonparametric quantifiers to distinguish forgery and genuine signatures, we used variable selection criteria, such as: information gain, analysis of variance, and variance inflation factor. The performance of classifiers was evaluated by measures of classification error such as specificity and area under the curve. The results show that the SVM and XGBoost classifiers present the best performance.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"13 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this article, we address the classification of nonstationary streaming data. Because the full dataset cannot be obtained in a streaming setting, we adopt a strategy based on clustering structure for data classification. Specifically, this strategy dynamically maintains clustering structures to update the model, thereby updating the objective function for classification. Simultaneously, incoming samples are monitored in real time to identify the emergence of new classes or the presence of outliers. The strategy can also deal with the concept drift problem, where the distribution of the data changes as new data flow in. For the handling of novel instances, we introduce a buffer analysis mechanism that delays their processing, which in turn improves the prediction performance of the model. In the process of model updating, we also introduce a novel renewable strategy for the covariance matrix. Numerical simulations and experiments on datasets show that our method has significant advantages.
{"title":"Online learning for streaming data classification in nonstationary environments","authors":"Yujie Gai, Kang Meng, Xiaodi Wang","doi":"10.1002/sam.11669","DOIUrl":"https://doi.org/10.1002/sam.11669","url":null,"abstract":"In this article, we implement the classification of nonstationary streaming data. Due to the inability to obtain full data in the context of streaming data, we adopt a strategy based on clustering structure for data classification. Specifically, this strategy involves dynamically maintaining clustering structures to update the model, thereby updating the objective function for classification. Simultaneously, incoming samples are monitored in real-time to identify the emergence of new classes or the presence of outliers. Moreover, this strategy can also deal with the concept drift problem, where the distribution of data changes with the inflow of data. Regarding the handling of novel instances, we introduce a buffer analysis mechanism to delay their processing, which in turn improves the prediction performance of the model. In the process of model updating, we also introduce a novel renewable strategy for the covariance matrix. Numerical simulations and experiments on datasets show that our method has significant advantages.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"36 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140070400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}