Clustering large mixed-type data with ordinal variables
Pub Date: 2024-05-27 | DOI: 10.1007/s11634-024-00595-5
Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm
One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance is another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. In this paper, a modification of the k-prototypes algorithm to Gower’s distance is proposed that ensures convergence. This provides a tool that takes ordinal information into account for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results, and small runtimes.
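The assign-update loop is easy to sketch. Below is a minimal NumPy illustration (not the authors’ implementation, and the specific prototype updates that the paper proves convergent may differ): numeric columns are floats, categorical columns are integer codes, and ordinal columns are assumed pre-scaled to ranks in [0, 1].

```python
import numpy as np

def gower_to_prototypes(X_num, X_cat, X_ord, P_num, P_cat, P_ord, num_range):
    # Gower: mean of per-variable dissimilarities, each scaled to [0, 1]
    d_num = np.abs(X_num[:, None, :] - P_num[None, :, :]) / num_range  # numeric
    d_cat = (X_cat[:, None, :] != P_cat[None, :, :]).astype(float)     # mismatch
    d_ord = np.abs(X_ord[:, None, :] - P_ord[None, :, :])              # scaled ranks
    return np.concatenate([d_num, d_cat, d_ord], axis=2).mean(axis=2)  # (n, k)

def kprototypes_gower(X_num, X_cat, X_ord, k, n_iter=25, seed=0):
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    num_range = X_num.max(axis=0) - X_num.min(axis=0)
    num_range[num_range == 0] = 1.0                  # guard constant columns
    start = rng.choice(n, size=k, replace=False)
    P_num, P_cat, P_ord = X_num[start].copy(), X_cat[start].copy(), X_ord[start].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        D = gower_to_prototypes(X_num, X_cat, X_ord, P_num, P_cat, P_ord, num_range)
        new = D.argmin(axis=1)
        if (new == labels).all():                    # assignments stable: stop
            break
        labels = new
        for j in range(k):
            m = labels == j
            if not m.any():
                continue
            P_num[j] = X_num[m].mean(axis=0)         # numeric: mean
            P_ord[j] = np.median(X_ord[m], axis=0)   # ordinal ranks: median
            for c in range(X_cat.shape[1]):          # categorical: mode
                vals, cnt = np.unique(X_cat[m, c], return_counts=True)
                P_cat[j, c] = vals[cnt.argmax()]
    return labels
```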
{"title":"Clustering large mixed-type data with ordinal variables","authors":"Gero Szepannek, Rabea Aschenbruck, Adalbert Wilhelm","doi":"10.1007/s11634-024-00595-5","DOIUrl":"https://doi.org/10.1007/s11634-024-00595-5","url":null,"abstract":"<p>One of the most frequently used algorithms for clustering data with both numeric and categorical variables is the k-prototypes algorithm, an extension of the well-known k-means clustering. Gower’s distance denotes another popular approach for dealing with mixed-type data and is suitable not only for numeric and categorical but also for ordinal variables. In the paper a modification of the k-prototypes algorithm to Gower’s distance is proposed that ensures convergence. This provides a tool that allows to take into account ordinal information for clustering and can also be used for large data. A simulation study demonstrates convergence, good clustering results as well as small runtimes.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"46 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141167174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A two-group canonical variate analysis biplot for an optimal display of both means and cases
Pub Date: 2024-05-06 | DOI: 10.1007/s11634-024-00593-7
Niel le Roux, Sugnet Gardner-Lubbe
Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, J, is less than the number of variables, p, at most J-1 eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be represented exactly in a one-dimensional biplot, but the individual samples are only approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case, we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, while the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within-group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot for the equally important case of comparing three groups in practice.
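For reference, one standard CVA formulation that makes the rank bound explicit (notation assumed here, not taken from the paper):

```latex
% B = between-group, W = within-group covariance (p x p), with group
% means \bar{x}_j, grand mean \bar{x}, and group sizes n_j:
\[
  B = \sum_{j=1}^{J} n_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^{\top},
  \qquad
  W = \sum_{j=1}^{J} \sum_{i \in G_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^{\top}.
\]
% The canonical directions solve the two-sided eigenproblem
\[
  B w = \lambda W w ,
  \qquad
  \operatorname{rank}(B) \le J - 1
  \;\Rightarrow\; \text{at most } J - 1 \text{ nonzero eigenvalues.}
\]
```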
{"title":"A two-group canonical variate analysis biplot for an optimal display of both means and cases","authors":"Niel le Roux, Sugnet Gardner-Lubbe","doi":"10.1007/s11634-024-00593-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00593-7","url":null,"abstract":"<p>Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, <i>J</i>, is less than the number of variables, <i>p</i>, at most <span>(J-1)</span> eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of the two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be exactly represented in a one-dimensional biplot but the individual samples are approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, but the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot when encountering the equally important case of comparing three groups in practice.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"15 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140888158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering functional data via variational inference
Pub Date: 2024-04-30 | DOI: 10.1007/s11634-024-00590-w
Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias
Among the various functional data analyses, clustering analysis aims to determine underlying groups of curves in a dataset when there is no information on the group membership of each curve. In this work, we develop a novel variational Bayes (VB) algorithm for clustering and smoothing functional data simultaneously via a B-spline regression mixture model with random intercepts. We employ the deviance information criterion to select the best number of clusters. The proposed VB algorithm is evaluated and compared with other methods (k-means, functional k-means and two other model-based methods) via a simulation study under various scenarios. We apply our proposed methodology to two publicly available datasets. We demonstrate that the proposed VB algorithm achieves satisfactory clustering performance in both simulation and real data analyses.
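As a rough illustration of the underlying idea of a mixture of regressions on a spline-type basis, here is an EM sketch for curves observed on a common grid (EM is swapped in for the paper’s VB algorithm, the random intercepts are omitted, and a truncated power basis stands in for B-splines):

```python
import numpy as np

def trunc_power_basis(t, knots, degree=3):
    # 1, t, ..., t^degree plus (t - kappa)_+^degree per interior knot
    cols = [t**d for d in range(degree + 1)]
    cols += [np.clip(t - kappa, 0.0, None)**degree for kappa in knots]
    return np.column_stack(cols)

def em_curve_clustering(Y, B, K, n_iter=100, seed=0):
    # Y: (n_curves, n_points) curves on a common grid; B: (n_points, q) basis
    rng = np.random.default_rng(seed)
    n, T = Y.shape
    resp = rng.dirichlet(np.ones(K), size=n)          # soft memberships
    for _ in range(n_iter):
        pis, betas, sig2 = [], [], []
        for j in range(K):                            # M-step per cluster
            w = resp[:, j]
            ybar = (w[:, None] * Y).sum(axis=0) / w.sum()    # weighted mean curve
            beta = np.linalg.lstsq(B, ybar, rcond=None)[0]   # cluster coefficients
            resid = Y - B @ beta
            sig2.append((w[:, None] * resid**2).sum() / (w.sum() * T))
            pis.append(w.mean()); betas.append(beta)
        # E-step: Gaussian log-density of each curve under each cluster
        logp = np.stack([np.log(pis[j])
                         - 0.5 * T * np.log(2 * np.pi * sig2[j])
                         - 0.5 * ((Y - B @ betas[j])**2).sum(axis=1) / sig2[j]
                         for j in range(K)], axis=1)
        logp -= logp.max(axis=1, keepdims=True)       # stabilise before exp
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), np.column_stack(betas)
```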
{"title":"Clustering functional data via variational inference","authors":"Chengqian Xian, Camila P. E. de Souza, John Jewell, Ronaldo Dias","doi":"10.1007/s11634-024-00590-w","DOIUrl":"https://doi.org/10.1007/s11634-024-00590-w","url":null,"abstract":"<p>Among different functional data analyses, clustering analysis aims to determine underlying groups of curves in the dataset when there is no information on the group membership of each curve. In this work, we develop a novel variational Bayes (VB) algorithm for clustering and smoothing functional data simultaneously via a B-spline regression mixture model with random intercepts. We employ the deviance information criterion to select the best number of clusters. The proposed VB algorithm is evaluated and compared with other methods (<i>k</i>-means, functional <i>k</i>-means and two other model-based methods) via a simulation study under various scenarios. We apply our proposed methodology to two publicly available datasets. We demonstrate that the proposed VB algorithm achieves satisfactory clustering performance in both simulation and real data analyses.\u0000</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"50 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140831385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings
Pub Date: 2024-04-26 | DOI: 10.1007/s11634-024-00594-6
Matteo Farnè
In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data and apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame by frame via the cosine distance. We then compare the metrical, temporal and timbral features of the two performances with MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find that Rajna’s performance is faster and less varied, in terms of intensity and timbre, than Howard’s. This enquiry is a case study showing the potential of MIR from audio data to support traditional music score analyses and to provide objective information for statistically grounded analyses of musical performances.
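A minimal sketch of this frame-by-frame novelty computation (SciPy only; the file name, window sizes and peak threshold are placeholder choices, not those used in the paper):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("etude_s136_no1.wav")       # hypothetical recording
if x.ndim > 1:
    x = x.mean(axis=1)                           # mixdown to mono
f, t, S = spectrogram(x, fs=fs, nperseg=4096, noverlap=2048)

# cosine distance between consecutive magnitude frames
Sn = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-12)
novelty = 1.0 - (Sn[:, 1:] * Sn[:, :-1]).sum(axis=0)

# candidate segment boundaries: local maxima above a threshold
thr = novelty.mean() + 2 * novelty.std()
peaks = [i for i in range(1, len(novelty) - 1)
         if novelty[i] > thr
         and novelty[i] >= novelty[i - 1]
         and novelty[i] >= novelty[i + 1]]
boundaries_sec = [t[i + 1] for i in peaks]       # boundary times in seconds
```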
{"title":"Liszt’s Étude S.136 no.1: audio data analysis of two different piano recordings","authors":"Matteo Farnè","doi":"10.1007/s11634-024-00594-6","DOIUrl":"10.1007/s11634-024-00594-6","url":null,"abstract":"<div><p>In this paper, we review the main signal processing tools of Music Information Retrieval (MIR) from audio data, and we apply them to two recordings (by Leslie Howard and Thomas Rajna) of Franz Liszt’s Étude S.136 no.1, with the aim of uncovering the macro-formal structure and comparing the interpretative styles of the two performers. In particular, after a thorough spectrogram analysis, we perform a segmentation based on the degree of novelty, in the sense of spectral dissimilarity, calculated frame-by-frame via the cosine distance. We then compare the metrical, temporal and timbrical features of the two executions by MIR tools. Via this method, we are able to identify in a data-driven way the different moments of the piece according to their melodic and harmonic content, and to find out that Rajna’s execution is faster and less various, in terms of intensity and timbre, than Howard’s one. This enquiry represents a case study able to show the potentialities of MIR from audio data in supporting traditional music score analyses and in providing objective information for statistically founded musical execution analyses.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"797 - 822"},"PeriodicalIF":1.4,"publicationDate":"2024-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11634-024-00594-6.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140801727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of internal evaluation criteria in hierarchical clustering of categorical data
Pub Date: 2024-04-13 | DOI: 10.1007/s11634-024-00592-8
Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova
The paper discusses eleven internal evaluation criteria that can be used in hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat cluster quality: variability-based and distance-based. The paper pursues three main aims. The first is to compare the examined criteria regarding their mutual similarity and their dependence on the clustered datasets’ properties and the similarity measures used. The second is to analyze the relationships between internal and external cluster evaluation, to determine how well the internal criteria can recognize the original number of clusters in datasets, and to what extent they provide results comparable to the external criteria. The third is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging cluster quality or determining the optimal number of clusters.
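As one concrete example of what a variability-based criterion can look like, here is a generic within-cluster entropy measure (illustrative only; the paper’s eleven criteria and the two new proposals are defined in the text):

```python
import numpy as np

def within_cluster_entropy(X, labels):
    """Size-weighted average within-cluster entropy over categorical variables.

    X: (n, p) array of category codes. Lower values indicate more
    homogeneous (lower-variability) clusters.
    """
    n, p = X.shape
    total = 0.0
    for g in np.unique(labels):
        Xg = X[labels == g]
        h = 0.0
        for j in range(p):
            _, counts = np.unique(Xg[:, j], return_counts=True)
            freq = counts / counts.sum()
            h += -(freq * np.log(freq)).sum()   # entropy of variable j in cluster g
        total += (len(Xg) / n) * (h / p)        # weight by cluster size
    return total
```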
{"title":"Comparison of internal evaluation criteria in hierarchical clustering of categorical data","authors":"Zdenek Sulc, Jaroslav Hornicek, Hana Rezankova, Jana Cibulkova","doi":"10.1007/s11634-024-00592-8","DOIUrl":"https://doi.org/10.1007/s11634-024-00592-8","url":null,"abstract":"<p>The paper discusses eleven internal evaluation criteria that can be used in the area of hierarchical clustering of categorical data. The criteria are divided into two distinct groups based on how they treat the cluster quality: variability- and distance-based. The paper follows three main aims. The first one is to compare the examined criteria regarding their mutual similarity and dependence on the clustered datasets’ properties and the used similarity measures. The second one is to analyze the relationships between internal and external cluster evaluation to determine how well the internal criteria can recognize the original number of clusters in datasets and to what extent they provide comparable results to the external criteria. The third aim is to propose two new variability-based internal evaluation criteria. In the experiment, 81 types of generated datasets with controlled properties are used. The results show which internal criteria can be recommended for specific tasks, such as judging the cluster quality or the optimal number of clusters determination.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"49 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140589022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multidimensional scaling for big data
Pub Date: 2024-04-13 | DOI: 10.1007/s11634-024-00591-9
Pedro Delicado, Cristian Pachón-García
We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using an n × n distance matrix as input, where n is the number of individuals, and producing a low-dimensional configuration: an n × r matrix with r ≪ n. When n is large, MDS is unaffordable with classical MDS algorithms because of their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces on which classical MDS methods can work. Two of these algorithms are original proposals. In order to check the performance of the algorithms as well as to compare them, we have done a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST, a real large data set with more than 800,000 points. We conclude that all the algorithms are appropriate for obtaining an MDS configuration, but we recommend one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An R package implementing the algorithms has been created.
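A minimal sketch of the partitioning idea behind such algorithms (a generic divide-and-conquer scheme with shared anchor points and Procrustes alignment; the six algorithms compared in the paper, including the recommended one, may differ in detail):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import orthogonal_procrustes

def classical_mds(D, r):
    # Torgerson: double-center the squared distances, keep top-r eigenpairs
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Bmat = -0.5 * J @ (D**2) @ J
    vals, vecs = np.linalg.eigh(Bmat)
    order = np.argsort(vals)[::-1][:r]
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def mds_by_chunks(X, r, chunk=1000, n_anchors=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), size=n_anchors, replace=False)]
    out = np.empty((len(X), r))
    ref = None
    for s in range(0, len(X), chunk):
        # embed the shared anchors together with each chunk
        block = np.vstack([anchors, X[s:s + chunk]])
        Y = classical_mds(cdist(block, block), r)
        A, Z = Y[:n_anchors], Y[n_anchors:]
        if ref is None:
            ref = A                        # first chunk fixes the global frame
        # rotate this chunk so its anchor image matches the reference anchors
        R, _ = orthogonal_procrustes(A - A.mean(0), ref - ref.mean(0))
        out[s:s + chunk] = (Z - A.mean(0)) @ R + ref.mean(0)
    return out
```

Each chunk only ever requires an (n_anchors + chunk)-sized distance matrix, which is what keeps memory bounded as n grows.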
{"title":"Multidimensional scaling for big data","authors":"Pedro Delicado, Cristian Pachón-García","doi":"10.1007/s11634-024-00591-9","DOIUrl":"https://doi.org/10.1007/s11634-024-00591-9","url":null,"abstract":"<p>We present a set of algorithms implementing multidimensional scaling (MDS) for large data sets. MDS is a family of dimensionality reduction techniques using a <span>(n times n)</span> distance matrix as input, where <i>n</i> is the number of individuals, and producing a low dimensional configuration: a <span>(ntimes r)</span> matrix with <span>(r<<n)</span>. When <i>n</i> is large, MDS is unaffordable with classical MDS algorithms because their extremely large memory and time requirements. We compare six non-standard algorithms intended to overcome these difficulties. They are based on the central idea of partitioning the data set into small pieces, where classical MDS methods can work. Two of these algorithms are original proposals. In order to check the performance of the algorithms as well as to compare them, we have done a simulation study. Additionally, we have used the algorithms to obtain an MDS configuration for EMNIST: a real large data set with more than 800000 points. We conclude that all the algorithms are appropriate to use for obtaining an MDS configuration, but we recommend to use one of our proposals, since it is a fast algorithm with satisfactory statistical properties when working with big data. An <span>R</span> package implementing the algorithms has been created.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"26 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140588671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
View selection in multi-view stacking: choosing the meta-learner
Pub Date: 2024-04-12 | DOI: 10.1007/s11634-024-00587-5
Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij
Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a base-learner algorithm is trained on each view separately, and their predictions are then combined by a meta-learner algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, was shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantage over the other three.
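A minimal stacking sketch with the nonnegative lasso as the meta-learner (scikit-learn; the choice of base learner, fold count and penalty strength are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import cross_val_predict

def multi_view_stack(views, y, alpha=0.01):
    # views: list of (n, p_v) feature matrices describing the same n objects.
    # Base learners: one logistic regression per view, with out-of-fold
    # predictions so the meta-learner never sees resubstitution estimates.
    Z = np.column_stack([
        cross_val_predict(LogisticRegression(max_iter=1000), V, y,
                          cv=5, method="predict_proba")[:, 1]
        for V in views
    ])
    # Nonnegative lasso meta-learner: a coefficient shrunk to exactly
    # zero means the corresponding view is deselected.
    meta = Lasso(alpha=alpha, positive=True).fit(Z, y)
    return meta.coef_, meta
```

The nonnegativity constraint plus the sparsity of the lasso penalty is what ties the meta-learner’s coefficients to view selection: selected views get positive weights, discarded views get exact zeros.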
{"title":"View selection in multi-view stacking: choosing the meta-learner","authors":"Wouter van Loon, Marjolein Fokkema, Botond Szabo, Mark de Rooij","doi":"10.1007/s11634-024-00587-5","DOIUrl":"https://doi.org/10.1007/s11634-024-00587-5","url":null,"abstract":"<p>Multi-view stacking is a framework for combining information from different views (i.e. different feature sets) describing the same set of objects. In this framework, a <i>base-learner</i> algorithm is trained on each view separately, and their predictions are then combined by a <i>meta-learner</i> algorithm. In a previous study, stacked penalized logistic regression, a special case of multi-view stacking, has been shown to be useful in identifying which views are most important for prediction. In this article we expand this research by considering seven different algorithms to use as the meta-learner, and evaluating their view selection and classification performance in simulations and two applications on real gene-expression data sets. Our results suggest that if both view selection and classification accuracy are important to the research at hand, then the nonnegative lasso, nonnegative adaptive lasso and nonnegative elastic net are suitable meta-learners. Exactly which among these three is to be preferred depends on the research context. The remaining four meta-learners, namely nonnegative ridge regression, nonnegative forward selection, stability selection and the interpolating predictor, show little advantages in order to be preferred over the other three.\u0000</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"438 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data
Pub Date: 2024-03-30 | DOI: 10.1007/s11634-024-00589-3
Payel Sadhukhan, Sarbani Palit
This work presents a novel undersampling scheme, NaNUML, to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. The natural nearest neighborhood is a parameter-free principle, and our scheme’s novelty lies in exploiting it: the key factor, the neighborhood size k, is determined without invoking any parameter optimization. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label; consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework in which a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluation metrics shows that NaNUML mitigates the class-imbalance issue in multi-label datasets to a considerable extent and achieves statistically superior performance over the competing methods in several settings.
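One common formulation of the parameter-free natural-neighbor search grows the neighborhood size k until the set of points that appear in no other point’s k-NN list stops shrinking; a sketch follows (NaNUML’s exact search may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def natural_neighbor_search(X, max_k=50):
    # Grow k until every point occurs in someone's k-NN list, or until
    # the number of such "orphan" points stops shrinking.
    n = len(X)
    nn = NearestNeighbors().fit(X)
    prev_orphans = n + 1
    for k in range(1, max_k + 1):
        # each point's k nearest neighbors, excluding the point itself
        idx = nn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, 1:]
        has_reverse = np.zeros(n, dtype=bool)
        has_reverse[np.unique(idx)] = True   # appears in some k-NN list
        orphans = int((~has_reverse).sum())
        if orphans == 0 or orphans == prev_orphans:
            return k, idx                    # natural k and the k-NN lists
        prev_orphans = orphans
    return max_k, idx
```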
{"title":"Natural-neighborhood based, label-specific undersampling for imbalanced, multi-label data","authors":"Payel Sadhukhan, Sarbani Palit","doi":"10.1007/s11634-024-00589-3","DOIUrl":"10.1007/s11634-024-00589-3","url":null,"abstract":"<div><p>This work presents a novel undersampling scheme to tackle the imbalance problem in multi-label datasets. We use the principles of the natural nearest neighborhood and follow a paradigm of label-specific undersampling. Natural-nearest neighborhood is a parameter-free principle. Our scheme’s novelty lies in exploring the parameter-optimization-free natural nearest neighborhood principles. The class imbalance problem is particularly challenging in a multi-label context, as the imbalance ratio and the majority–minority distributions vary from label to label. Consequently, the majority–minority class overlaps also vary across the labels. Working on this aspect, we propose a framework where a single natural neighbor search is sufficient to identify all the label-specific overlaps. Natural neighbor information is also used to find the key lattices of the majority class (which we do not undersample). The performance of the proposed method, NaNUML, indicates its ability to mitigate the class-imbalance issue in multi-label datasets to a considerable extent. We could also establish a statistically superior performance over other competing methods several times. An empirical study involving twelve real-world multi-label datasets, seven competing methods, and four evaluating metrics—shows that the proposed method effectively handles the class-imbalance issue in multi-label datasets. In this work, we have presented a novel label-specific undersampling scheme, NaNUML, for multi-label datasets. NaNUML is based on the parameter-free natural neighbor search and the key factor, neighborhood size ‘k’ is determined without invoking any parameter optimization.</p></div>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"18 3","pages":"723 - 744"},"PeriodicalIF":1.4,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140363801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering ensemble extraction: a knowledge reuse framework
Pub Date: 2024-03-27 | DOI: 10.1007/s11634-024-00588-4
Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam
A clustering ensemble combines several base clusterings with a consensus function to produce the final clustering without access to the data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a large library of diverse clusterings is not available, this function produces results of lower quality than the base clusterings themselves. Expanding the diversity of clusters in the collection to increase the performance of consensus, especially when there is no access to specific data features or assumptions on the data distribution, remains an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. It then extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely the Cluster-based extracted partitioning algorithm and the Meta-cluster extracted algorithm, are defined and applied to the new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus functions obtained by our proposed method outperformed methods previously proposed in the literature regarding clustering quality and efficiency.
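A generic sketch of this cluster-level grouping (Jaccard similarity between base clusters, hierarchical grouping, and a simple membership vote; illustrative only, not the paper’s exact consensus functions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ensemble_extract(partitions, k):
    # partitions: list of label vectors over the same n objects
    # 1. represent every base cluster as a binary membership vector
    M = np.array([(p == c).astype(float)
                  for p in map(np.asarray, partitions)
                  for c in np.unique(p)])              # (n_base_clusters, n)
    # 2. Jaccard similarity between base clusters
    inter = M @ M.T
    sizes = M.sum(axis=1)
    jac = inter / (sizes[:, None] + sizes[None, :] - inter)
    # 3. group the most similar base clusters into (at most) k meta-clusters
    Z = linkage(1.0 - jac[np.triu_indices(len(M), 1)], method="average")
    groups = fcluster(Z, t=k, criterion="maxclust")
    # 4. each object joins the meta-cluster that votes for it most strongly
    votes = np.vstack([M[groups == g].mean(axis=0) for g in np.unique(groups)])
    return votes.argmax(axis=0)
```

Note that only label vectors enter the computation; no data features are touched, which is the defining constraint of the clustering-ensemble setting.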
{"title":"Clustering ensemble extraction: a knowledge reuse framework","authors":"Mohaddeseh Sedghi, Ebrahim Akbari, Homayun Motameni, Touraj Banirostam","doi":"10.1007/s11634-024-00588-4","DOIUrl":"https://doi.org/10.1007/s11634-024-00588-4","url":null,"abstract":"<p>Clustering ensemble combines several fundamental clusterings with a consensus function to produce the final clustering without gaining access to data features. The quality and diversity of a vast library of base clusterings influence the performance of the consensus function. When a huge library of various clusterings is not available, this function produces results of lower quality than those of the basic clustering. The expansion of diverse clusters in the collection to increase the performance of consensus, especially in cases where there is no access to specific data features or assumptions in the data distribution, has still remained an open problem. The approach proposed in this paper, Clustering Ensemble Extraction, considers the similarity criterion at the cluster level and places the most similar clusters in the same group. Then, it extracts new clusters with the help of the Extracting Clusters Algorithm. Finally, two new consensus functions, namely Cluster-based extracted partitioning algorithm and Meta-cluster extracted algorithm, are defined and then applied to new clusters in order to create a high-quality clustering. The results of the empirical experiments conducted in this study showed that the new consensus function obtained by our proposed method outperformed the methods previously proposed in the literature regarding the clustering quality and efficiency.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"68 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140312089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixtures of regressions using matrix-variate heavy-tailed distributions
Pub Date: 2024-03-16 | DOI: 10.1007/s11634-024-00585-7
Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption for the mixture components inadequate. Thus, to robustify the FMR approach in a matrix-variate framework, we introduce ten FMRs based on the matrix-variate t and contaminated normal distributions. Furthermore, once one of our models is estimated and the observations are assigned to the groups, different procedures can be used for the detection of atypical points in the data. An ECM algorithm is outlined for maximum likelihood parameter estimation. Using simulated data, we show the negative consequences (in terms of parameter estimates and inferred classification) of wrongly assuming normality in the presence of heavy-tailed clusters or noisy matrices; our models, in contrast, address such issues properly. Additionally, on the same data, the atypical-point detection procedures are investigated. A real-data analysis concerning the relationship between greenhouse gas emissions and their determinants is conducted, and the behavior of our models in the presence of heterogeneity and atypical observations is discussed.
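For reference, the matrix-variate normal density that these heavy-tailed components generalize, together with the generic mixture-of-regressions form (notation assumed here, not copied from the paper):

```latex
% Matrix-variate normal density for an n x p response Y with mean M,
% row covariance U (n x n) and column covariance V (p x p):
\[
  \phi_{n \times p}(Y; M, U, V)
  = \frac{\exp\!\left\{ -\tfrac{1}{2}\,
      \operatorname{tr}\!\left[ U^{-1} (Y - M)\, V^{-1} (Y - M)^{\top} \right] \right\}}
         {(2\pi)^{np/2}\, |U|^{p/2}\, |V|^{n/2}} .
\]
% A G-component matrix-variate FMR then models, given covariates X,
\[
  f(Y \mid X) \;=\; \sum_{g=1}^{G} \pi_g\, f_g\!\left(Y; M_g(X), U_g, V_g\right),
  \qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 ,
\]
% where each f_g is a heavy-tailed component (matrix-variate t or
% contaminated normal in the paper) in place of the normal density above.
```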
{"title":"Mixtures of regressions using matrix-variate heavy-tailed distributions","authors":"","doi":"10.1007/s11634-024-00585-7","DOIUrl":"https://doi.org/10.1007/s11634-024-00585-7","url":null,"abstract":"<h3>Abstract</h3> <p>Finite mixtures of regressions (FMRs) are powerful clustering devices used in many regression-type analyses. Unfortunately, real data often present atypical observations that make the commonly adopted normality assumption of the mixture components inadequate. Thus, to robustify the FMR approach in a matrix-variate framework, we introduce ten FMRs based on the matrix-variate <em>t</em> and contaminated normal distributions. Furthermore, once one of our models is estimated and the observations are assigned to the groups, different procedures can be used for the detection of the atypical points in the data. An ECM algorithm is outlined for maximum likelihood parameter estimation. By using simulated data, we show the negative consequences (in terms of parameter estimates and inferred classification) of the wrong normality assumption in the presence of heavy-tailed clusters or noisy matrices. Such issues are properly addressed by our models instead. Additionally, over the same data, the atypical points detection procedures are also investigated. A real-data analysis concerning the relationship between greenhouse gas emissions and their determinants is conducted, and the behavior of our models in the presence of heterogeneity and atypical observations is discussed.</p>","PeriodicalId":49270,"journal":{"name":"Advances in Data Analysis and Classification","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}