Pub Date: 2023-04-13 DOI: 10.1007/s00357-023-09433-3
Marilena Furno
{"title":"Computing Finite Mixture Estimators in the Tails","authors":"Marilena Furno","doi":"10.1007/s00357-023-09433-3","DOIUrl":"https://doi.org/10.1007/s00357-023-09433-3","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"267 - 297"},"PeriodicalIF":2.0,"publicationDate":"2023-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47624445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-04-04 DOI: 10.1007/s00357-023-09432-4
Roberto Di Mari, Salvatore Ingrassia, Antonio Punzo
In generalized linear models (GLMs), measures of lack of fit are typically defined as the deviance between two nested models, and a deviance-based R² is commonly used to evaluate the fit. In this paper, we extend deviance measures to mixtures of GLMs, whose parameters are estimated by maximum likelihood (ML) via the EM algorithm. Such measures are defined both locally, i.e., at the cluster level, and globally, i.e., with reference to the whole sample. At the cluster level, we propose a normalized two-term decomposition of the local deviance into explained and unexplained local deviances. At the sample level, we introduce an additive normalized decomposition of the total deviance into three terms, each of which evaluates a different aspect of the fitted model: (1) the cluster separation on the dependent variable, (2) the proportion of the total deviance explained by the fitted model, and (3) the proportion of the total deviance that remains unexplained. We use the local and global decompositions to define, respectively, local and overall deviance R² measures for mixtures of GLMs, which we illustrate for Gaussian, Poisson and binomial responses by means of a simulation study. The proposed fit measures are then used to assess and interpret clusters of COVID-19 spread in Italy at two time points.
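As background for the abstract above, the following is a minimal sketch of the ordinary deviance-based R² for a single (non-mixture) Poisson GLM, computed as one minus the ratio of the fitted-model deviance to the null-model deviance. It is not the local or overall decomposition proposed in the paper. The function names (`poisson_deviance`, `fit_poisson_glm`, `deviance_r2`), the simulated data, and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance 2 * sum(y*log(y/mu) - (y - mu)), taking 0*log(0) as 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)  # log(1) = 0 handles the y == 0 terms
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

def fit_poisson_glm(X, y, n_iter=50):
    """Fit a log-link Poisson GLM by Newton-Raphson; column 0 of X is the intercept."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the intercept-only (null) fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)              # score vector
        hess = X.T @ (X * mu[:, None])     # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def deviance_r2(y, mu_fitted):
    """Deviance R^2 = 1 - D(fitted) / D(null), with the null model mu = mean(y)."""
    y = np.asarray(y, dtype=float)
    mu_null = np.full_like(y, y.mean())
    return 1.0 - poisson_deviance(y, mu_fitted) / poisson_deviance(y, mu_null)

# Simulated example (hypothetical coefficients 0.5 and 1.0).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 1.0 * x))

beta_hat = fit_poisson_glm(X, y)
r2 = deviance_r2(y, np.exp(X @ beta_hat))
r2_null = deviance_r2(y, np.full(len(y), y.mean()))
print(f"deviance R^2 of fitted model: {r2:.3f}")
print(f"deviance R^2 of null model:   {r2_null:.3f}")
```

The null model reproduces its own deviance, so its R² is zero by construction; a fitted model that includes an intercept cannot have a larger deviance than the null model, which keeps the measure between 0 and 1.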
{"title":"Local and Overall Deviance R-Squared Measures for Mixtures of Generalized Linear Models.","authors":"Roberto Di Mari, Salvatore Ingrassia, Antonio Punzo","doi":"10.1007/s00357-023-09432-4","DOIUrl":"10.1007/s00357-023-09432-4","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":" ","pages":"1-34"},"PeriodicalIF":2.0,"publicationDate":"2023-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10071261/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9768843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-04-03 DOI: 10.1007/s00357-023-09435-1
J. T. Temple, R. Bateman
{"title":"Characteristics of Distance Matrices Based on Euclidean, Manhattan and Hausdorff Coefficients","authors":"J. T. Temple, R. Bateman","doi":"10.1007/s00357-023-09435-1","DOIUrl":"https://doi.org/10.1007/s00357-023-09435-1","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"214 - 232"},"PeriodicalIF":2.0,"publicationDate":"2023-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46807267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-23 DOI: 10.1007/s00357-023-09431-5
Trent Geisler, Herman Ray, Ying Xie
{"title":"Finding the Proverbial Needle: Improving Minority Class Identification Under Extreme Class Imbalance","authors":"Trent Geisler, Herman Ray, Ying Xie","doi":"10.1007/s00357-023-09431-5","DOIUrl":"https://doi.org/10.1007/s00357-023-09431-5","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"192-212"},"PeriodicalIF":2.0,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46841940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-02-16 DOI: 10.1007/s00357-023-09430-6
L. Diao, Grace Y. Yi
{"title":"Classification Trees with Mismeasured Responses","authors":"L. Diao, Grace Y. Yi","doi":"10.1007/s00357-023-09430-6","DOIUrl":"https://doi.org/10.1007/s00357-023-09430-6","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"168-191"},"PeriodicalIF":2.0,"publicationDate":"2023-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44301135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-21 DOI: 10.1007/s00357-022-09429-5
R. Simone
{"title":"Uncertainty Diagnostics of Binomial Regression Trees for Ordered Rating Data","authors":"R. Simone","doi":"10.1007/s00357-022-09429-5","DOIUrl":"https://doi.org/10.1007/s00357-022-09429-5","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"79-105"},"PeriodicalIF":2.0,"publicationDate":"2023-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47005228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2023-01-01 DOI: 10.1007/s00357-022-09428-6
Marian Lux, Stefanie Rinderle-Ma
This work studies the problem of clustering one-dimensional data points such that they are evenly distributed over a given number of low-variance clusters. One application is the visualization of data on choropleth maps or on business process models without over-emphasizing outliers, which enables the detection and differentiation of smaller clusters. The problem is tackled with a heuristic algorithm called DDCAL (1d distribution cluster algorithm), which relies on iterative feature scaling and generates stable clustering results. The effectiveness of the DDCAL algorithm is demonstrated on 5 artificial data sets with different distributions and 4 real-world data sets reflecting different use cases. Moreover, the results of DDCAL on these data sets are compared to those of 11 existing clustering algorithms. The application of the DDCAL algorithm is illustrated through the visualization of pandemic and population data on choropleth maps as well as process mining results on process models.
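The two criteria named in the abstract above (evenly sized clusters and low within-cluster variance) can be made concrete with a small sketch. The code below is not DDCAL and does not use iterative feature scaling; it only compares two standard one-dimensional binning baselines, equal-width and quantile binning, on skewed simulated data. The helper names (`equal_width_labels`, `quantile_labels`, `summarize`) and the lognormal test data are assumptions made for illustration.

```python
import numpy as np

def equal_width_labels(values, k):
    """Assign each value to one of k equal-width intervals (labels 0..k-1)."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1])  # interior edges only

def quantile_labels(values, k):
    """Assign each value to one of k quantile-based intervals (labels 0..k-1)."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, k + 1))
    return np.digitize(values, edges[1:-1])

def summarize(values, labels, k):
    """Return per-cluster sizes and per-cluster variances (0.0 for empty clusters)."""
    sizes = np.bincount(labels, minlength=k)
    variances = np.array(
        [values[labels == j].var() if sizes[j] > 0 else 0.0 for j in range(k)]
    )
    return sizes, variances

# Skewed data with a long right tail, where outliers distort equal-width bins.
rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
k = 5

sizes_w, var_w = summarize(data, equal_width_labels(data, k), k)
sizes_q, var_q = summarize(data, quantile_labels(data, k), k)

print("equal-width sizes:", sizes_w)
print("quantile sizes:   ", sizes_q)
print("equal-width variances:", np.round(var_w, 3))
print("quantile variances:   ", np.round(var_q, 3))
```

Equal-width binning lets the tail dictate the interval boundaries, so most points fall into the first cluster, while quantile binning equalizes sizes at the cost of a high-variance last cluster. A method with the goals described in the abstract has to trade off between these two extremes.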
{"title":"DDCAL: Evenly Distributing Data into Low Variance Clusters Based on Iterative Feature Scaling.","authors":"Marian Lux, Stefanie Rinderle-Ma","doi":"10.1007/s00357-022-09428-6","DOIUrl":"https://doi.org/10.1007/s00357-022-09428-6","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"106-144"},"PeriodicalIF":2.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9873542/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9476660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-12-14 DOI: 10.1007/s00357-022-09425-9
M. Salehi, A. Bekker, M. Arashi
{"title":"A Semi-parametric Density Estimation with Application in Clustering","authors":"M. Salehi, A. Bekker, M. Arashi","doi":"10.1007/s00357-022-09425-9","DOIUrl":"https://doi.org/10.1007/s00357-022-09425-9","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"52-78"},"PeriodicalIF":2.0,"publicationDate":"2022-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48188739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-12-07 DOI: 10.1007/s00357-022-09424-w
Sangkon Oh, Byungtae Seo
{"title":"Merging Components in Linear Gaussian Cluster-Weighted Models","authors":"Sangkon Oh, Byungtae Seo","doi":"10.1007/s00357-022-09424-w","DOIUrl":"https://doi.org/10.1007/s00357-022-09424-w","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"25-51"},"PeriodicalIF":2.0,"publicationDate":"2022-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49126059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-26 DOI: 10.1007/s00357-022-09422-y
Rabea Aschenbruck, G. Szepannek, A. Wilhelm
{"title":"Imputation Strategies for Clustering Mixed-Type Data with Missing Values","authors":"Rabea Aschenbruck, G. Szepannek, A. Wilhelm","doi":"10.1007/s00357-022-09422-y","DOIUrl":"https://doi.org/10.1007/s00357-022-09422-y","url":null,"abstract":"","PeriodicalId":50241,"journal":{"name":"Journal of Classification","volume":"40 1","pages":"2-24"},"PeriodicalIF":2.0,"publicationDate":"2022-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46679720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}