Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok
Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high‐dimensional datasets, such as omics and medical image data. However, the literature on nonlinear regression algorithms and variable selection techniques for interval‐censoring is either limited or nonexistent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval‐censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: (i) a variable selection phase leveraging recent advances on sparse neural network architectures; (ii) a regression model targeting prediction of the interval‐censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real‐world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring nonlinear relationships.
{"title":"Neural interval‐censored survival regression with feature selection","authors":"Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok","doi":"10.1002/sam.11704","DOIUrl":"https://doi.org/10.1002/sam.11704","url":null,"abstract":"Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high‐dimensional datasets, such as omics and medical image data. However, the literature on nonlinear regression algorithms and variable selection techniques for interval‐censoring is either limited or nonexistent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval‐censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: (i) a variable selection phase leveraging recent advances on sparse neural network architectures; (ii) a regression model targeting prediction of the interval‐censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real‐world applications that encompass scenarios related to diabetes and physical activity. 
Our results outperform traditional AFT algorithms, particularly in scenarios featuring nonlinear relationships.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"87 18","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141642725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
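As a rough illustration of the interval‐censored AFT setting described in the abstract above, the sketch below evaluates the log‐likelihood of a log‐normal AFT model when each event time is only known to lie in an interval. The Gaussian error distribution, the function names, and the precomputed predicted log‐times `mu` are assumptions made for illustration. The paper's neural architecture and its variable selection phase are not reproduced here.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_censored_loglik(mu, sigma, left, right):
    """Log-likelihood of a log-normal AFT model, log T = mu + sigma * eps.

    mu    : predicted log-times, one per subject (assumed to come from any regressor)
    sigma : common scale parameter (> 0)
    left  : left interval endpoints (0 means left-censored)
    right : right interval endpoints (math.inf means right-censored)
    """
    total = 0.0
    for m, lo, hi in zip(mu, left, right):
        upper = 1.0 if math.isinf(hi) else norm_cdf((math.log(hi) - m) / sigma)
        lower = 0.0 if lo <= 0.0 else norm_cdf((math.log(lo) - m) / sigma)
        # Probability that the event time falls inside (lo, hi]; floored for numerical safety.
        total += math.log(max(upper - lower, 1e-300))
    return total
```

In such a sketch, `mu` could be produced by any model (linear or neural), and the negative of this quantity could serve as a training loss.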
N. Vazirani, Ryan Sacks, Brian M. Haines, Michael J. Grosskopf, David J. Stark, Paul A. Bradley
Access to reliable, clean energy sources is a major concern for national security. Much research is focused on the “grand challenge” of producing energy via controlled fusion reactions in a laboratory setting. For fusion experiments, specifically inertial confinement fusion (ICF), to produce sufficient energy, the fusion reactions in the ICF fuel need to become self‐sustaining and burn deuterium‐tritium (DT) fuel efficiently. The recent record‐breaking NIF ignition shot was able to achieve this goal as well as produce more energy than used to drive the experiment. This achievement brings self‐sustaining fusion‐based power systems closer than ever before, capable of providing humans with access to secure, renewable energy. In order to further progress toward the actualization of such power systems, more ICF experiments need to be conducted at large laser facilities such as the United States's National Ignition Facility (NIF) or France's Laser Mega‐Joule. The high cost per shot and limited number of shots that are possible per year make it prohibitive to perform large numbers of experiments. As such, experimental design relies heavily on complex predictive physics simulations for high‐fidelity “preshot” analysis. These multidimensional, multi‐physics, high‐fidelity simulations have to account for a variety of input parameters as well as modeling the extreme conditions (pressures and densities) present at ignition. Such simulations (especially in 3D) can become computationally prohibitive to turn around for each ICF experiment. In this work, we explore using Bayesian optimization with Gaussian processes (GPs) to find optimal designs for ICF double shell targets, while keeping computational costs to manageable levels. 
These double shell targets have an inner shell that grades from beryllium on the outer surface to the higher Z material molybdenum, as opposed to the nominally used tungsten, on the inside in order to trade off between the high performance associated with high density inner shells and capsule stability. We describe our results for “capsule‐only” xRAGE simulations to study the physics between different capsule designs, inner shell materials, and potential for future experiments.
{"title":"Bayesian batch optimization for molybdenum versus tungsten inertial confinement fusion double shell target design","authors":"N. Vazirani, Ryan Sacks, Brian M. Haines, Michael J. Grosskopf, David J. Stark, Paul A. Bradley","doi":"10.1002/sam.11698","DOIUrl":"https://doi.org/10.1002/sam.11698","url":null,"abstract":"Access to reliable, clean energy sources is a major concern for national security. Much research is focused on the “grand challenge” of producing energy via controlled fusion reactions in a laboratory setting. For fusion experiments, specifically inertial confinement fusion (ICF), to produce sufficient energy, the fusion reactions in the ICF fuel need to become self‐sustaining and burn deuterium‐tritium (DT) fuel efficiently. The recent record‐breaking NIF ignition shot was able to achieve this goal as well as produce more energy than used to drive the experiment. This achievement brings self‐sustaining fusion‐based power systems closer than ever before, capable of providing humans with access to secure, renewable energy. In order to further progress toward the actualization of such power systems, more ICF experiments need to be conducted at large laser facilities such as the United States's National Ignition Facility (NIF) or France's Laser Mega‐Joule. The high cost per shot and limited number of shots that are possible per year make it prohibitive to perform large numbers of experiments. As such, experimental design relies heavily on complex predictive physics simulations for high‐fidelity “preshot” analysis. These multidimensional, multi‐physics, high‐fidelity simulations have to account for a variety of input parameters as well as modeling the extreme conditions (pressures and densities) present at ignition. Such simulations (especially in 3D) can become computationally prohibitive to turn around for each ICF experiment. 
In this work, we explore using Bayesian optimization with Gaussian processes (GPs) to find optimal designs for ICF double shell targets, while keeping computational costs to manageable levels. These double shell targets have an inner shell that grades from beryllium on the outer surface to the higher Z material molybdenum, as opposed to the nominally used tungsten, on the inside in order to trade off between the high performance associated with high density inner shells and capsule stability. We describe our results for “capsule‐only” xRAGE simulations to study the physics between different capsule designs, inner shell materials, and potential for future experiments.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141402295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
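To make the Bayesian optimization step mentioned in the abstract above more concrete, here is a minimal sketch of how a candidate design could be scored from a Gaussian process posterior. The abstract does not state which acquisition function is used, so expected improvement is an assumption here, as are the function names and the `(design_id, mean, std)` tuples. The GP fit and the xRAGE simulations themselves are outside the scope of the sketch.

```python
import math

def expected_improvement(mu, sigma, best):
    """Expected improvement (maximization) for a Gaussian posterior N(mu, sigma^2).

    best is the best objective value observed so far.
    """
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

def pick_next(candidates, best):
    """Return the id of the candidate with the largest expected improvement.

    candidates: list of (design_id, posterior_mean, posterior_std) tuples (hypothetical format).
    """
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]
```

A batch variant could call `pick_next` repeatedly while updating the posterior between picks, which is one common way to keep the number of expensive simulations small.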
Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers, where a gene pathway is a set of genes that work together functionally to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to a pathway as a set and a gene as an element of a set. However, it is challenging to select automatically which pathways are highly associated with the outcome when there are multiple pathways. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes), addressing the problem of detecting signal sets associated with biomarkers. We model the unknown high‐dimensional functions of multi‐sets via multiple Gaussian kernel machines to allow for the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge for structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.
{"title":"Gaussian process selections in semiparametric multi‐kernel machine regression for multi‐pathway analysis","authors":"Jiali Lin, Inyoung Kim","doi":"10.1002/sam.11699","DOIUrl":"https://doi.org/10.1002/sam.11699","url":null,"abstract":"Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers in which a gene pathway is a set of genes that functionally works together to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to pathway as a set and gene as an element in a set. However, it is challenging to select automatically which pathways are highly associated to the outcome when there are multiple pathways. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes) to address a problem of detecting signal sets associated to biomarkers. We model the unknown high‐dimension functions of multi‐sets via multiple Gaussian kernel machines to consider the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge for structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. 
We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"60 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141409881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce an algorithmic approach designed to compare similar shoeprint images, with automated alignment. Our method employs the Iterative Closest Points (ICP) algorithm to attain optimal alignment, further enhancing precision through phase‐only correlation. Utilizing diverse metrics to quantify similarity, we train a random forest model to predict the empirical probability that two impressions originate from the same shoe. Experimental evaluations using high‐quality two‐dimensional shoeprints showcase our proposed algorithm's robustness in managing dissimilarities between impressions from the same shoe, outperforming existing approaches.
{"title":"An automated alignment algorithm for identification of the source of footwear impressions with common class characteristics","authors":"Hana Lee, Alicia Carriquiry, Soyoung Park","doi":"10.1002/sam.11659","DOIUrl":"https://doi.org/10.1002/sam.11659","url":null,"abstract":"We introduce an algorithmic approach designed to compare similar shoeprint images, with automated alignment. Our method employs the Iterative Closest Points (ICP) algorithm to attain optimal alignment, further enhancing precision through phase‐only correlation. Utilizing diverse metrics to quantify similarity, we train a random forest model to predict the empirical probability that two impressions originate from the same shoe. Experimental evaluations using high‐quality two‐dimensional shoeprints showcase our proposed algorithm's robustness in managing dissimilarities between impressions from the same shoe, outperforming existing approaches.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"97 ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140484979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
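The alignment step named in the abstract above (Iterative Closest Points) can be sketched on plain 2D point sets. This is a simplified illustration under stated assumptions: shoeprints are reduced to lists of `(x, y)` points, correspondences are found by brute‐force nearest neighbour, and all function names are hypothetical. The phase‐only correlation refinement, the similarity metrics, and the random forest described in the abstract are not included.

```python
import math

def nearest(p, pts):
    """Return the point in pts closest to p (brute force)."""
    return min(pts, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def rigid_fit(src, dst):
    """Least-squares rotation angle and translation mapping paired src points onto dst."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy   # centred source point
        bx, by = qx - dx, qy - dy   # centred target point
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    theta = math.atan2(cross, dot)  # closed-form optimal 2D rotation
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(pts, theta, tx, ty):
    """Rotate by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]

def icp(src, dst, iterations=10):
    """Alternate nearest-neighbour matching and rigid fitting for a fixed number of rounds."""
    current = list(src)
    for _ in range(iterations):
        matches = [nearest(p, dst) for p in current]
        theta, tx, ty = rigid_fit(current, matches)
        current = transform(current, theta, tx, ty)
    return current
```

ICP only converges to a good alignment when the initial misalignment is small enough for the nearest‐neighbour matches to be mostly correct, which is presumably one reason a refinement step is useful afterwards.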
The problem of uncertainty in identifying graph structure in a random variable network is considered. An approach to constructing upper and lower confidence bounds for graph structures is developed and applied to the threshold similarity graph. The stability of the confidence bounds and the gaps between the upper and lower bounds are investigated. Theoretical results are illustrated by numerical experiments.
{"title":"Confidence bounds for threshold similarity graph in random variable network","authors":"P. Koldanov, A. Koldanov, D. P. Semenov","doi":"10.1002/sam.11642","DOIUrl":"https://doi.org/10.1002/sam.11642","url":null,"abstract":"Problem of uncertainty of graph structure identification in random variable network is considered. An approach for the construction of upper and lower confidence bounds for graph structures is developed. This approach is applied for the construction of upper and lower confidence bounds for the threshold similarity graph. The stability of confidence bounds and gaps between upper and lower confidence bounds are investigated. Theoretical results are illustrated by numerical experiments.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"50 15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123519121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
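One way to picture upper and lower confidence bounds for a threshold similarity graph is sketched below. The construction is an assumption chosen for illustration, not the procedure developed in the paper: similarities are taken to be Pearson correlations, each edge gets a Fisher z interval, and a Bonferroni correction is applied across edges. The function name and the `{(i, j): r}` input format are hypothetical.

```python
import math
from statistics import NormalDist

def threshold_graph_bounds(corr, n, threshold, alpha=0.05):
    """Return (lower_graph, upper_graph) edge sets for the threshold similarity graph.

    corr      : dict mapping an edge (i, j) to a sample correlation strictly inside (-1, 1)
    n         : sample size used to estimate every correlation (n > 3)
    threshold : an edge belongs to the true graph when its correlation exceeds this value
    """
    m = len(corr)
    # Two-sided critical value with a Bonferroni correction over the m edges.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))
    half_width = z_crit / math.sqrt(n - 3)
    lower, upper = set(), set()
    for edge, r in corr.items():
        z = math.atanh(r)  # Fisher z-transform
        if math.tanh(z - half_width) > threshold:
            lower.add(edge)  # edge present even at the low end of its interval
        if math.tanh(z + half_width) > threshold:
            upper.add(edge)  # edge cannot be ruled out
    return lower, upper
```

By construction the lower graph is contained in the upper graph, and the edges in the difference are those whose membership the data cannot settle at the chosen level.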
To address the problems of pattern collapse, uncontrollable data generation, and high overlap rate that arise when a generative adversarial network (GAN) oversamples imbalanced data, we propose an imbalanced data oversampling algorithm based on improved dual discriminator generative adversarial nets (D2GAN). First, we integrate the positive class attribute information into the generator and the discriminator to ensure that the generator only generates samples of the positive class, which overcomes the problem of uncontrollable data generation. Second, we introduce a classifier into D2GAN to discriminate between the generated samples and the original data. This avoids overlap between the generated samples and the negative class samples and ensures the diversity of the generated samples, which solves the problem of pattern collapse. Finally, the proposed algorithm is evaluated in oversampling experiments on 9 datasets using SVM and neural network classifiers. The results show that the proposed algorithm effectively improves the classification performance on imbalanced data.
{"title":"An Improved D2GAN‐based oversampling algorithm for imbalanced data classification","authors":"Xiaoqiang Zhao, Qi Yao","doi":"10.1002/sam.11640","DOIUrl":"https://doi.org/10.1002/sam.11640","url":null,"abstract":"To address the problems of pattern collapse, uncontrollable data generation and high overlap rate when generative adversarial network (GAN) oversamples imbalanced data, we propose an imbalanced data oversampling algorithm based on improved dual discriminator generative adversarial nets (D2GAN). First, we integrate the positive class attribute information into the generator and the discriminator to ensure that the generator only generates the samples for positive class samples, which overcomes the problem of uncontrollable data generation by the generator. Second, we introduce a classifier into D2GAN for discriminating the generated samples and the original data, which avoids the overlap among the generated samples and the negative class samples, and ensures the diversity of the generated samples, the problem of pattern collapse is solved. Finally, the performance of the proposed algorithm is evaluated on 9 datasets by using SVM and neural network classification algorithm for oversampling experiments, the results show that the proposed algorithm effectively improve the classification performance of imbalanced data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123666575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dylan C. Friel, Yunzhe Li, Benjamin Ellis, D. Jeske, Herbert K. H. Lee, P. Kass
A classifier may be limited by its conditional misclassification rates more than its overall misclassification rate. In the case that one or more of the conditional misclassification rates are high, a neutral zone may be introduced to decrease and possibly balance the misclassification rates. In this paper, a neutral zone is incorporated into a three‐class classifier with its region determined by controlling conditional misclassification rates. The neutral zone classifier is illustrated with a text mining application that classifies written comments associated with student evaluations of teaching.
{"title":"A neutral zone classifier for three classes with an application to text mining","authors":"Dylan C. Friel, Yunzhe Li, Benjamin Ellis, D. Jeske, Herbert K. H. Lee, P. Kass","doi":"10.1002/sam.11639","DOIUrl":"https://doi.org/10.1002/sam.11639","url":null,"abstract":"A classifier may be limited by its conditional misclassification rates more than its overall misclassification rate. In the case that one or more of the conditional misclassification rates are high, a neutral zone may be introduced to decrease and possibly balance the misclassification rates. In this paper, a neutral zone is incorporated into a three‐class classifier with its region determined by controlling conditional misclassification rates. The neutral zone classifier is illustrated with a text mining application that classifies written comments associated with student evaluations of teaching.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126453612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
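The neutral zone idea in the abstract above can be sketched as follows. The rule used here, a single threshold on the largest posterior probability, is an assumption for illustration; the paper determines the neutral region by controlling conditional misclassification rates, which this sketch only measures. The rates are taken conditional on the true class, and the function names are hypothetical.

```python
def classify_with_neutral_zone(posteriors, threshold=0.6):
    """Assign the most probable class, or "neutral" when that probability is below threshold.

    posteriors: list of per-item class probability lists (three classes in the paper's setting).
    """
    labels = []
    for probs in posteriors:
        best = max(range(len(probs)), key=lambda k: probs[k])
        labels.append(best if probs[best] >= threshold else "neutral")
    return labels

def conditional_misclassification(labels, truth, n_classes=3):
    """For each true class, the fraction of its items given a wrong class label.

    Items placed in the neutral zone are not counted as errors.
    """
    rates = []
    for c in range(n_classes):
        idx = [i for i, t in enumerate(truth) if t == c]
        wrong = sum(1 for i in idx if labels[i] not in (c, "neutral"))
        rates.append(wrong / len(idx) if idx else 0.0)
    return rates
```

Raising `threshold` moves more items into the neutral zone, which can only lower these rates, at the cost of leaving more items undecided.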
Machine learning‐based score likelihood ratios (SLRs) have emerged as alternatives to traditional likelihood ratios and Bayes factors to quantify the value of evidence when contrasting two opposing propositions. When developing a conventional statistical model is infeasible, machine learning can be used to construct a (dis)similarity score for complex data and estimate the ratio of the conditional distributions of the scores. Under the common source problem, the opposing propositions address if two items come from the same source. To develop their SLRs, practitioners create datasets using pairwise comparisons from a background population sample. These comparisons result in a complex dependence structure that violates the independence assumption made by many popular methods. We propose a resampling step to remedy this lack of independence and an ensemble approach to enhance the performance of SLR systems. First, we introduce a source‐aware resampling plan to construct datasets where the independence assumption is met. Using these newly created sets, we train multiple base SLRs and aggregate their outputs into a final value of evidence. Our experimental results show that this ensemble SLR can outperform a traditional SLR approach in terms of the rate of misleading evidence and discriminatory power and present more consistent results.
{"title":"Ensemble learning for score likelihood ratios under the common source problem","authors":"Federico Veneri, Danica M. Ommen","doi":"10.1002/sam.11637","DOIUrl":"https://doi.org/10.1002/sam.11637","url":null,"abstract":"Machine learning‐based score likelihood ratios (SLRs) have emerged as alternatives to traditional likelihood ratios and Bayes factors to quantify the value of evidence when contrasting two opposing propositions. When developing a conventional statistical model is infeasible, machine learning can be used to construct a (dis)similarity score for complex data and estimate the ratio of the conditional distributions of the scores. Under the common source problem, the opposing propositions address if two items come from the same source. To develop their SLRs, practitioners create datasets using pairwise comparisons from a background population sample. These comparisons result in a complex dependence structure that violates the independence assumption made by many popular methods. We propose a resampling step to remedy this lack of independence and an ensemble approach to enhance the performance of SLR systems. First, we introduce a source‐aware resampling plan to construct datasets where the independence assumption is met. Using these newly created sets, we train multiple base SLRs and aggregate their outputs into a final value of evidence. 
Our experimental results show that this ensemble SLR can outperform a traditional SLR approach in terms of the rate of misleading evidence and discriminatory power and present more consistent results.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125485310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
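A score likelihood ratio, as described in the abstract above, is the ratio of the score density under the same‐source proposition to the score density under the different‐source proposition. The sketch below estimates both densities with a fixed‐bandwidth Gaussian kernel density estimate and combines several base SLRs by averaging their logarithms. The density estimator, the bandwidth, the aggregation rule, and all names are assumptions for illustration; the paper's source‐aware resampling plan is not reproduced, and each `(same_scores, diff_scores)` pair simply stands in for one resampled dataset.

```python
import math

def kde(x, sample, bandwidth):
    """Gaussian kernel density estimate of sample, evaluated at x."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

def score_likelihood_ratio(score, same_scores, diff_scores, bandwidth=0.1):
    """Density of the score among same-source pairs over its density among different-source pairs."""
    return kde(score, same_scores, bandwidth) / kde(score, diff_scores, bandwidth)

def ensemble_slr(score, base_sets, bandwidth=0.1):
    """Combine base SLRs by the geometric mean (an assumed aggregation rule).

    base_sets: list of (same_scores, diff_scores) pairs, one per base SLR.
    """
    logs = [math.log(score_likelihood_ratio(score, s, d, bandwidth)) for s, d in base_sets]
    return math.exp(sum(logs) / len(logs))
```

Values above 1 support the same‐source proposition and values below 1 support the different‐source proposition. Scores far from every training score can make a density estimate underflow, so a production version would need to guard the ratio.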
In forensic practice, evaluating shoeprint evidence is challenging because the differences between images of two different outsoles can be subtle. In this paper, we propose a deep transfer learning‐based matching algorithm called the Shoe‐MS algorithm that quantifies the similarity between two outsole images. The Shoe‐MS algorithm consists of a Siamese neural network for two input images followed by a transfer learning component to extract features from outsole impression images. The added layers are finely tuned using images of shoe soles. To test the performance of the method we propose, we use a study dataset that is both realistic and challenging. The pairs of images for which we know ground truth include (1) close non‐matches and (2) mock‐crime scene pairs. The Shoe‐MS algorithm performed well in terms of prediction accuracy and was able to determine the source of pairs of outsole images, even when comparisons were challenging. When using a score‐based likelihood ratio, the algorithm made the correct decision with high probability in a test of the hypothesis that images had a common source. An important advantage of the proposed approach is that pairs of images can be compared without alignment. In initial tests, Shoe‐MS exhibited better‐discriminating power than existing methods.
{"title":"A finely tuned deep transfer learning algorithm to compare outsole images","authors":"Moon-Yeop Jang, Soyoung Park, A. Carriquiry","doi":"10.1002/sam.11636","DOIUrl":"https://doi.org/10.1002/sam.11636","url":null,"abstract":"In forensic practice, evaluating shoeprint evidence is challenging because the differences between images of two different outsoles can be subtle. In this paper, we propose a deep transfer learning‐based matching algorithm called the Shoe‐MS algorithm that quantifies the similarity between two outsole images. The Shoe‐MS algorithm consists of a Siamese neural network for two input images followed by a transfer learning component to extract features from outsole impression images. The added layers are finely tuned using images of shoe soles. To test the performance of the method we propose, we use a study dataset that is both realistic and challenging. The pairs of images for which we know ground truth include (1) close non‐matches and (2) mock‐crime scene pairs. The Shoe‐MS algorithm performed well in terms of prediction accuracy and was able to determine the source of pairs of outsole images, even when comparisons were challenging. When using a score‐based likelihood ratio, the algorithm made the correct decision with high probability in a test of the hypothesis that images had a common source. An important advantage of the proposed approach is that pairs of images can be compared without alignment. 
In initial tests, Shoe‐MS exhibited better‐discriminating power than existing methods.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"961 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124179773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This special issue of Statistical Analysis and Data Mining contains a selection of the papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG), scheduled for September 9–11, 2021 in Florence, Italy. Due to the COVID-19 pandemic, the conference was held online. The CLADAG is a Section of the Italian Statistical Society (SIS), and a member of the International Federation of Classification Societies (IFCS). It was founded in 1997 to promote advanced methodological research in multivariate statistics, focusing on Data Analysis and Classification. The Section organizes a biennial international scientific meeting, offers classification and data analysis courses, publishes a newsletter, and collaborates on planning conferences and meetings with other IFCS societies. The previous 12 CLADAG meetings were held in various locations throughout Italy: Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), Milano (2017), and Cassino (2019). Following a blind peer-review process, six papers presented at the conference and submitted to this special issue have been selected for publication. The articles cover a broad range of data analysis topics: gender gap analysis, income clustering, structural equation modeling, multivariate nonparametric methods, and classifier selection. Their content is briefly described below. In studying the gender gap, a relevant topic for promoting equality and social justice, Greselin et al. propose a new parametric approach utilizing the relative distribution method and Dagum parametric inference. Additionally, they assessed how to select covariates that impact gender gaps. The proposed approach is applied to measure and compare the gender gap in Poland and Italy, using data from the 2018 European Survey of Income and Living Conditions. 
In a related field, Condino proposes a procedure for clustering income data using a share density-based dynamic clustering algorithm. The paper compares subgroups’ income inequality using a dissimilarity measure based on information theory. This measure is then utilized for clustering, providing a prototype descriptor of income inequality for the clustered earners. The proposal is applied to data from the Survey on Households Income and Wealth by the Bank of Italy. The paper by Yu et al. introduces a refinement of the so-called Henseler–Ogasawara specification that integrates composites, linear combinations of variables, into structural equation models. This refined version addresses some concerns of the Henseler–Ogasawara specification, and it is less complex and less prone to misspecification mistakes. Additionally, the paper provides a strategy to compute standard errors. Statistical depth functions are a valuable tool for multivariate nonparametric data analysis, extending the concept of ranks, orderings, and quantiles to the multivaria
{"title":"CLADAG 2021 special issue: Selected papers on classification and data analysis","authors":"C. Bocci, A. Gottard, T. B. Murphy, G. C. Porzio","doi":"10.1002/sam.11633","DOIUrl":"https://doi.org/10.1002/sam.11633","url":null,"PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132268384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}