
Statistical Analysis and Data Mining: The ASA Data Science Journal — Latest Publications

Neural interval‐censored survival regression with feature selection
Pub Date : 2024-07-16 DOI: 10.1002/sam.11704
Carlos García Meixide, Marcos Matabuena, Louis Abraham, Michael R. Kosorok
Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high‐dimensional datasets, such as omics and medical image data. However, the literature on nonlinear regression algorithms and variable selection techniques for interval‐censoring is either limited or nonexistent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval‐censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: (i) a variable selection phase leveraging recent advances in sparse neural network architectures; (ii) a regression model targeting prediction of the interval‐censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real‐world applications that encompass scenarios related to diabetes and physical activity. Our approach outperforms traditional AFT algorithms, particularly in scenarios featuring nonlinear relationships.
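
The abstract above only outlines the approach; the sketch below, a minimal and hypothetical illustration rather than the authors' implementation, shows how an interval-censored log-normal AFT likelihood can be paired with a small neural network in PyTorch. The network sizes, the shared scale parameter, and the simulated intervals are all assumptions, and the paper's sparse-architecture variable-selection phase is omitted.

```python
import torch
import torch.nn as nn

class AFTNet(nn.Module):
    """Small MLP predicting the location (mean of log survival time)."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))
        self.log_sigma = nn.Parameter(torch.zeros(1))   # shared log-scale

    def forward(self, x):
        return self.body(x).squeeze(-1)

def interval_censored_nll(mu, log_sigma, log_left, log_right, eps=1e-8):
    # Negative log of P(L < T <= R) under log T ~ Normal(mu, sigma).
    normal = torch.distributions.Normal(mu, log_sigma.exp())
    prob = (normal.cdf(log_right) - normal.cdf(log_left)).clamp_min(eps)
    return -prob.log().mean()

# Toy data: observed intervals bracketing a log-time driven by two of ten features.
torch.manual_seed(0)
X = torch.randn(200, 10)
true_log_t = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * torch.randn(200)
log_left, log_right = true_log_t - 0.3, true_log_t + 0.3

model = AFTNet(n_features=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = interval_censored_nll(model(X), model.log_sigma, log_left, log_right)
    loss.backward()
    opt.step()
print(f"final interval-censored NLL: {loss.item():.3f}")
```
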
Citations: 0
Bayesian batch optimization for molybdenum versus tungsten inertial confinement fusion double shell target design
Pub Date : 2024-06-01 DOI: 10.1002/sam.11698
N. Vazirani, Ryan Sacks, Brian M. Haines, Michael J. Grosskopf, David J. Stark, Paul A. Bradley
Access to reliable, clean energy sources is a major concern for national security. Much research is focused on the “grand challenge” of producing energy via controlled fusion reactions in a laboratory setting. For fusion experiments, specifically inertial confinement fusion (ICF), to produce sufficient energy, the fusion reactions in the ICF fuel need to become self‐sustaining and burn deuterium‐tritium (DT) fuel efficiently. The recent record‐breaking NIF ignition shot was able to achieve this goal as well as produce more energy than was used to drive the experiment. This achievement brings self‐sustaining fusion‐based power systems closer than ever before, capable of providing humans with access to secure, renewable energy. In order to further progress toward the actualization of such power systems, more ICF experiments need to be conducted at large laser facilities such as the United States' National Ignition Facility (NIF) or France's Laser Mega‐Joule. The high cost per shot and limited number of shots that are possible per year make it prohibitive to perform large numbers of experiments. As such, experimental design relies heavily on complex predictive physics simulations for high‐fidelity “preshot” analysis. These multidimensional, multi‐physics, high‐fidelity simulations have to account for a variety of input parameters as well as model the extreme conditions (pressures and densities) present at ignition. Such simulations (especially in 3D) can become computationally prohibitive to turn around for each ICF experiment. In this work, we explore using Bayesian optimization with Gaussian processes (GPs) to find optimal designs for ICF double shell targets, while keeping computational costs to manageable levels. These double shell targets have an inner shell that grades from beryllium on the outer surface to the higher Z material molybdenum, as opposed to the nominally used tungsten, on the inside in order to trade off between the high performance associated with high density inner shells and capsule stability. We describe our results for “capsule‐only” xRAGE simulations to study the physics across different capsule designs and inner shell materials, and the potential for future experiments.
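
As a toy illustration of the optimization machinery described above (and not of the paper's actual pipeline), the following sketch runs Gaussian-process Bayesian optimization with an expected-improvement acquisition on a cheap one-dimensional stand-in for an expensive xRAGE simulation; the surrogate kernel, the sequential (batch size 1) loop, and the objective function are all assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def simulator(x):
    # Cheap analytic stand-in for an expensive "capsule-only" ICF simulation.
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, size=(4, 1))              # small initial design
y = simulator(X).ravel()
grid = np.linspace(-1, 2, 500).reshape(-1, 1)    # candidate designs

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(15):                               # sequential BO loop (batch size 1)
    gp.fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    imp = mu - y.max()
    z = np.divide(imp, sd, out=np.zeros_like(sd), where=sd > 0)
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)     # expected improvement
    x_next = grid[np.argmax(ei)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, simulator(x_next).ravel())

print("best design:", round(X[np.argmax(y)].item(), 3),
      "objective:", round(y.max(), 3))
```
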
Citations: 0
Gaussian process selections in semiparametric multi‐kernel machine regression for multi‐pathway analysis
Pub Date : 2024-06-01 DOI: 10.1002/sam.11699
Jiali Lin, Inyoung Kim
Analyzing correlated high‐dimensional data is a challenging problem in genomics, proteomics, and other related areas. For example, it is important to identify significant genetic pathway effects associated with biomarkers, where a gene pathway is a set of genes that functionally work together to regulate a certain biological process. A pathway‐based analysis can detect a subtle change in expression level that cannot be found using a gene‐based analysis. Here, we refer to a pathway as a set and a gene as an element in a set. However, it is challenging to select automatically which pathways are highly associated with the outcome when there are multiple pathways. In this paper, we propose a semiparametric multikernel regression model to study the effects of fixed covariates (e.g., clinical variables) and sets of elements (e.g., pathways of genes) to address the problem of detecting signal sets associated with biomarkers. We model the unknown high‐dimensional functions of multi‐sets via multiple Gaussian kernel machines to consider the possibility that elements within the same set interact with each other. Hence, our variable set selection can be considered a Gaussian process set selection. We develop our Gaussian process set selection under the Bayesian variance component‐selection framework. We incorporate prior knowledge for structural sets by imposing an Ising prior on the model. Our approach can be easily applied in high‐dimensional spaces where the sample size is smaller than the number of variables. An efficient variational Bayes algorithm is developed. We demonstrate the advantages of our approach through simulation studies and through a type II diabetes genetic‐pathway analysis.
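
The following is a schematic sketch of the multi-kernel idea only, not the paper's variational-Bayes procedure with an Ising prior: each hypothetical pathway (a set of columns) gets its own Gaussian (RBF) kernel, a crude cross-validated screening stands in for the Bayesian set selection, and the retained kernels are summed into one Gram matrix. The pathway definitions, regularization strengths, and threshold are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 120, 30
X = rng.normal(size=(n, p))
pathways = {"pathA": [0, 1, 2, 3], "pathB": [4, 5, 6], "pathC": list(range(7, 15))}
# Outcome depends nonlinearly on pathA only, so the other sets should drop out.
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=n)

# One Gaussian kernel (Gram matrix) per pathway, built from that pathway's columns.
K = {name: rbf_kernel(X[:, cols]) for name, cols in pathways.items()}

# Crude stand-in for set selection: keep a pathway if its kernel alone has
# positive cross-validated predictive value (the paper instead selects sets
# jointly via variance components with an Ising prior and variational Bayes).
scores = {name: cross_val_score(KernelRidge(alpha=0.1, kernel="precomputed"),
                                K[name], y, cv=5).mean() for name in pathways}
weights = {name: float(s > 0.1) for name, s in scores.items()}
K_combined = sum(w * K[name] for name, w in weights.items())

final = KernelRidge(alpha=0.1, kernel="precomputed").fit(K_combined, y)
print("pathway CV scores:", {k: round(v, 2) for k, v in scores.items()})
print("selected weights:", weights,
      "| in-sample R^2 of combined model:", round(final.score(K_combined, y), 2))
```
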
Citations: 0
An automated alignment algorithm for identification of the source of footwear impressions with common class characteristics
Pub Date : 2024-01-30 DOI: 10.1002/sam.11659
Hana Lee, Alicia Carriquiry, Soyoung Park
We introduce an algorithmic approach designed to compare similar shoeprint images, with automated alignment. Our method employs the Iterative Closest Points (ICP) algorithm to attain optimal alignment, further enhancing precision through phase‐only correlation. Utilizing diverse metrics to quantify similarity, we train a random forest model to predict the empirical probability that two impressions originate from the same shoe. Experimental evaluations using high‐quality two‐dimensional shoeprints showcase our proposed algorithm's robustness in managing dissimilarities between impressions from the same shoe, outperforming existing approaches.
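
To make the phase-only correlation step concrete, here is a minimal sketch of that single component (ICP alignment, the similarity metrics, and the random forest are omitted); the synthetic images and the translational-shift recovery are illustrative assumptions.

```python
import numpy as np

def phase_only_correlation(img_a, img_b, eps=1e-12):
    # Normalized cross-power spectrum: a sharp peak appears at the
    # (circular) translational offset between the two images.
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross = Fa * np.conj(Fb)
    poc = np.fft.ifft2(cross / (np.abs(cross) + eps)).real
    shift = np.unravel_index(np.argmax(poc), poc.shape)
    return poc, shift

rng = np.random.default_rng(0)
base = rng.random((128, 128))
shifted = np.roll(base, shift=(5, -9), axis=(0, 1))      # known offset
poc, peak = phase_only_correlation(base, shifted)
print("peak height:", round(poc.max(), 3), "estimated (circular) shift:", peak)
```
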
Citations: 0
Confidence bounds for threshold similarity graph in random variable network
Pub Date : 2023-09-07 DOI: 10.1002/sam.11642
P. Koldanov, A. Koldanov, D. P. Semenov
The problem of uncertainty in graph structure identification in a random variable network is considered. An approach for constructing upper and lower confidence bounds for graph structures is developed. This approach is applied to the construction of upper and lower confidence bounds for the threshold similarity graph. The stability of the confidence bounds and the gaps between the upper and lower confidence bounds are investigated. The theoretical results are illustrated by numerical experiments.
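
One simple way to realize this idea numerically, which is not necessarily the authors' construction, is sketched below: pairwise correlations define a threshold similarity graph, and simultaneous Fisher-z confidence intervals yield a lower-bound graph (edges whose interval lies entirely above the threshold) and an upper-bound graph (edges whose interval reaches above it). The correlation model, threshold, and Bonferroni adjustment are assumptions.

```python
import numpy as np
from scipy.stats import norm
from itertools import combinations

rng = np.random.default_rng(0)
n, p, threshold, alpha = 200, 6, 0.3, 0.05
cov = 0.5 * np.eye(p) + 0.5                      # exchangeable true correlation 0.5
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
R = np.corrcoef(X, rowvar=False)

pairs = list(combinations(range(p), 2))
z_crit = norm.ppf(1 - alpha / (2 * len(pairs)))  # Bonferroni-adjusted critical value
lower_edges, upper_edges = set(), set()
for i, j in pairs:
    z = np.arctanh(R[i, j])                      # Fisher z-transform
    half = z_crit / np.sqrt(n - 3)
    lo, hi = np.tanh(z - half), np.tanh(z + half)
    if lo > threshold:
        lower_edges.add((i, j))                  # edge certainly present (lower graph)
    if hi > threshold:
        upper_edges.add((i, j))                  # edge possibly present (upper graph)

print(len(lower_edges), "edges in lower-bound graph;",
      len(upper_edges), "edges in upper-bound graph")
```
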
Citations: 0
An Improved D2GAN‐based oversampling algorithm for imbalanced data classification
Pub Date : 2023-08-25 DOI: 10.1002/sam.11640
Xiaoqiang Zhao, Qi Yao
To address the problems of pattern collapse, uncontrollable data generation, and high overlap rates that arise when a generative adversarial network (GAN) oversamples imbalanced data, we propose an imbalanced-data oversampling algorithm based on improved dual discriminator generative adversarial nets (D2GAN). First, we integrate the positive class attribute information into the generator and the discriminator to ensure that the generator only generates samples for the positive class, which overcomes the problem of uncontrollable data generation by the generator. Second, we introduce a classifier into D2GAN to discriminate between the generated samples and the original data, which avoids overlap between the generated samples and the negative class samples and ensures the diversity of the generated samples, thereby solving the problem of pattern collapse. Finally, the performance of the proposed algorithm is evaluated on nine datasets in oversampling experiments with SVM and neural network classifiers; the results show that the proposed algorithm effectively improves the classification performance on imbalanced data.
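
The skeleton below is a schematic illustration of the components named in the abstract, not the authors' code: a generator conditioned on the positive-class label, the two discriminators of D2GAN, and an auxiliary classifier; network sizes are arbitrary and the training losses are only indicated in comments.

```python
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM, N_CLASSES = 20, 16, 2

def mlp(d_in, d_out, hidden=64):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

class Generator(nn.Module):
    """Generates tabular samples conditioned on a one-hot class label."""
    def __init__(self):
        super().__init__()
        self.net = mlp(NOISE_DIM + N_CLASSES, N_FEATURES)
    def forward(self, z, onehot):
        return self.net(torch.cat([z, onehot], dim=1))

G = Generator()
D1 = mlp(N_FEATURES + N_CLASSES, 1)   # first discriminator (real vs. generated)
D2 = mlp(N_FEATURES + N_CLASSES, 1)   # second discriminator (D2GAN's reverse-KL role)
C = mlp(N_FEATURES, N_CLASSES)        # classifier keeping fakes off the negative class

# One illustrative pass conditioned on the positive (minority) class.
z = torch.randn(8, NOISE_DIM)
positive = torch.zeros(8, N_CLASSES)
positive[:, 1] = 1.0
fake = G(z, positive)
print(fake.shape, D1(torch.cat([fake, positive], dim=1)).shape, C(fake).shape)
# Training (omitted here) would alternate the dual-discriminator losses on real
# and generated positives with a cross-entropy penalty from C for generated
# samples that land on the negative-class side, then append the synthetic
# positives to the training set before fitting an SVM or neural-network classifier.
```
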
Citations: 0
A neutral zone classifier for three classes with an application to text mining
Pub Date : 2023-08-21 DOI: 10.1002/sam.11639
Dylan C. Friel, Yunzhe Li, Benjamin Ellis, D. Jeske, Herbert K. H. Lee, P. Kass
A classifier may be limited by its conditional misclassification rates more than its overall misclassification rate. In the case that one or more of the conditional misclassification rates are high, a neutral zone may be introduced to decrease and possibly balance the misclassification rates. In this paper, a neutral zone is incorporated into a three‐class classifier with its region determined by controlling conditional misclassification rates. The neutral zone classifier is illustrated with a text mining application that classifies written comments associated with student evaluations of teaching.
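
A simplified stand-in for this construction is sketched below: a base three-class classifier produces class probabilities, predictions are withheld (placed in the neutral zone) unless the top probability clears a threshold, and the threshold is chosen on held-out data so that each predicted class's conditional misclassification rate stays below a target. The dataset, base model, and tuning rule are assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def conditional_error_rates(proba, y_true, tau):
    """Error rate within each predicted class, counting only decided cases."""
    pred, decided = proba.argmax(axis=1), proba.max(axis=1) >= tau
    rates = [np.mean(y_true[decided & (pred == k)] != k)
             if (decided & (pred == k)).any() else 0.0
             for k in range(proba.shape[1])]
    return np.array(rates), decided.mean()

proba_val = clf.predict_proba(X_val)
target = 0.10                                    # tolerated conditional error rate
for tau in np.linspace(0.34, 0.99, 66):          # smallest threshold meeting the target
    rates, kept = conditional_error_rates(proba_val, y_val, tau)
    if rates.max() <= target:
        break
print(f"neutral-zone threshold tau={tau:.2f}, "
      f"conditional error rates={np.round(rates, 3)}, fraction classified={kept:.2f}")
```
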
Citations: 0
Ensemble learning for score likelihood ratios under the common source problem
Pub Date : 2023-08-04 DOI: 10.1002/sam.11637
Federico Veneri, Danica M. Ommen
Machine learning‐based score likelihood ratios (SLRs) have emerged as alternatives to traditional likelihood ratios and Bayes factors to quantify the value of evidence when contrasting two opposing propositions. When developing a conventional statistical model is infeasible, machine learning can be used to construct a (dis)similarity score for complex data and estimate the ratio of the conditional distributions of the scores. Under the common source problem, the opposing propositions address whether two items come from the same source. To develop their SLRs, practitioners create datasets using pairwise comparisons from a background population sample. These comparisons result in a complex dependence structure that violates the independence assumption made by many popular methods. We propose a resampling step to remedy this lack of independence and an ensemble approach to enhance the performance of SLR systems. First, we introduce a source‐aware resampling plan to construct datasets where the independence assumption is met. Using these newly created sets, we train multiple base SLRs and aggregate their outputs into a final value of evidence. Our experimental results show that this ensemble SLR can outperform a traditional SLR approach in terms of the rate of misleading evidence and discriminatory power and present more consistent results.
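
The sketch below is a simplified schematic of the source-aware resampling and ensembling idea, not the authors' procedure: each resampled comparison set uses every source at most once, several base models are fit on these sets, and their log-odds outputs are averaged as a stand-in for an aggregated log score likelihood ratio (a full SLR would estimate the ratio of score densities under the two propositions). The simulated sources, features, and base learner are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_sources, items_per_source, dim = 60, 4, 5
sources = np.repeat(np.arange(n_sources), items_per_source)
items = rng.normal(size=(n_sources, 1, dim)).repeat(items_per_source, axis=1)
items = (items + 0.3 * rng.normal(size=items.shape)).reshape(-1, dim)

def source_aware_pairs(rng):
    """One comparison dataset in which every source appears at most once."""
    order = list(rng.permutation(n_sources))
    X, y = [], []
    for a in order[: n_sources // 3]:                     # same-source comparisons
        i, j = rng.choice(np.where(sources == a)[0], 2, replace=False)
        X.append(np.abs(items[i] - items[j])); y.append(1)
    rest = order[n_sources // 3:]                          # different-source comparisons
    for a, b in zip(rest[0::2], rest[1::2]):
        i = rng.choice(np.where(sources == a)[0])
        j = rng.choice(np.where(sources == b)[0])
        X.append(np.abs(items[i] - items[j])); y.append(0)
    return np.array(X), np.array(y)

base_models = [LogisticRegression(max_iter=1000).fit(*source_aware_pairs(rng))
               for _ in range(25)]                         # ensemble of base score models

def ensemble_score(item_i, item_j):
    # Average log-odds across base models: a stand-in for an aggregated log-SLR.
    feats = np.abs(item_i - item_j).reshape(1, -1)
    return np.mean([m.decision_function(feats)[0] for m in base_models])

print("same source:", round(ensemble_score(items[0], items[1]), 2),
      "| different source:", round(ensemble_score(items[0], items[-1]), 2))
```
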
Citations: 1
A finely tuned deep transfer learning algorithm to compare outsole images
Pub Date : 2023-07-28 DOI: 10.1002/sam.11636
Moon-Yeop Jang, Soyoung Park, A. Carriquiry
In forensic practice, evaluating shoeprint evidence is challenging because the differences between images of two different outsoles can be subtle. In this paper, we propose a deep transfer learning‐based matching algorithm called the Shoe‐MS algorithm that quantifies the similarity between two outsole images. The Shoe‐MS algorithm consists of a Siamese neural network for two input images followed by a transfer learning component to extract features from outsole impression images. The added layers are finely tuned using images of shoe soles. To test the performance of the method we propose, we use a study dataset that is both realistic and challenging. The pairs of images for which we know ground truth include (1) close non‐matches and (2) mock‐crime scene pairs. The Shoe‐MS algorithm performed well in terms of prediction accuracy and was able to determine the source of pairs of outsole images, even when comparisons were challenging. When using a score‐based likelihood ratio, the algorithm made the correct decision with high probability in a test of the hypothesis that images had a common source. An important advantage of the proposed approach is that pairs of images can be compared without alignment. In initial tests, Shoe‐MS exhibited better‐discriminating power than existing methods.
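
As an architectural illustration only (not the Shoe-MS code), the sketch below builds a Siamese network whose twin branches share a CNN backbone intended for transfer learning, with a small trainable head producing a similarity score; the ResNet-18 backbone, frozen layers, and head sizes are assumptions, and pretrained ImageNet weights would normally be loaded where indicated.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseOutsoleNet(nn.Module):
    def __init__(self):
        super().__init__()
        # In practice: models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()                 # reuse the 512-d feature vector
        self.backbone = backbone
        for p in self.backbone.parameters():        # freeze transferred layers,
            p.requires_grad = False                 # fine-tune only the added head
        self.head = nn.Sequential(nn.Linear(512 * 2, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_a, img_b):
        fa, fb = self.backbone(img_a), self.backbone(img_b)
        return self.head(torch.cat([fa, fb], dim=1)).squeeze(-1)  # similarity logit

model = SiameseOutsoleNet().eval()
pair = torch.randn(2, 2, 3, 224, 224)               # two dummy outsole image pairs
with torch.no_grad():
    print(torch.sigmoid(model(pair[:, 0], pair[:, 1])))  # similarity scores in (0, 1)
```
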
Citations: 0
CLADAG 2021 special issue: Selected papers on classification and data analysis
Pub Date : 2023-07-04 DOI: 10.1002/sam.11633
C. Bocci, A. Gottard, T. B. Murphy, G. C. Porzio
This special issue of Statistical Analysis and Data Mining contains a selection of the papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG), scheduled for September 9–11, 2021 in Florence, Italy. Due to the COVID-19 pandemic, the conference was held online. The CLADAG is a Section of the Italian Statistical Society (SIS), and a member of the International Federation of Classification Societies (IFCS). It was founded in 1997 to promote advanced methodological research in multivariate statistics, focusing on Data Analysis and Classification. The Section organizes a biennial international scientific meeting, offers classification and data analysis courses, publishes a newsletter, and collaborates on planning conferences and meetings with other IFCS societies. The previous 12 CLADAG meetings were held in various locations throughout Italy: Pescara (1997), Roma (1999), Palermo (2001), Bologna (2003), Parma (2005), Macerata (2007), Catania (2009), Pavia (2011), Modena and Reggio Emilia (2013), Cagliari (2015), Milano (2017), and Cassino (2019).

Following a blind peer-review process, six papers presented at the conference and submitted to this special issue have been selected for publication. The articles cover a broad range of data analysis topics: gender gap analysis, income clustering, structural equation modeling, multivariate nonparametric methods, and classifier selection. Their content is briefly described below.

In studying the gender gap, a relevant topic for promoting equality and social justice, Greselin et al. propose a new parametric approach utilizing the relative distribution method and Dagum parametric inference. Additionally, they assess how to select covariates that impact gender gaps. The proposed approach is applied to measure and compare the gender gap in Poland and Italy, using data from the 2018 European Survey of Income and Living Conditions. In a related field, Condino proposes a procedure for clustering income data using a share density-based dynamic clustering algorithm. The paper compares subgroups' income inequality using a dissimilarity measure based on information theory. This measure is then utilized for clustering, providing a prototype descriptor of income inequality for the clustered earners. The proposal is applied to data from the Survey on Households Income and Wealth by the Bank of Italy. The paper by Yu et al. introduces a refinement of the so-called Henseler–Ogasawara specification that integrates composites, linear combinations of variables, into structural equation models. This refined version addresses some concerns of the Henseler–Ogasawara specification, and it is less complex and less prone to misspecification mistakes. Additionally, the paper provides a strategy to compute standard errors.

Statistical depth functions are a valuable tool for multivariate nonparametric data analysis, extending the concept of ranks, orderings, and quantiles to the multivariate setting. The paper by Laketa and Nagy explores a fundamental open problem of contemporary depth research, the so-called representation and reconstruction problem, focusing on simplicial depth; their results are illustrated through several insightful examples. On the same topic, Nagy revisits the classical definition of simplicial depth and explores its theoretical properties; in particular, the properties of the simplicial median are studied, and exact simplicial depths are provided in several scenarios, outlining undesirable behavior of this depth function. Carpita and Golia address the problem of selecting a rule for assigning units to classes given estimated probabilities; in particular, the classical Bayes classifier, which minimizes the expected misclassification rate, is compared with the Max Difference and Max Ratio classifiers, illustrating when each should be preferred. The findings are illustrated through an extensive simulation study and applications to benchmark datasets. Taken together, we believe this special issue accurately portrays the scientific profile of today's CLADAG community and supports CLADAG's mission of promoting the exchange of ideas on classification and data analysis.
Citations: 0