Statistical Analysis and Data Mining: Latest Articles

Quantifying Epistemic Uncertainty in Binary Classification via Accuracy Gain
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-18 | DOI: 10.1002/sam.11709
Christopher Qian, Tyler Ganter, Joshua Michalenko, Feng Liang, Jason Adams
Recently, a surge of interest has been given to quantifying epistemic uncertainty (EU), the reducible portion of uncertainty due to lack of data. We propose a novel EU estimator in the binary classification setting, as the posterior expected value of the empirical gain in accuracy between the current prediction and the optimal prediction. In order to validate the performance of our EU estimator, we introduce an experimental procedure where we take an existing dataset, remove a set of points, and compare the estimated EU with the observed change in accuracy. Through real and simulated data experiments, we demonstrate the effectiveness of our proposed EU estimator.
Citations: 0
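The estimator described in the abstract above (the posterior expected gain in accuracy between the current prediction and the optimal prediction) can be illustrated with a minimal sketch. It assumes posterior draws of P(y = 1 | x) are already available for each point, takes the current prediction to be the thresholded posterior mean, and averages the gap to the best prediction under each draw. The function name `accuracy_gain`, the 0.5 threshold, and the layout of the draws are assumptions made for illustration; this is one reading of the abstract, not the authors' implementation.

```python
def accuracy_gain(posterior_draws):
    """Average posterior gap between optimal and current accuracy.

    posterior_draws: list of draws; each draw is a list holding one
    value of P(y = 1 | x_i) per data point (hypothetical layout).
    """
    n_draws = len(posterior_draws)
    n_points = len(posterior_draws[0])

    # Current prediction: threshold the posterior mean probability at 0.5.
    post_mean = [
        sum(draw[i] for draw in posterior_draws) / n_draws
        for i in range(n_points)
    ]
    current = [1 if m >= 0.5 else 0 for m in post_mean]

    total = 0.0
    for draw in posterior_draws:
        for p, y_hat in zip(draw, current):
            # Accuracy of the best possible label if p were the truth.
            optimal_acc = max(p, 1.0 - p)
            # Accuracy of the label we actually predict, under the same p.
            current_acc = p if y_hat == 1 else 1.0 - p
            total += optimal_acc - current_acc
    return total / (n_draws * n_points)


# Two posterior draws over two points; the draws disagree on the second point.
print(accuracy_gain([[0.9, 0.2], [0.8, 0.7]]))
```

The gap is zero whenever every draw agrees with the current label, so under this sketch the estimate grows only where the posterior is split across the decision boundary.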
A new logarithmic multiplicative distortion for correlation analysis
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-23 | DOI: 10.1002/sam.11708
Siming Deng, Jun Zhang
We study the Pearson correlation coefficient in a logarithmic manner under the presence of multiplicative distortion measurement errors. In this context, the observed variables with logarithmic transformation are distorted in multiplicative fashions by an observed confounding variable. The proposed multiplicative distortion model in this paper is applied to analyze positive variables. We utilize the conditional mean calibration and the conditional absolute mean calibration methods to obtain the calibrated variables. Furthermore, we propose confidence intervals based on asymptotic normality, empirical likelihood, and jackknife empirical likelihood. Simulation studies demonstrate the effectiveness of the proposed estimation procedure, and a real‐world example is analyzed to illustrate its practical application.
Citations: 0
Revisiting Winnow: A modified online feature selection algorithm for efficient binary classification
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-30 | DOI: 10.1002/sam.11707
Y. Narasimhulu, Pralhad Kolambkar, Venkaiah V. China
Winnow is an efficient binary classification algorithm that effectively learns from data even in the presence of a large number of irrelevant attributes. It is specifically designed for online learning scenarios. Unlike the Perceptron algorithm, Winnow employs a multiplicative weight update function, which leads to fewer mistakes and faster convergence. However, the original Winnow algorithm has several limitations: it only works on binary data, and its weight updates are constant and do not depend on the input features. In this article, we propose a modified version of the Winnow algorithm that addresses these limitations. The proposed algorithm is capable of handling real‐valued data and updates the learning function based on the input feature vector. To evaluate the performance of our proposed algorithm, we compare it with seven existing variants of the Winnow algorithm on datasets of varying sizes. We employ various evaluation metrics and parameters to assess and compare the performance of the algorithms. The experimental results demonstrate that our proposed algorithm outperforms all the other algorithms used for comparison, highlighting its effectiveness in classification tasks.
Citations: 0
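For readers unfamiliar with the baseline discussed in the Winnow abstract above, the following is a minimal sketch of the classic Winnow2 rule on binary features, with its constant multiplicative updates. It illustrates the original algorithm and the two limitations the abstract names; it is not the modified algorithm proposed in the article. The function names, the default `alpha=2.0`, the threshold equal to the number of features, and the epoch cap are choices made for this illustration.

```python
def winnow_train(samples, alpha=2.0, epochs=100):
    """Train classic Winnow2 on (binary feature tuple, 0/1 label) pairs."""
    n = len(samples[0][0])
    weights = [1.0] * n          # all weights start at 1
    threshold = float(n)         # a common choice of threshold
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            score = sum(w for w, xi in zip(weights, x) if xi)
            pred = 1 if score >= threshold else 0
            if pred == y:
                continue
            mistakes += 1
            # Promote active weights on a false negative and demote them on
            # a false positive; the factor is constant, not input-dependent.
            factor = alpha if y == 1 else 1.0 / alpha
            weights = [w * factor if xi else w for w, xi in zip(weights, x)]
        if mistakes == 0:        # a clean pass means the data are fit
            break
    return weights, threshold


def winnow_predict(weights, threshold, x):
    """Predict 1 when the weighted sum of active features reaches the threshold."""
    score = sum(w for w, xi in zip(weights, x) if xi)
    return 1 if score >= threshold else 0


# Target concept: x0 OR x1, with two irrelevant attributes.
data = [((a, b, c, d), 1 if (a or b) else 0)
        for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1)]
w, t = winnow_train(data)
print(w, t)
```

Because every update multiplies or divides by the same `alpha`, the step size ignores the magnitude of the inputs, which is the behaviour the abstract describes as a limitation of the original algorithm.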
A random forest approach for interval selection in functional regression
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-24 | DOI: 10.1002/sam.11705
Rémi Servien, Nathalie Vialaneix
In this article, we focus on the problem of variable selection in a functional regression framework. This question is motivated by practical applications in the field of agronomy: In this field, identifying the temporal periods during which weather measurements have the greatest impact on yield is critical for guiding agriculture practices in a changing environment. From a methodological point of view, our goal is to identify consecutive measurement points in the definition domain of the functional predictors, which correspond to the most important intervals for the prediction of a numeric output from the functional variables. We propose an approach based on the versatile random forest method that benefits from its good performances for variable selection and prediction. Our method builds in three steps (interval creation, summary, and selection). Different variants for each of the steps are proposed and compared on both simulated and real‐life datasets. The performances of our method compared to alternative approaches highlight its usefulness to select relevant intervals while maintaining good prediction capabilities. All variants of our method are available in the R package SISIR.
Citations: 0
Characterizing climate pathways using feature importance on echo state networks
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-23 | DOI: 10.1002/sam.11706
Katherine Goode, Daniel Ries, Kellie McClernon
The 2022 National Defense Strategy of the United States listed climate change as a serious threat to national security. Climate intervention methods, such as stratospheric aerosol injection, have been proposed as mitigation strategies, but the downstream effects of such actions on a complex climate system are not well understood. The development of algorithmic techniques for quantifying relationships between source and impact variables related to a climate event (i.e., a climate pathway) would help inform policy decisions. Data‐driven deep learning models have become powerful tools for modeling highly nonlinear relationships and may provide a route to characterize climate variable relationships. In this paper, we explore the use of an echo state network (ESN) for characterizing climate pathways. ESNs are a computationally efficient neural network variation designed for temporal data, and recent work proposes ESNs as a useful tool for forecasting spatiotemporal climate data. However, ESNs are noninterpretable black‐box models along with other neural networks. The lack of model transparency poses a hurdle for understanding variable relationships. We address this issue by developing feature importance methods for ESNs in the context of spatiotemporal data to quantify variable relationships captured by the model. We conduct a simulation study to assess and compare the feature importance techniques, and we demonstrate the approach on reanalysis climate data. In the climate application, we consider a time period that includes the 1991 volcanic eruption of Mount Pinatubo. This event was a significant stratospheric aerosol injection, which acts as a proxy for an anthropogenic stratospheric aerosol injection. We are able to use the proposed approach to characterize relationships between pathway variables associated with this event that agree with relationships previously identified by climate scientists.
Citations: 0
Two‐sample testing for random graphs
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-06-27 | DOI: 10.1002/sam.11703
Xiaoyi Wen
The employment of two‐sample hypothesis testing in examining random graphs has been a prevalent approach in diverse fields such as social sciences, neuroscience, and genetics. We advance a spectral‐based two‐sample hypothesis testing methodology to test the latent position random graphs. We propose two distinct asymptotic normal statistics, each optimally designed for two different models—the elementary Erdős–Rényi model and the more complex latent position random graph model. For the latter, the spectral embedding of the adjacency matrix was utilized to estimate the test statistic. The proposed method exhibited superior efficacy as it accomplished higher power than the conventional method of mean estimation. To validate our hypothesis testing procedure, we applied it to empirical biological data to discern structural variances in gene co‐expression networks between COVID‐19 patients and individuals who remained unaffected by the disease.
Citations: 0
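To make the simplest case in the random-graph abstract above concrete, the sketch below compares two undirected Erdős–Rényi graphs by their edge densities using a textbook pooled two-proportion z-test. This is a stand-in chosen for illustration, not the spectral statistic the article proposes; the function names and the adjacency-matrix input format (symmetric 0/1 lists of lists, no self-loops) are assumptions.

```python
import math


def edge_count(adj):
    """Count edges in the upper triangle of a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


def er_two_sample_z(adj_a, adj_b):
    """Pooled z-statistic for H0: both graphs share one edge probability."""
    m_a = len(adj_a) * (len(adj_a) - 1) // 2   # number of possible edges
    m_b = len(adj_b) * (len(adj_b) - 1) // 2
    e_a, e_b = edge_count(adj_a), edge_count(adj_b)
    pooled = (e_a + e_b) / (m_a + m_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / m_a + 1.0 / m_b))
    if se == 0.0:
        raise ValueError("both graphs are empty or both are complete")
    return (e_a / m_a - e_b / m_b) / se


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# A complete graph versus an empty graph on four nodes.
complete = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
empty = [[0] * 4 for _ in range(4)]
z = er_two_sample_z(complete, empty)
print(z, two_sided_p_value(z))
```

The normal approximation is only reasonable when both graphs have many possible edges; the four-node example is there to show the mechanics, not a realistic sample size.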
Cost‐sensitive classification with time constraint on incomplete data
IF 1.3 | Zone 4, Mathematics | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-06-25 | DOI: 10.1002/sam.11702
Yong‐Shiuan Lee, Chia‐Chi Wu
Missing values are common, but dealing with them by an inappropriate method may lead to large classification errors. Empirical evidence shows that tree‐based classification algorithms such as classification and regression tree (CART) can benefit from imputation, especially multiple imputation. Nevertheless, less attention has been paid to incorporating multiple imputation into cost‐sensitive decision tree induction. This study focuses on the treatment of missing data based on a time‐constrained minimal‐cost tree algorithm. We introduce various approaches to handle incomplete data into the algorithm including complete‐case analysis, missing‐value branch, single imputation, feature acquisition, and multiple imputation. A simulation study under different scenarios examines the predictive performances of the proposed strategies. The simulation results show that the combination of the algorithm with multiple imputation can assure classification accuracy under the budget. A real medical data example provides insights into the problem of missing values in cost‐sensitive learning and the advantages of the proposed methods.
Citations: 0
Sequential metamodel‐based approaches to level‐set estimation under heteroscedasticity
IF 1.3 | Zone 4, Mathematics | Q2 Mathematics | Pub Date: 2024-05-29 | DOI: 10.1002/sam.11697
Yutong Zhang, Xi Chen
This paper proposes two sequential metamodel‐based methods for level‐set estimation (LSE) that leverage the uniform bound built on stochastic kriging: predictive variance reduction (PVR) and expected classification improvement (ECI). We show that PVR and ECI possess desirable theoretical performance guarantees and provide closed‐form expressions for their respective sequential sampling criteria to seek the next design point for performing simulation runs, allowing computationally efficient one‐iteration look‐ahead updates. To enhance understanding, we reveal the connection between PVR and ECI's sequential sampling criteria. Additionally, we propose integrating a budget allocation feature with PVR and ECI, which improves computational efficiency and potentially enhances robustness to the impacts of heteroscedasticity. Numerical studies demonstrate the superior performance of the proposed methods compared to state‐of‐the‐art benchmarking approaches when given a fixed simulation budget, highlighting their effectiveness in addressing LSE problems.
Citations: 0
Towards accelerating particle‐resolved direct numerical simulation with neural operators
IF 1.3 | Zone 4, Mathematics | Q2 Mathematics | Pub Date: 2024-05-29 | DOI: 10.1002/sam.11690
Mohammad Atif, Vanessa López‐Marrero, Tao Zhang, Abdullah Al Muti Sharfuddin, Kwangmin Yu, Jiaqi Yang, Fan Yang, Foluso Ladeinde, Yangang Liu, Meifeng Lin, Lingda Li
We present our ongoing work aimed at accelerating a particle‐resolved direct numerical simulation model designed to study aerosol–cloud–turbulence interactions. The dynamical model consists of two main components—a set of fluid dynamics equations for air velocity, temperature, and humidity, coupled with a set of equations for particle (i.e., cloud droplet) tracing. Rather than attempting to replace the original numerical solution method in its entirety with a machine learning (ML) method, we consider developing a hybrid approach. We exploit the potential of neural operator learning to yield fast and accurate surrogate models and, in this study, develop such surrogates for the velocity and vorticity fields. We discuss results from numerical experiments designed to assess the performance of ML architectures under consideration as well as their suitability for capturing the behavior of relevant dynamical systems.
Citations: 0
Nonparametric mean and variance adaptive classification rule for high‐dimensional data with heteroscedastic variances
IF 1.3 Zone 4, Mathematics Q2 Mathematics Pub Date: 2024-05-20 DOI: 10.1002/sam.11689
Seungyeon Oh, Hoyoung Park
In this study, we introduce an innovative methodology aimed at enhancing Fisher's Linear Discriminant Analysis (LDA) in the context of high‐dimensional data classification scenarios, specifically addressing situations where each feature exhibits distinct variances. Our approach leverages Nonparametric Maximum Likelihood Estimation (NPMLE) techniques to estimate both the mean and variance parameters. By accommodating varying variances among features, our proposed method leads to notable improvements in classification performance. In particular, unlike numerous prior studies that assume the distribution of heterogeneous variances follows a right‐skewed inverse gamma distribution, our proposed method demonstrates excellent performance even when the distribution of heterogeneous variances takes on left‐skewed, symmetric, or right‐skewed forms. We conducted a series of rigorous experiments to empirically validate the effectiveness of our approach. The results of these experiments demonstrate that our proposed methodology excels in accurately classifying high‐dimensional data characterized by heterogeneous variances.
Citations: 0