
Statistical Analysis and Data Mining: Latest Publications

Boosting diversity in regression ensembles
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-12-30 · DOI: 10.1002/sam.11654
Mathias Bourel, Jairo Cugliari, Yannig Goude, Jean-Michel Poggi
Ensemble methods, such as Bagging, Boosting, or Random Forests, often enhance the prediction performance of single learners on both classification and regression tasks. In the context of regression, we propose a gradient boosting-based algorithm incorporating a diversity term with the aim of constructing different learners that enrich the ensemble, trading some individual optimality for global enhancement. Verifying the hypotheses of Biau and Cadre's theorem (2021, Advances in contemporary statistics and econometrics—Festschrift in honour of Christine Thomas-Agnan, Springer), we present a convergence result ensuring that the associated optimization strategy reaches the global optimum. In the experiments, we consider a variety of base learners of increasing complexity: stumps, regression trees, Purely Random Forests, and Breiman's Random Forests. Finally, we consider simulated and benchmark datasets and a real-world electricity demand dataset to show, by means of numerical experiments, the suitability of our procedure by examining the behavior not only of the final or aggregated predictor but also of the whole generated sequence.
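The paper's specific diversity term and convergence setting are not reproduced here; the sketch below only illustrates the general idea under simplifying assumptions (squared loss, stump base learners, and a hypothetical penalty `lam` on correlation between a candidate learner and the current ensemble).

```python
# Toy sketch (not the paper's algorithm): gradient boosting on squared loss with
# stump base learners, where each step trades a little residual fit for diversity
# by penalizing correlation between the candidate's predictions and the current
# ensemble's predictions. The diversity weight `lam` is an illustrative knob.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=500)

n_rounds, learning_rate, lam = 200, 0.1, 0.5
F = np.full(len(y), y.mean())        # current ensemble prediction
learners = []

for _ in range(n_rounds):
    residual = y - F
    # candidate stumps trained on different bootstrap draws of the residuals
    candidates = []
    for _ in range(5):
        idx = rng.integers(0, len(y), len(y))
        candidates.append(DecisionTreeRegressor(max_depth=1).fit(X[idx], residual[idx]))

    def score(stump):
        p = stump.predict(X)
        fit = -np.mean((residual - p) ** 2)          # how well it fits the residuals
        if np.std(p) > 0 and np.std(F) > 0:
            div = -abs(np.corrcoef(p, F)[0, 1])      # reward disagreement with ensemble
        else:
            div = 0.0
        return fit + lam * div

    best = max(candidates, key=score)                # trade a bit of fit for diversity
    F = F + learning_rate * best.predict(X)
    learners.append(best)

print("final training MSE:", round(float(np.mean((y - F) ** 2)), 4))
```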
Citations: 0
A machine learning oracle for parameter estimation
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-12-09 · DOI: 10.1002/sam.11651
Lucas Koepke, Mary Gregg, Michael Frey
Competing procedures, involving data smoothing, weighting, imputation, outlier removal, etc., may be available to prepare data for parametric model estimation. Often, however, little is known about the best choice of preparatory procedure for the planned estimation and the observed data. A machine learning-based decision rule, an “oracle,” can be constructed in such cases to decide the best procedure from a set $\mathcal{C}$ of available preparatory procedures. The oracle learns the decision regions associated with $\mathcal{C}$ based on training data synthesized solely from the given data using model parameters with high posterior probability. An estimator in combination with an oracle to guide data preparation is called an oracle estimator. Oracle estimator performance is studied in two estimation problems: slope estimation in simple linear regression (SLR) and changepoint estimation in continuous two-linear-segments regression (CTLSR). In both examples, the regression response is given to be increasing, and the oracle must decide whether to isotonically smooth the response data preparatory to fitting the regression model. A measure of performance called headroom is proposed to assess the oracle's potential for reducing estimation error. Experiments with SLR and CTLSR find that, for important ranges of problem configurations, the headroom is high, the oracle's empirical performance is near the headroom, and the oracle estimator offers clear benefit.
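A minimal sketch of the oracle idea for the SLR slope example, with simplifying stand-ins for the paper's construction: plug-in parameters instead of posterior draws, hand-picked summary features, and a random forest classifier as the decision rule.

```python
# Sketch: synthesize data from the fitted SLR model, label each synthetic set by
# which preparation (isotonic smoothing vs. none) gives the smaller slope error,
# train a classifier on summary features, and let it choose the preparation for
# the observed data. All modeling choices here are illustrative stand-ins.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y_obs = 1.0 + 2.0 * x + rng.normal(scale=0.6, size=x.size)   # observed data

def slope(y):
    return np.polyfit(x, y, 1)[0]

def features(y):
    r = y - np.polyval(np.polyfit(x, y, 1), x)
    return [np.std(r), np.mean(np.diff(y) < 0), np.ptp(y)]

# plug-in parameters for the observed data
b1_hat = slope(y_obs)
b0_hat = np.mean(y_obs) - b1_hat * np.mean(x)
sigma_hat = np.std(y_obs - (b0_hat + b1_hat * x), ddof=2)

X_train, labels = [], []
for _ in range(500):
    y_sim = b0_hat + b1_hat * x + rng.normal(scale=sigma_hat, size=x.size)
    y_iso = IsotonicRegression().fit_transform(x, y_sim)
    err_raw = abs(slope(y_sim) - b1_hat)
    err_iso = abs(slope(y_iso) - b1_hat)
    X_train.append(features(y_sim))
    labels.append(int(err_iso < err_raw))      # 1 = isotonic smoothing helps

oracle = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, labels)
use_iso = oracle.predict([features(y_obs)])[0]
y_prep = IsotonicRegression().fit_transform(x, y_obs) if use_iso else y_obs
print("oracle chose isotonic smoothing:", bool(use_iso), "| slope estimate:", round(slope(y_prep), 3))
```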
Citations: 0
The generalized hyperbolic family and automatic model selection through the multiple-choice LASSO
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-12-08 · DOI: 10.1002/sam.11652
Luca Bagnato, Alessio Farcomeni, Antonio Punzo
We revisit the generalized hyperbolic (GH) distribution and its nested models. These include widely used parametric choices like the multivariate normal, skew-$t$, Laplace, and several others. We also introduce the multiple-choice LASSO, a novel penalized method for choosing among alternative constraints on the same parameter. A hierarchical multiple-choice Least Absolute Shrinkage and Selection Operator (LASSO) penalized likelihood is optimized to perform simultaneous model selection and inference within the GH family. We illustrate our approach through a simulation study and a real data example. The methodology proposed in this paper has been implemented in R functions which are available as supplementary material.
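The multiple-choice LASSO penalty itself is not reproduced here; as a crude, hypothetical stand-in, the sketch below selects among a few GH-nested families for univariate data by BIC-penalized maximum likelihood, just to illustrate automatic selection within the nested family.

```python
# Crude stand-in (not the paper's multiple-choice LASSO): compare maximum-likelihood
# fits of a few GH-nested candidate families by BIC. The paper instead optimizes a
# hierarchical multiple-choice LASSO penalty to perform this selection jointly with
# estimation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
data = stats.t.rvs(df=4, loc=1.0, scale=2.0, size=1000, random_state=rng)

candidates = {
    "normal": stats.norm,        # light-tailed nested case
    "t": stats.t,                # symmetric heavy-tailed nested case
    "laplace": stats.laplace,    # another nested special case
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)                             # maximum-likelihood fit
    loglik = dist.logpdf(data, *params).sum()
    k = len(params)                                     # number of free parameters
    results[name] = -2 * loglik + k * np.log(len(data)) # BIC

print({name: round(bic, 1) for name, bic in results.items()})
print("selected family:", min(results, key=results.get))
```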
Citations: 0
Modeling subpopulations for hierarchically structured data
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-11-22 · DOI: 10.1002/sam.11650
Andrew Simpson, Semhar Michael, Dylan Borchert, Christopher Saunders, Larry Tang
The field of forensic statistics offers a unique hierarchical data structure in which a population is composed of several subpopulations of sources and a sample is collected from each source. This subpopulation structure creates an additional layer of complexity. Hence, the data have a hierarchical structure in addition to the existence of underlying subpopulations. Finite mixtures are known for modeling heterogeneity; however, previous parameter estimation procedures assume that the data are generated through a simple random sampling process. We propose using a semi-supervised mixture modeling approach to model the subpopulation structure, which leverages the fact that the samples in a collection are known to come from the same source but from an unknown subpopulation. A simulation study and a real data analysis based on well-known glass datasets and a keystroke dynamics typing dataset show that the proposed approach performs better than other approaches previously used in practice.
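A minimal sketch of the source-level assignment idea (one-dimensional Gaussian components, plain EM, no labeled subset): because all observations from a source share one unknown component, the E-step computes responsibilities per source rather than per observation.

```python
# Sketch of source-level mixture assignment: every observation from a source shares
# one unknown component, so the E-step sums log-densities over the source's
# observations before computing responsibilities. Simplified relative to the paper
# (univariate Gaussians, no semi-supervised labels).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
K, n_sources, n_per = 2, 30, 8
true_means = [0.0, 3.0]
src_comp = rng.integers(0, K, n_sources)
data = [rng.normal(true_means[c], 1.0, n_per) for c in src_comp]

pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):                          # EM iterations
    # E-step: responsibility of each component for each *source*
    log_r = np.zeros((n_sources, K))
    for s, ys in enumerate(data):
        for k in range(K):
            log_r[s, k] = np.log(pi[k]) + norm.logpdf(ys, mu[k], sigma[k]).sum()
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: each source contributes all its observations with its responsibility
    ys_all = np.concatenate(data)
    for k in range(K):
        w = np.repeat(r[:, k], n_per)
        mu[k] = np.average(ys_all, weights=w)
        sigma[k] = np.sqrt(np.average((ys_all - mu[k]) ** 2, weights=w))
    pi = r.mean(axis=0)

print("estimated component means:", np.round(mu, 2))
```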
Citations: 0
Spatially-correlated time series clustering using location-dependent Dirichlet process mixture model
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-11-22 · DOI: 10.1002/sam.11649
Junsub Jung, Sungil Kim, Heeyoung Kim
The Dirichlet process mixture (DPM) model has been widely used as a Bayesian nonparametric model for clustering. However, the exchangeability assumption of the Dirichlet process is not valid for clustering spatially correlated time series, as these data are indexed both spatially and temporally. When analyzing spatially correlated time series, correlations between observations at proximal times and locations must be appropriately considered. In this study, we propose a location-dependent DPM model by extending the traditional DPM model for clustering spatially correlated time series. We model the temporal pattern as an infinite mixture of Gaussian processes while accounting for spatial dependency using a location-dependent Dirichlet process prior over the mixture components. This encourages the assignment of observations from proximal locations to the same cluster. By contrast, because mixture atoms for modeling temporal patterns are shared across space, observations with similar temporal patterns can still be grouped together even if they are located far apart. The proposed model also allows the number of clusters to be determined automatically in the clustering procedure. We validate the proposed model using simulated examples. Moreover, in a real case study, we cluster adjacent roads based on their traffic speed patterns, which changed as a result of a traffic accident that occurred in Seoul, South Korea.
Citations: 0
Input-response space-filling designs incorporating response uncertainty
IF 1.3 · CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-11-20 · DOI: 10.1002/sam.11648
Xiankui Yang, Lu Lu, Christine M. Anderson-Cook
Traditionally, space-filling designs have focused on characteristics of the design in the input space, ensuring uniform spread throughout the region. Input-response space-filling designs consider scenarios in which good spread throughout the range or region of the responses is also of interest. This paper acknowledges that there is typically uncertainty associated with the values of the response(s) and hence proposes a method, Input-Response Space-Filling Designs with Uncertainty (IRSFwU), to incorporate this into the design construction. The Pareto front of designs offers alternatives that balance input and response space filling, while prioritizing input combinations with lower associated response uncertainty. These lower-uncertainty choices improve the chances of observing the desired response values. We describe the new approach with an uncertainty-adjusted distance to measure the response space filling, along with the Pareto aggregate point exchange algorithm to populate the set of promising designs, and illustrate the method with three examples of different input and response relationships and dimensions.
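The IRSFwU criterion and the Pareto aggregate point exchange algorithm are not reproduced here; the toy sketch below scores random candidate designs by maximin distance in the input space and in a response space whose coordinates are shrunk by a hypothetical uncertainty penalty, and keeps the Pareto-nondominated candidates.

```python
# Toy sketch (not the paper's IRSFwU criterion): score random candidate designs by
# (i) maximin distance in the input space and (ii) maximin distance in the
# predicted-response space, where response coordinates are down-weighted by an
# illustrative uncertainty penalty; keep the Pareto-nondominated designs.
import numpy as np

rng = np.random.default_rng(3)

def predict(X):                    # stand-in response surface and its uncertainty
    mean = np.sin(3 * X[:, 0]) + X[:, 1] ** 2
    sd = 0.1 + 0.4 * X[:, 0]       # uncertainty grows with the first input (illustrative)
    return mean, sd

def maximin(D):                    # smallest pairwise distance within a point set
    d = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
    return d[np.triu_indices(len(D), k=1)].min()

def scores(X):
    mean, sd = predict(X)
    resp = (mean / (1.0 + sd))[:, None]   # uncertainty-adjusted response coordinate
    return maximin(X), maximin(resp)

designs = [rng.uniform(0, 1, size=(12, 2)) for _ in range(300)]
vals = np.array([scores(D) for D in designs])

pareto = []
for i, v in enumerate(vals):
    dominated = np.any(np.all(vals >= v, axis=1) & np.any(vals > v, axis=1))
    if not dominated:
        pareto.append(i)
print("number of Pareto-nondominated candidate designs:", len(pareto))
```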
Citations: 0
Driving mode analysis—How uncertain functional inputs propagate to an output
CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-10-06 · DOI: 10.1002/sam.11646
Scott A. Vander Wiel, Michael J. Grosskopf, Isaac J. Michaud, Denise Neudecker
Driving mode analysis elucidates how correlated features of uncertain functional inputs jointly propagate to produce uncertainty in the output of a computation. Uncertain input functions are decomposed into three terms: the mean functions, a zero-mean driving mode, and a zero-mean residual. The random driving mode varies along a single direction, having fixed functional shape and random scale. It is uncorrelated with the residual, and under linear error propagation, it produces an output variance equal to that of the full input uncertainty. Finally, the driving mode best represents how input uncertainties propagate to the output because it minimizes expected squared Mahalanobis distance amongst competitors. These characteristics recommend interpretation of the driving mode as the single-degree-of-freedom component of input uncertainty that drives output uncertainty. We derive the functional driving mode, show its superiority to other seemingly sensible definitions, and demonstrate the utility of driving mode analysis in an application. The application is the simulation of neutron transport in criticality experiments. The uncertain input functions are nuclear data that describe how Pu reacts to bombardment by neutrons. Visualization of the driving mode helps scientists understand what aspects of correlated functional uncertainty have effects that either reinforce or cancel one another in propagating to the output of the simulation.
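The following numerical sketch checks one rank-one construction that is consistent with the properties listed above for the linear-propagation case (shape proportional to the input covariance times the sensitivity vector, unit-variance random scale); the paper's exact definition may differ in detail.

```python
# Numerical check of a rank-one "driving mode" construction consistent with the
# description above: for linear propagation y = g @ x with input covariance S,
# take shape phi = S @ g / sigma_y and random scale a = g @ (x - mu) / sigma_y.
# Then a*phi varies along one direction, is uncorrelated with the residual, and
# alone reproduces the output variance g @ S @ g.
import numpy as np

rng = np.random.default_rng(4)
p = 20                                                 # grid points of the functional input
t = np.linspace(0, 1, p)
S = np.exp(-((t[:, None] - t[None, :]) / 0.2) ** 2) + 1e-8 * np.eye(p)  # smooth covariance
mu = np.sin(2 * np.pi * t)                             # mean function
g = np.cos(3 * t)                                      # assumed linear sensitivity

X = rng.multivariate_normal(mu, S, size=20000)         # uncertain input functions
y = X @ g                                              # linearly propagated output

sigma_y = np.sqrt(g @ S @ g)
phi = S @ g / sigma_y                                  # fixed driving-mode shape
a = (X - mu) @ g / sigma_y                             # random scale, variance ~1
driving = np.outer(a, phi)                             # rank-one driving component
residual = X - mu - driving

print("Var(y)                    :", y.var().round(3))
print("Var(g @ driving mode)     :", (driving @ g).var().round(3))
print("max |corr(a, residual_j)| :",
      np.abs([np.corrcoef(a, residual[:, j])[0, 1] for j in range(p)]).max().round(3))
```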
Citations: 0
Residuals and diagnostics for multinomial regression models
CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-09-29 · DOI: 10.1002/sam.11645
Eric A. E. Gerber, Bruce A. Craig
In this paper, we extend the concept of a randomized quantile residual to multinomial regression models. Customary diagnostics for these models are limited because they involve difficult-to-interpret residuals and often focus on the fit of one category versus the rest. Our residuals account for associations between categories by using the squared Mahalanobis distances of the observed log-odds relative to their fitted sampling distributions. Aside from sampling variation, these residuals are exactly normal when the data come from the fitted model. This motivates our use of the residuals to detect model misspecification and overdispersion, in addition to an overall goodness-of-fit Kolmogorov–Smirnov test. We illustrate the use of the residuals and diagnostics in both simulation and real data studies.
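One way to operationalize the described residual is sketched below as a Monte Carlo stand-in (Haldane-corrected log-odds, simulated sampling distributions, randomized rank mapped through the normal quantile function); the paper's exact construction may differ.

```python
# Monte Carlo stand-in for the described residual: for each grouped multinomial
# observation, simulate the fitted sampling distribution of the baseline-category
# log-odds, take the squared Mahalanobis distance of the observed log-odds, and map
# its randomized rank through the normal quantile function. Under a correct model
# the residuals should look approximately standard normal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
K, n_trials, n_obs, M = 3, 200, 300, 400      # categories, trials, observations, sims

def log_odds(counts):
    c = counts + 0.5                           # Haldane-style correction (illustrative)
    return np.log(c[..., :-1] / c[..., -1:])   # log-odds vs. last category

residuals = []
for _ in range(n_obs):
    p = rng.dirichlet(5 * np.ones(K))          # "fitted" probabilities for this observation
    obs = rng.multinomial(n_trials, p)         # data generated from the fitted model
    sims = rng.multinomial(n_trials, p, size=M)
    lo_sims, lo_obs = log_odds(sims), log_odds(obs)
    mean, cov = lo_sims.mean(axis=0), np.cov(lo_sims, rowvar=False)

    def maha(lo):
        d = lo - mean
        return d @ np.linalg.solve(cov, d)

    d_obs = maha(lo_obs)
    d_sims = np.array([maha(lo) for lo in lo_sims])
    u = (np.sum(d_sims < d_obs) + rng.uniform()) / (M + 1)   # randomized rank
    residuals.append(norm.ppf(u))

residuals = np.array(residuals)
print("mean, sd of residuals:", residuals.mean().round(2), residuals.std().round(2))
```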
Citations: 0
Stratified learning: A general-purpose statistical method for improved learning under covariate shift
CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-09-29 · DOI: 10.1002/sam.11643
Maximilian Autenrieth, David A. Van Dyk, Roberto Trotta, David C. Stenning
We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We refer to the overall method as Stratified Learning, or StratLearn. We demonstrate the effectiveness of this general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best-reported AUC (0.958) on the updated “Supernovae photometric classification challenge,” and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
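A minimal sketch of the stratification idea under covariate shift, with illustrative choices throughout (logistic propensity model, five quantile strata, random forest regressors) rather than the authors' exact pipeline.

```python
# Sketch of learning within propensity-score strata under covariate shift:
# estimate the probability of belonging to the source (labeled) set, cut the pooled
# scores into quantile strata, fit one learner per stratum on source data, and
# predict target points with their stratum's learner.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)

def make_data(n, shift):
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)
    return X, y

X_src, y_src = make_data(2000, shift=0.0)      # labeled training (source) set
X_tgt, y_tgt = make_data(2000, shift=1.0)      # covariate-shifted target set

# 1. propensity scores: probability of belonging to the source set
X_all = np.vstack([X_src, X_tgt])
z = np.r_[np.ones(len(X_src)), np.zeros(len(X_tgt))]
prop = LogisticRegression(max_iter=1000).fit(X_all, z).predict_proba(X_all)[:, 1]
prop_src, prop_tgt = prop[: len(X_src)], prop[len(X_src):]

# 2. strata from quantiles of the pooled propensity scores
edges = np.quantile(prop, [0.2, 0.4, 0.6, 0.8])
s_src, s_tgt = np.digitize(prop_src, edges), np.digitize(prop_tgt, edges)

# 3. fit one learner per stratum on source data; keep a global model as fallback
global_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_src, y_src)
pred = global_model.predict(X_tgt)
for s in range(5):
    in_src, in_tgt = s_src == s, s_tgt == s
    if in_src.sum() < 20 or in_tgt.sum() == 0:
        continue                               # keep the fallback in sparse strata
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_src[in_src], y_src[in_src])
    pred[in_tgt] = model.predict(X_tgt[in_tgt])

print("global-model RMSE   :", np.sqrt(np.mean((global_model.predict(X_tgt) - y_tgt) ** 2)).round(3))
print("stratified RMSE     :", np.sqrt(np.mean((pred - y_tgt) ** 2)).round(3))
```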
Citations: 0
On difference-based gradient estimation in nonparametric regression
CAS Zone 4 (Mathematics) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2023-09-16 · DOI: 10.1002/sam.11644
Maoyu Zhang, Wenlin Dai
We propose a framework to directly estimate the gradient in multivariate nonparametric regression models that bypasses fitting the regression function. Specifically, we construct the estimator as a linear combination of adjacent observations with the coefficients from a vector-valued difference sequence, so it is more flexible than existing methods. Under equidistant designs, closed-form solutions for the optimal sequences are derived by minimizing the estimation variance, with the estimation bias well controlled. We derive the theoretical properties of the estimators and show that they achieve the optimal convergence rate. Further, we propose a data-driven tuning parameter-selection criterion for practical implementation. The effectiveness of our estimators is validated via simulation studies and a real data application.
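The paper's variance-optimal, vector-valued difference sequences are not derived here; the sketch below only shows the form of the estimator on an equidistant design, using simple symmetric difference weights as a stand-in.

```python
# Simple sketch of a difference-based gradient estimate on an equidistant design:
# the derivative at x_i is a linear combination of neighboring observations with
# weights from a difference sequence. Symmetric weights proportional to the offset
# are used here as a stand-in for the paper's optimal sequences; they satisfy
# sum_j 2*j*w_j = 1, so the estimate is unbiased to first order.
import numpy as np

rng = np.random.default_rng(7)
n = 400
x = np.linspace(0, 1, n)
h = x[1] - x[0]
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=n)

k = 4                                            # neighbors used on each side
j = np.arange(1, k + 1)
w = j / (2 * np.sum(j ** 2))                     # symmetric difference weights

grad = np.full(n, np.nan)
for i in range(k, n - k):
    grad[i] = np.sum(w * (y[i + j] - y[i - j])) / h

true_grad = 2 * np.pi * np.cos(2 * np.pi * x)
interior = slice(k, n - k)
rmse = np.sqrt(np.nanmean((grad[interior] - true_grad[interior]) ** 2))
print("interior RMSE of the gradient estimate:", rmse.round(3))
```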
摘要提出了一种直接估计多元非参数回归模型梯度的框架,绕过回归函数的拟合。具体来说,我们将估计量构造为相邻观测值与来自矢量值差分序列的系数的线性组合,因此它比现有方法更灵活。在等距设计下,通过最小化估计方差得到了最优序列的封闭解,估计偏差得到了很好的控制。我们推导了这些估计量的理论性质,并证明了它们达到了最优收敛速率。此外,我们提出了一个数据驱动的调优参数选择标准,用于实际实现。通过仿真研究和实际数据应用验证了估计器的有效性。
{"title":"On difference‐based gradient estimation in nonparametric regression","authors":"Maoyu Zhang, Wenlin Dai","doi":"10.1002/sam.11644","DOIUrl":"https://doi.org/10.1002/sam.11644","url":null,"abstract":"Abstract We propose a framework to directly estimate the gradient in multivariate nonparametric regression models that bypasses fitting the regression function. Specifically, we construct the estimator as a linear combination of adjacent observations with the coefficients from a vector‐valued difference sequence, so it is more flexible than existing methods. Under the equidistant designs, closed‐form solutions of the optimal sequences are derived by minimizing the estimation variance, with the estimation bias well controlled. We derive the theoretical properties of the estimators and show that they achieve the optimal convergence rate. Further, we propose a data‐driven tuning parameter‐selection criterion for practical implementation. The effectiveness of our estimators is validated via simulation studies and a real data application.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"233 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135308618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0