Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in the analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow a heterogeneous sparsity structure, in which a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach that not only aggregates important signals well under a homogeneous sparsity structure, but also substantially alleviates the loss of weak important signals under a heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances power while also accounting for heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method through a simulation study and an analysis of gene expression data from ADNI.
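To make the idea concrete, below is a minimal sketch (not the paper's actual estimator) of graph-guided joint selection across K datasets: a proximal-gradient solver combining a row-wise group-lasso penalty, which couples each feature's coefficients across datasets, with a Laplacian quadratic penalty, which encourages graph-connected features to be selected together. The function name, penalty form, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def graph_guided_group_lasso(Xs, ys, Lap, lam1=0.1, lam2=0.1, n_iter=500):
    """Illustrative sketch, not the paper's method.

    Minimizes  sum_k ||y_k - X_k b_k||^2 / (2 n_k) + lam2 * b_k' Lap b_k
               + lam1 * sum_j ||B[j, :]||_2,
    where B[:, k] = b_k and Lap is the Laplacian of the known feature graph.
    The row-wise group penalty couples selection across the K datasets;
    the Laplacian term pulls coefficients of connected features together.
    """
    K, p = len(Xs), Xs[0].shape[1]
    B = np.zeros((p, K))
    # conservative step size from spectral norms (Lipschitz bound)
    lip = max(np.linalg.norm(X, 2) ** 2 / len(y) for X, y in zip(Xs, ys))
    step = 1.0 / (lip + 2.0 * lam2 * np.linalg.norm(Lap, 2))
    for _ in range(n_iter):
        G = np.empty_like(B)
        for k, (X, y) in enumerate(zip(Xs, ys)):
            # gradient of the smooth part for dataset k
            G[:, k] = X.T @ (X @ B[:, k] - y) / len(y) + 2.0 * lam2 * (Lap @ B[:, k])
        B -= step * G
        # group soft-threshold each feature's row: joint selection across datasets
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        B *= np.maximum(0.0, 1.0 - step * lam1 / np.maximum(norms, 1e-12))
    return B
```

Under this penalty, a homogeneous sparsity structure corresponds to entire rows of B being zero or nonzero; modeling heterogeneity would require an additional within-row sparsity term, e.g., a sparse group lasso.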
A challenge unique to classification model development is imbalanced data. In a binary classification problem, class imbalance occurs when one class, the minority group, contains significantly fewer samples than the other class, the majority group. In imbalanced data, the minority class is often the class of interest (e.g., patients with disease). However, a classifier trained on imbalanced data will exhibit bias towards the majority class and, in extreme cases, may ignore the minority class entirely. A common strategy for addressing class imbalance is data augmentation; however, traditional data augmentation methods are prone to overfitting, where the model is fit to the noise in the data. In this tutorial, we introduce an advanced method for data augmentation: Generative Adversarial Networks (GANs). The advantages of GANs over traditional data augmentation methods are illustrated using the Breast Cancer Wisconsin study. To promote the adoption of GANs for data augmentation, we present an end-to-end pipeline that encompasses the complete life cycle of a machine learning project, along with alternatives and good practices, both in the paper and in a separate video. Our code, data, full results, and video tutorial are publicly available in the paper's GitHub repository.
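As a concrete illustration (a minimal sketch, not the tutorial's full pipeline), the following PyTorch code trains a vanilla GAN on the minority-class rows of a tabular dataset and then samples synthetic minority examples. The variable X_min, the network sizes, and all hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

def make_mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers[:-1])  # drop the final activation

def train_gan(X_min, noise_dim=16, epochs=2000, batch=64, lr=2e-4):
    """X_min: float tensor of minority-class feature rows (n, p), scaled to [-1, 1]."""
    p = X_min.shape[1]
    G = nn.Sequential(make_mlp([noise_dim, 64, 64, p]), nn.Tanh())
    D = make_mlp([p, 64, 64, 1])  # outputs a logit
    loss = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(epochs):
        real = X_min[torch.randint(len(X_min), (batch,))]
        fake = G(torch.randn(batch, noise_dim))
        # discriminator step: push real -> 1, fake -> 0
        opt_d.zero_grad()
        d_loss = loss(D(real), torch.ones(batch, 1)) + \
                 loss(D(fake.detach()), torch.zeros(batch, 1))
        d_loss.backward()
        opt_d.step()
        # generator step: try to fool the discriminator
        opt_g.zero_grad()
        g_loss = loss(D(fake), torch.ones(batch, 1))
        g_loss.backward()
        opt_g.step()
    return G

# usage: augment the minority class before fitting the classifier
# G = train_gan(X_min); X_synth = G(torch.randn(500, 16)).detach()
```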
A nonparanormal graphical model is a semiparametric generalization of a Gaussian graphical model for continuous variables, in which the variables are assumed to follow a Gaussian graphical model only after some unknown smooth monotone transformations. We consider a Bayesian approach to inference in a nonparanormal graphical model, placing priors on the unknown transformations through a random series based on B-splines. We use a regression formulation to construct the likelihood through the Cholesky decomposition of the underlying precision matrix of the transformed variables and put shrinkage priors on the regression coefficients. We apply a plug-in variational Bayesian algorithm for learning the sparse precision matrix and compare its performance to a posterior Gibbs sampling scheme in a simulation study. Finally, we apply the proposed methods to a microarray dataset. The proposed methods perform better as the dimension increases; in particular, the variational Bayesian approach can speed up estimation in the Bayesian nonparanormal graphical model, dispensing with the Gaussianity assumption while retaining the information needed to construct the graph.
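For intuition, a standard frequentist counterpart of the nonparanormal (the normal-scores transform of Liu et al., followed by a graphical lasso) can be sketched as below. This is a baseline illustration, not the Bayesian B-spline/Cholesky method described above; the Winsorization constant is one conventional choice, and the stand-in data are purely illustrative.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLasso

def normal_scores(X):
    """Map each column through an estimated monotone transform to marginal normality."""
    n = X.shape[0]
    delta = 1.0 / (4.0 * n ** 0.25)          # Winsorization level, one common choice
    U = rankdata(X, axis=0) / (n + 1.0)      # empirical CDF values in (0, 1)
    Z = norm.ppf(np.clip(U, delta, 1 - delta))
    return Z / Z.std(axis=0)                 # rescale to unit variance

rng = np.random.default_rng(0)
X = np.exp(rng.standard_normal((200, 10)))   # stand-in data with skewed marginals
Z = normal_scores(X)                          # transformed, approximately Gaussian
prec = GraphicalLasso(alpha=0.1).fit(Z).precision_
graph = np.abs(prec) > 1e-4                   # nonzero pattern = estimated edges
```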
The quality of a cluster analysis of unlabeled units depends on the quality of the between-unit dissimilarity measures. Data-dependent dissimilarity is more objective than data-independent geometric measures such as Euclidean distance. As suggested by Breiman, many data-driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. Such an RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program, and units are assigned labels corresponding to cluster membership. We introduce a General Iterative Cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF: the cluster labels are used to grow a new RF, yielding an updated proximity matrix that is entered into the clustering program, and the process is repeated until convergence. The same procedure can be used with many base procedures, such as the Extremely Randomized Trees ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated datasets. As measured by the silhouette score, the resulting clusters are substantially superior to those of the base clustering algorithm. The GIC package has been released in R: https://cran.r-project.org/web/packages/GIC/index.html.
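The loop is easy to express in Python (a sketch of the idea, not the released GIC R package; scikit-learn ≥ 1.2 is assumed, and the naive label-equality stopping rule ignores possible label permutations between rounds):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AgglomerativeClustering

def rf_proximity(rf, X):
    """Fraction of trees in which two units land in the same leaf."""
    leaves = rf.apply(X)                                 # (n, n_trees) leaf indices
    prox = sum((leaves[:, t:t+1] == leaves[:, t:t+1].T).astype(float)
               for t in range(leaves.shape[1]))
    return prox / leaves.shape[1]

def gic(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # base RF: real units vs. synthetic units (each column permuted independently)
    X_syn = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
    y0 = np.r_[np.ones(len(X)), np.zeros(len(X))]
    rf = RandomForestClassifier(n_estimators=500,
                                random_state=seed).fit(np.vstack([X, X_syn]), y0)
    labels = None
    for _ in range(n_iter):
        D = np.sqrt(1.0 - rf_proximity(rf, X))           # dissimilarity from proximity
        new = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                      linkage="average").fit_predict(D)
        if labels is not None and np.array_equal(new, labels):
            break                                        # converged
        labels = new
        # grow a new RF on the cluster labels; its proximity feeds the next round
        rf = RandomForestClassifier(n_estimators=500, random_state=seed).fit(X, labels)
    return labels
```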
Many machine learning algorithms depend on weights that quantify row and column similarities of a data matrix. The choice of weights can dramatically impact the effectiveness of the algorithm; nonetheless, the problem of choosing weights has arguably not been given enough study. When a data matrix is completely observed, Gaussian kernel affinities can be used to quantify the local similarity between pairs of rows and pairs of columns. Computing weights in the presence of missing data, however, becomes challenging. In this paper, we propose a new method to construct row and column affinities even when data are missing, building on a co-clustering technique. The method solves the co-clustering optimization problem for multiple pairs of cost parameters, filling in the missing values with increasingly smooth estimates, and it exploits the coupled similarity structure among both the rows and the columns of the data matrix. We show that these affinities can be used to perform tasks such as data imputation, clustering, and matrix completion on graphs.
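In spirit (a simplified sketch, not the paper's co-clustering-based construction), one can alternate between Gaussian-kernel row/column affinities and affinity-weighted smoothing of the missing entries; the bandwidths, initialization, and update rule here are illustrative assumptions.

```python
import numpy as np

def gaussian_affinity(Z, sigma):
    """Pairwise Gaussian kernel affinities between rows of Z (O(n^2 p) memory)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def coupled_affinities(X, observed, sigma_r=1.0, sigma_c=1.0, n_iter=10):
    """X: (n, p) data matrix; observed: boolean mask of non-missing entries."""
    Xf = X.copy()
    col_means = np.where(observed, X, 0).sum(0) / np.maximum(observed.sum(0), 1)
    Xf[~observed] = np.take(col_means, np.where(~observed)[1])   # crude start
    for _ in range(n_iter):
        W_r = gaussian_affinity(Xf, sigma_r)        # row affinities
        W_c = gaussian_affinity(Xf.T, sigma_c)      # column affinities
        P_r = W_r / W_r.sum(1, keepdims=True)
        P_c = W_c / W_c.sum(1, keepdims=True)
        S = P_r @ Xf @ P_c.T                        # doubly smoothed matrix
        Xf[~observed] = S[~observed]                # refresh only missing entries
    return W_r, W_c, Xf
```

The returned affinities W_r and W_c can then feed graph-based imputation, clustering, or matrix completion, echoing the tasks listed above.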