Latest articles from Statistical Analysis and Data Mining: The ASA Data Science Journal

Sketched Stochastic Dictionary Learning for large‐scale data and application to high‐throughput mass spectrometry
Pub Date : 2021-08-20 DOI: 10.1002/sam.11542
O. Permiakova, T. Burger
Factorization of large data corpora has emerged as an essential technique to extract dictionaries (sets of patterns that are meaningful for sparse encoding). Following this line, we present a novel algorithm based on compressive learning theory. In this framework, the (arbitrarily large) dataset of interest is replaced by a fixed‐size sketch resulting from a random sampling of the data distribution characteristic function. We apply our algorithm to the extraction of chromatographic elution profiles in mass spectrometry data, where it demonstrates its efficiency and interest compared to other related algorithms.
Citations: 2
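The sketching idea in the Permiakova and Burger abstract above (replacing the dataset with a fixed-size sample of its characteristic function) can be illustrated with a minimal sketch. This is not the authors' algorithm: the Gaussian frequency distribution, the function names, and the `scale` parameter are assumptions made for illustration, and the dictionary-learning step that would consume the sketch is omitted.

```python
import cmath
import random


def draw_frequencies(m, dim, scale=1.0, seed=0):
    """Draw m random frequency vectors (Gaussian here, which is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(m)]


def sketch(data, freqs):
    """Empirical characteristic function of `data`, evaluated at each frequency.

    The result has len(freqs) complex entries regardless of len(data).
    """
    n = len(data)
    out = []
    for w in freqs:
        total = 0j
        for x in data:
            phase = sum(wi * xi for wi, xi in zip(w, x))
            total += cmath.exp(1j * phase)
        out.append(total / n)
    return out


def merge_sketches(s1, n1, s2, n2):
    """Combine the sketches of two data chunks without revisiting the data."""
    return [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(s1, s2)]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.random() for _ in range(3)] for _ in range(1000)]
    freqs = draw_frequencies(16, 3)
    z = sketch(data, freqs)
    print(len(z))  # sketch size equals the number of frequencies
```

Because the sketch is an average, sketches of separate chunks can be merged by a weighted mean, which is one reason this representation suits arbitrarily large or streamed datasets.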
Weighted validation of heteroscedastic regression models for better selection
Pub Date : 2021-08-17 DOI: 10.1002/sam.11544
Yoonsuh Jung, Hayoung Kim
In this paper, we suggest a method for improving model selection in the presence of heteroscedasticity. For this purpose, we measure the heteroscedasticity in the data using the inter‐quartile range (IQR) of the fitted values under the framework of cross‐validation. To find the IQR, we fit generic quantile regression models at the 0.25 and 0.75 quantiles using the training data. The two models then predict the values of the response variable at the 0.25 and 0.75 quantiles in the test data, which yields the predicted IQR. To reduce the effect of heteroscedastic data on model selection, we propose to use a weighted prediction error. The inverse of the predicted IQR is used to estimate the weights. The proposed method reduces the impact of large prediction errors via weighted prediction and leads to better model and parameter selection. The benefits of the proposed method are demonstrated in simulations and with two real data sets.
Citations: 0
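The weighting step in the Jung and Kim abstract above can be sketched as follows. The abstract states only that the inverse of the predicted IQR is used to estimate the weights. The squared-error loss, the normalization by the total weight, the `eps` guard, and the function name are assumptions for illustration, and the quantile predictions are taken as given inputs rather than fitted here.

```python
def weighted_prediction_error(y_true, y_pred, q25_pred, q75_pred, eps=1e-8):
    """Weighted mean squared error with weights proportional to 1 / predicted IQR.

    q25_pred and q75_pred are the predicted 0.25 and 0.75 quantiles for each
    test observation (assumed to come from two fitted quantile regressions).
    """
    weights = []
    for lo, hi in zip(q25_pred, q75_pred):
        iqr = max(hi - lo, eps)  # guard against zero or crossed quantiles
        weights.append(1.0 / iqr)
    total_w = sum(weights)
    weighted_sq = sum(w * (y - f) ** 2 for w, y, f in zip(weights, y_true, y_pred))
    return weighted_sq / total_w


if __name__ == "__main__":
    y_true = [1.0, 2.0, 10.0]
    y_pred = [1.0, 2.0, 4.0]
    # The third observation sits in a high-spread region (predicted IQR of 9),
    # so its large error is down-weighted relative to the other two.
    print(weighted_prediction_error(y_true, y_pred, [0, 0, 0], [1, 1, 9]))
```

With equal predicted IQRs the function reduces to the ordinary mean squared error, so the weighting only changes model rankings when the spread varies across observations.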
Modal linear regression models with multiplicative distortion measurement errors
Pub Date : 2021-08-10 DOI: 10.1002/sam.11541
Jun Zhang, Gaorong Li, Yiping Yang
We consider modal linear regression models when neither the response variable nor the covariates can be directly observed, but are measured with multiplicative distortion measurement errors. Four calibration procedures are used to estimate parameters in the modal linear regression models, namely, conditional mean calibration, conditional absolute mean calibration, conditional variance calibration, and conditional absolute logarithmic calibration. The asymptotic properties of the estimators based on the four calibration procedures are established. Monte Carlo simulation experiments are conducted to examine the performance of the proposed estimators. The proposed estimators are applied to a forest fires dataset as an illustration.
Citations: 10
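The abstract above names conditional mean calibration without defining it. A minimal sketch of the idea, under assumptions not stated in the abstract: the observed value is the true value times a distortion phi(U) of an observed confounder U, with E[phi(U)] = 1 and U independent of the true value. Then E[observed | U] / E[observed] recovers phi(U), and dividing by it undoes the distortion. The Gaussian-kernel Nadaraya-Watson smoother, the bandwidth `h`, and the function names are illustrative choices, not the authors' implementation.

```python
import math


def nw_smoother(u0, u, v, h):
    """Nadaraya-Watson estimate of E[v | U = u0] with a Gaussian kernel."""
    k = [math.exp(-0.5 * ((u0 - ui) / h) ** 2) for ui in u]
    return sum(ki * vi for ki, vi in zip(k, v)) / sum(k)


def conditional_mean_calibration(observed, u, h=0.1):
    """Divide each observed value by the estimated distortion phi(U).

    phi(U) is estimated as E[observed | U] / E[observed]; this assumes
    positive observed values so the smoother stays away from zero.
    """
    mean_obs = sum(observed) / len(observed)
    return [
        obs * mean_obs / nw_smoother(ui, u, observed, h)
        for obs, ui in zip(observed, u)
    ]


if __name__ == "__main__":
    # True value is constant at 2; the distortion phi(u) = 0.5 + u has mean 1.
    u = [(i + 0.5) / 200 for i in range(200)]
    observed = [2.0 * (0.5 + ui) for ui in u]
    calibrated = conditional_mean_calibration(observed, u, h=0.05)
    print(min(calibrated), max(calibrated))
```

The kernel estimate is biased near the ends of the range of U, so calibrated values there are less accurate than in the interior; the bandwidth trades that bias against noise.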
Multivariate Gaussian RBF‐net for smooth function estimation and variable selection
Pub Date : 2021-08-03 DOI: 10.1002/sam.11540
Arkaprava Roy
Neural networks are routinely used for nonparametric regression modeling. The interest in these models is growing with ever‐increasing complexities in modern datasets. With modern technological advancements, the number of predictors frequently exceeds the sample size in many application areas. Thus, selecting important predictors from the huge pool is an extremely important task for judicious inference. This paper proposes a novel flexible class of single‐layer radial basis function (RBF) networks. The proposed architecture can estimate smooth unknown regression functions and also perform variable selection. We primarily focus on the Gaussian RBF‐net due to its attractive properties. The extensions to other choices of RBF are fairly straightforward. The proposed architecture is also shown to be effective in identifying relevant predictors in a low‐dimensional setting using the posterior samples without imposing any sparse estimation scheme. We develop an efficient Markov chain Monte Carlo algorithm to generate posterior samples of the parameters. We illustrate the proposed method's empirical efficacy through simulation experiments, in both high‐ and low‐dimensional regression problems. The posterior contraction rate is established with respect to empirical ℓ2 distance assuming that the error variance is unknown, and the true function belongs to a Hölder ball. We illustrate our method on a Human Connectome Project dataset to predict vocabulary comprehension and to identify important edges of the structural connectome.
Citations: 0
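As a rough illustration of the single-layer Gaussian RBF network discussed in the abstract above, the forward pass of a generic network of this kind can be written in a few lines. The per-coordinate `scales` parametrization and all names are assumptions for illustration; the paper's exact parametrization, priors, and MCMC sampler are not reproduced here.

```python
import math


def rbf_net(x, centers, weights, scales):
    """Generic single-layer Gaussian RBF network.

    Computes sum_k weights[k] * exp(-sum_j (scales[j] * (x[j] - centers[k][j]))**2).
    A scale of zero on coordinate j makes the output ignore x[j] entirely,
    which is one simple way variable selection can be expressed.
    """
    out = 0.0
    for c, w in zip(centers, weights):
        sq = sum((s * (xj - cj)) ** 2 for s, xj, cj in zip(scales, x, c))
        out += w * math.exp(-sq)
    return out


if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]
    weights = [2.0, -1.0]
    scales = [1.0, 0.0]  # second predictor switched off
    print(rbf_net([0.5, 3.0], centers, weights, scales))
    print(rbf_net([0.5, -7.0], centers, weights, scales))  # same value as above
```

The two calls print the same number because the second coordinate has scale zero, so only the first predictor influences the fitted function.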
Negative binomial graphical model with excess zeros
Pub Date : 2021-07-21 DOI: 10.1002/sam.11536
Beomjin Park, Hosik Choi, Changyi Park
Markov random field or undirected graphical models (GM) are a popular class of GM useful in various fields because they provide an intuitive and interpretable graph expressing the complex relationship between random variables. The zero‐inflated local Poisson graphical model has been proposed as a graphical model for count data with excess zeros. However, as count data are often characterized by over‐dispersion, the local Poisson graphical model may suffer from a poor fit to data. In this paper, we propose a zero‐inflated local negative binomial (NB) graphical model. Due to the dependencies of parameters in our models, a direct optimization of the objective function is difficult. Instead, we devise expectation‐minimization algorithms based on two different parametrizations for the NB distribution. Through a simulation study, we illustrate the effectiveness of our method for learning network structure from over‐dispersed count data with excess zeros. We further apply our method to real data to estimate its network structure.
Citations: 2
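To make the building block of the abstract above concrete, here is a minimal sketch of a zero-inflated negative binomial probability mass function. The abstract mentions two parametrizations of the NB distribution without naming them; the mean/dispersion form used here (mean `mu`, size `r`, variance `mu + mu**2 / r`) is an assumption, as are the function names. The graphical-model structure and the fitting algorithm are not shown.

```python
import math


def nb_pmf(y, mu, r):
    """Negative binomial pmf with mean mu > 0 and size (dispersion) r > 0."""
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, r, pi):
    """Zero-inflated NB: with probability pi emit a structural zero,
    otherwise draw from the NB distribution."""
    base = (1.0 - pi) * nb_pmf(y, mu, r)
    return pi + base if y == 0 else base


if __name__ == "__main__":
    # Zero inflation raises the probability of observing zero.
    print(nb_pmf(0, 3.0, 2.0), zinb_pmf(0, 3.0, 2.0, 0.3))
```

Unlike the Poisson case, the variance exceeds the mean for any finite `r`, which is what lets this family absorb the over-dispersion the abstract refers to.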
Evaluation and interpretation of driving risks: Automobile claim frequency modeling with telematics data
Pub Date : 2021-07-20 DOI: 10.2139/ssrn.3910216
Yaqian Gao, Yifan Huang, Shengwang Meng
With the development of vehicle telematics and data mining technology, usage‐based insurance (UBI) has aroused widespread interest from both academia and industry. The extensive driving behavior features make it possible to further understand the risks of insured vehicles, but pose challenges in the identification and interpretation of important ratemaking factors. This study, based on the telematics data of policyholders in China's mainland, analyzes insurance claim frequency of commercial trucks using both Poisson regression and several machine learning models, including regression tree, random forest, gradient boosting tree, XGBoost and neural network. After selecting the best model, we analyze feature importance, feature effects and the contribution of each feature to the prediction from an actuarial perspective. Our empirical study shows that XGBoost greatly outperforms the traditional models and detects some important risk factors, such as the average speed, the average mileage traveled per day, the fraction of night driving, the number of sudden brakes and the fraction of left/right turns at intersections. These features usually have a nonlinear effect on driving risk, and there are complex interactions between features. To further distinguish high‐/low‐risk drivers, we run supervised clustering for risk segmentation according to drivers' driving habits. In summary, this study not only provides a more accurate prediction of driving risk, but also greatly satisfies the interpretability requirements of insurance regulators and risk management.
Citations: 0
Power grid frequency prediction using spatiotemporal modeling
Pub Date : 2021-07-06 DOI: 10.1002/sam.11535
Amanda Lenzi, J. Bessac, M. Anitescu
Understanding power system dynamics is essential for interarea oscillation analysis and the detection of grid instabilities. The FNET/GridEye is a GPS‐synchronized wide‐area frequency measurement network that provides an accurate picture of the normal real‐time operational condition of the power system dynamics, giving rise to new and intricate spatiotemporal patterns of power loads. We propose to model FNET/GridEye grid frequency data from the U.S. Eastern Interconnection with a spatiotemporal statistical model. We predict the frequency data at locations without observations, a critical need during disruption events where measurement data are inaccessible. Spatial information is accounted for either as neighboring measurements in the form of covariates or with a spatiotemporal correlation model captured by a latent Gaussian field. The proposed method is useful in estimating power system dynamic response from limited phasor measurements and holds promise for predicting instability that may lead to undesirable effects such as cascading outages.
Citations: 7
Analyzing relevance vector machines using a single penalty approach
Pub Date : 2021-07-05 DOI: 10.1002/sam.11551
A. Dixit, Vivekananda Roy
Relevance vector machine (RVM) is a popular sparse Bayesian learning model typically used for prediction. Recently it has been shown that improper priors assumed on multiple penalty parameters in RVM may lead to an improper posterior. Currently in the literature, the sufficient conditions for posterior propriety of RVM do not allow improper priors over the multiple penalty parameters. In this article, we propose a single penalty relevance vector machine (SPRVM) model in which multiple penalty parameters are replaced by a single penalty and we consider a semi‐Bayesian approach for fitting the SPRVM. The necessary and sufficient conditions for posterior propriety of SPRVM are more liberal than those of RVM and allow for several improper priors over the penalty parameter. Additionally, we also prove the geometric ergodicity of the Gibbs sampler used to analyze the SPRVM model and hence can estimate the asymptotic standard errors associated with the Monte Carlo estimate of the means of the posterior predictive distribution. Such a Monte Carlo standard error cannot be computed in the case of RVM, since the rate of convergence of the Gibbs sampler used to analyze RVM is not known. The predictive performance of RVM and SPRVM is compared by analyzing two simulation examples and three real life datasets.
Citations: 0
Coefficient tree regression for generalized linear models
Pub Date : 2021-07-02 DOI: 10.1002/sam.11534
Özge Sürer, D. Apley, E. Malthouse
Large regression data sets are now commonplace, with so many predictors that they cannot or should not all be included individually. In practice, derived predictors are relevant as meaningful features or, at the very least, as a form of regularized approximation of the true coefficients. We consider derived predictors that are the sum of some groups of individual predictors, which is equivalent to predictors within a group sharing the same coefficient. However, the groups of predictors are usually not known in advance and must be discovered from the data. In this paper we develop a coefficient tree regression algorithm for generalized linear models to discover the group structure from the data. The approach results in simple and highly interpretable models, and we demonstrate with real examples that it can provide a clear and concise interpretation of the data. Via simulation studies under different scenarios we show that our approach performs better than existing competitors in terms of computing time and predictive accuracy.
Citations: 2
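The equivalence stated in the abstract above (a derived predictor that sums a group is the same as giving every predictor in the group one shared coefficient) can be checked directly. The sketch below assumes the groups are already known; discovering them from data is the job of the paper's algorithm and is not attempted here. All function names are hypothetical.

```python
def group_sums(row, groups):
    """Derived predictors: one sum per group of column indices."""
    return [sum(row[j] for j in g) for g in groups]


def linear_predictor(row, coefs, intercept=0.0):
    """Linear predictor intercept + sum_j coefs[j] * row[j]."""
    return intercept + sum(b * x for b, x in zip(coefs, row))


def expand_group_coefs(group_coefs, groups, n_predictors):
    """Per-predictor coefficients in which group members share one value."""
    full = [0.0] * n_predictors
    for b, g in zip(group_coefs, groups):
        for j in g:
            full[j] = b
    return full


if __name__ == "__main__":
    row = [1.0, 2.0, 3.0, 4.0, 5.0]
    groups = [[0, 1], [2, 3, 4]]
    group_coefs = [0.5, -1.0]
    via_sums = linear_predictor(group_sums(row, groups), group_coefs)
    via_shared = linear_predictor(row, expand_group_coefs(group_coefs, groups, 5))
    print(via_sums, via_shared)  # both forms give the same linear predictor
```

In a generalized linear model the linear predictor is passed through a link function, so the same equivalence carries over unchanged: two coefficients describe five predictors here.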
Fourier neural networks as function approximators and differential equation solvers
Pub Date : 2021-06-22 DOI: 10.1002/sam.11531
M. Ngom, O. Marin
We present a Fourier neural network (FNN) that can be mapped directly to the Fourier decomposition. The choice of activation and loss function yields results that replicate a Fourier series expansion closely while preserving a straightforward architecture with a single hidden layer. The simplicity of this network architecture facilitates the integration with any other higher‐complexity network, at a data pre‐ or postprocessing stage. We validate this FNN on naturally periodic smooth functions and on piecewise continuous periodic functions. We showcase the use of this FNN for modeling or solving partial differential equations with periodic boundary conditions. The main advantages of the current approach are the validity of the solution outside the training region, interpretability of the trained model, and simplicity of use.
Citations: 14