Statistical Analysis and Data Mining: Latest Articles

Imputed quantile vector autoregressive model for multivariate spatial–temporal data
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-25 DOI: 10.1002/sam.11658
Liang Jinwen, Tian Maozai
Imputing missing values in multivariate spatial–temporal data is important in many fields. Existing low-rank tensor learning methods are popular for this task but are sensitive to high levels of skewness. The aim of this paper is to develop a robust alternative with high imputation accuracy for multivariate spatial–temporal data. Because quantile regression is robust to noise and outliers, we propose an imputed quantile vector autoregressive (IQVAR) model. IQVAR simultaneously imputes missing values and estimates the parameters of the quantile vector autoregressive model. The objective function combines the check loss with a nuclear norm penalty. We develop an ADMM (Alternating Direction Method of Multipliers) algorithm to solve the resulting optimization problem. Simulation studies and real data analysis are conducted to verify the efficiency of IQVAR. Compared with other approaches, IQVAR is more robust and accurate.
Citations: 0
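The check loss named in the IQVAR abstract above can be illustrated with a small sketch. This is not the authors' implementation: it omits the nuclear norm penalty and the ADMM solver, and the helper names `check_loss`, `total_check_loss`, and `best_constant` as well as the sample data are hypothetical. It only shows the loss rho_tau(u) = u * (tau - 1{u < 0}) and why fitting under it is less affected by an outlier than fitting a mean.

```python
def check_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def total_check_loss(values, c, tau):
    """Sum of check losses of the residuals values[i] - c."""
    return sum(check_loss(v - c, tau) for v in values)


def best_constant(values, tau):
    """Minimize the total check loss over candidates drawn from the data.

    A minimizer of a sum of check losses can always be found among the
    observed values, so scanning them is enough for this toy example.
    """
    return min(values, key=lambda c: total_check_loss(values, c, tau))


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical sample with one outlier
median_fit = best_constant(data, tau=0.5)  # check-loss fit at tau = 0.5
mean_fit = sum(data) / len(data)           # squared-error fit, for contrast
print(median_fit, mean_fit)
```

With tau = 0.5 the check-loss fit is the sample median, which stays near the bulk of the data, while the mean is pulled toward the outlier.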
Nonparametric Bayesian functional clustering with applications to racial disparities in breast cancer
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-25 DOI: 10.1002/sam.11657
Wenyu Gao, Inyoung Kim, Wonil Nam, Xiang Ren, Wei Zhou, Masoud Agah
As massive data sets become easier to access, functional analyses have gained more interest. However, such data sets often exhibit substantial heterogeneity, noise, and high dimensionality. When analyses are generalized from vectors to functions, classical methods might not work directly. This paper considers noisy information reduction in functional analyses from two perspectives: functional clustering, which groups similar observations and thus reduces the sample size, and functional variable selection, which reduces the dimensionality. Because of its flexibility, a Bayesian hierarchical model can readily capture the complicated data structures and relations. Hence, this paper proposes a nonparametric Bayesian functional clustering and peak point selection method via weighted Dirichlet process mixture (WDPM) modeling, which clusters automatically and provides accurate estimates, together with a conditional Laplace prior, which is a conjugate variable selection prior. The proposed method, named WDPM-VS for short, simultaneously performs the following tasks: (1) clusters automatically without specifying the number of clusters or cluster centers beforehand; (2) clusters heterogeneously behaved functions; (3) selects vibrational peak points; and (4) reduces noisy information from the two perspectives of sample size and dimensionality. The method greatly outperforms its comparison methods in root mean squared error. Based on the proposed method, we are able to identify biological factors that can explain racial disparities in breast cancer.
Citations: 0
Study of a bounded interval perks distribution with quantile regression analysis
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-01-25 DOI: 10.1002/sam.11656
Laila A. Al-Essa, Shakaiba Shafiq, Deniz Ozonur, Farrukh Jamal
In this article, a novel bounded interval model called the unit-Perks model is developed by suitably transforming the positive random variable of the Perks distribution. Numerous statistical features of the bounded interval Perks model are explored based on the expansion of the density function. Eight distinct estimation approaches are used to estimate the parameters of the unit-Perks model. A thorough simulation analysis is also included to evaluate the precision of the estimators resulting from the eight approaches. Two real bounded interval data sets are used to investigate the practical applicability of the unit-Perks model. A comparison is also made to determine which estimation method works better for the given model; among the eight approaches, maximum likelihood estimation outperformed the other seven. The unit-Perks model is then used to introduce a quantile regression model, named the quantile unit-Perks distribution. An application of the quantile unit-Perks distribution to a real data set is also presented. Quantile residuals are used for the residual analysis of the fitted regression model. On the basis of mathematical, computational, and graphical evidence, it is concluded that the presented model exhibits greater modeling capabilities.
Citations: 0
Boosting diversity in regression ensembles
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-30 DOI: 10.1002/sam.11654
Mathias Bourel, Jairo Cugliari, Yannig Goude, Jean-Michel Poggi
Ensemble methods, such as Bagging, Boosting, or Random Forests, often enhance the prediction performance of single learners on both classification and regression tasks. In the context of regression, we propose a gradient boosting-based algorithm incorporating a diversity term with the aim of constructing different learners that enrich the ensemble while achieving a trade-off of some individual optimality for global enhancement. Verifying the hypotheses of Biau and Cadre's theorem (2021, Advances in contemporary statistics and econometrics—Festschrift in honour of Christine Thomas-Agnan, Springer), we present a convergence result ensuring that the associated optimization strategy reaches the global optimum. In the experiments, we consider a variety of different base learners with increasing complexity: stumps, regression trees, Purely Random Forests, and Breiman's Random Forests. Finally, we consider simulated and benchmark datasets and a real-world electricity demand dataset to show, by means of numerical experiments, the suitability of our procedure by examining the behavior not only of the final or the aggregated predictor but also of the whole generated sequence.
Citations: 0
Multivariate contaminated normal mixture regression modeling of longitudinal data based on joint mean-covariance model
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-22 DOI: 10.1002/sam.11653
Niu Xiaoyu, Tian Yuzhu, Tang Manlai, Tian Maozai
Outliers are common in longitudinal data analysis, and the multivariate contaminated normal (MCN) distribution in model-based clustering is often used to detect outliers and provide robust parameter estimates in each subgroup. In this paper, we propose a method, the mixture of MCN (MCNM), based on the joint mean-covariance model and specifically designed to analyze longitudinal data characterized by mild outliers. Our model can automatically detect outliers in longitudinal data and provide robust parameter estimates in each subgroup. We use an iterative expectation-conditional maximization (ECM) algorithm with Aitken acceleration to estimate the model parameters, achieving both acceleration and stable convergence. Our proposed method simultaneously clusters the population, identifies progression patterns of the mean and covariance structures for different subgroups over time, and detects outliers. To demonstrate the effectiveness of our method, we conduct simulation studies under various cases involving different proportions and degrees of contamination. Additionally, we apply our method to real data on the number of people infected with AIDS in 49 countries or regions from 2001 to 2021. Results show that our proposed method effectively clusters the data based on various mean progression trajectories. In summary, our proposed MCNM based on the joint mean-covariance model and MCD of covariance matrices provides a robust method for clustering longitudinal data with mild outliers. It effectively detects outliers and identifies progression patterns in different groups over time, making it valuable for various applications in longitudinal data analysis.
Citations: 0
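The abstract above does not say exactly how Aitken acceleration enters the ECM iterations. One common Aitken-based device in EM-type algorithms is to extrapolate the limit of the log-likelihood sequence from three successive values and stop once the current value is close to that limit. The sketch below shows only this generic device on a toy sequence; the function names `aitken_limit` and `has_converged`, the tolerance, and the sequence are all hypothetical.

```python
def aitken_limit(l0, l1, l2):
    """Aitken delta-squared estimate of the limit of a sequence.

    l0, l1, l2 are three successive values, e.g. log-likelihoods from
    consecutive EM/ECM iterations (hypothetical inputs here).
    """
    rate = (l2 - l1) / (l1 - l0)  # estimated linear convergence rate
    return l1 + (l2 - l1) / (1.0 - rate)


def has_converged(l0, l1, l2, tol=1e-6):
    """Stop when the current value is within tol of the estimated limit."""
    return abs(aitken_limit(l0, l1, l2) - l2) < tol


# Toy sequence l_k = 10 - 2 * 0.5**k, which converges linearly to 10.
seq = [10.0 - 2.0 * 0.5 ** k for k in range(3)]
estimate = aitken_limit(*seq)
print(seq, estimate)
```

For a sequence that converges exactly linearly, as in this toy case, three values are enough to recover the limit, long before the raw iterates get close to it.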
A machine learning oracle for parameter estimation
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-09 DOI: 10.1002/sam.11651
Lucas Koepke, Mary Gregg, Michael Frey
Competing procedures, involving data smoothing, weighting, imputation, outlier removal, etc., may be available to prepare data for parametric model estimation. Often, however, little is known about the best choice of preparatory procedure for the planned estimation and the observed data. A machine learning-based decision rule, an “oracle,” can be constructed in such cases to decide the best procedure from a set $\mathcal{C}$ of available preparatory procedures. The oracle learns the decision regions associated with $\mathcal{C}$ based on training data synthesized solely from the given data using model parameters with high posterior probability. An estimator in combination with an oracle to guide data preparation is called an oracle estimator. Oracle estimator performance is studied in two estimation problems: slope estimation in simple linear regression (SLR) and changepoint estimation in continuous two-linear-segments regression (CTLSR). In both examples, the regression response is given to be increasing, and the oracle must decide whether to isotonically smooth the response data preparatory to fitting the regression model. A measure of performance called headroom is proposed to assess the oracle's potential for reducing estimation error. Experiments with SLR and CTLSR find for important ranges of problem configurations that the headroom is high, the oracle's empirical performance is near the headroom, and the oracle estimator offers clear benefit.
Citations: 0
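The two candidate preparations in the SLR example above (fit the raw response, or isotonically smooth it first) can be sketched as follows. This is an illustration only: it does not implement the oracle itself, isotonic smoothing is done here with the standard pool-adjacent-violators algorithm, and the function names and data are hypothetical.

```python
def isotonic_smooth(y):
    """Least-squares nondecreasing fit via pool-adjacent-violators (PAVA)."""
    blocks = []  # each block is [sum of values, count]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 2.0, 1.0, 3.0, 4.0]  # hypothetical noisy increasing response
y_smooth = isotonic_smooth(y)
print(y_smooth)
print(ols_slope(x, y), ols_slope(x, y_smooth))
```

The two slope estimates differ, which is the situation in which a rule for choosing between preparations becomes useful.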
The generalized hyperbolic family and automatic model selection through the multiple-choice LASSO
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-12-08 DOI: 10.1002/sam.11652
Luca Bagnato, Alessio Farcomeni, Antonio Punzo
We revisit the generalized hyperbolic (GH) distribution and its nested models. These include widely used parametric choices like the multivariate normal, skew-$t$, Laplace, and several others. We also introduce the multiple-choice LASSO, a novel penalized method for choosing among alternative constraints on the same parameter. A hierarchical multiple-choice Least Absolute Shrinkage and Selection Operator (LASSO) penalized likelihood is optimized to perform simultaneous model selection and inference within the GH family. We illustrate our approach through a simulation study and a real data example. The methodology proposed in this paper has been implemented in R functions which are available as supplementary material.
Citations: 0
Modeling subpopulations for hierarchically structured data
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-22 DOI: 10.1002/sam.11650
Andrew Simpson, Semhar Michael, Dylan Borchert, Christopher Saunders, Larry Tang
The field of forensic statistics offers a unique hierarchical data structure in which a population is composed of several subpopulations of sources and a sample is collected from each source. This subpopulation structure creates an additional layer of complexity. Hence, the data has a hierarchical structure in addition to the existence of underlying subpopulations. Finite mixtures are known for modeling heterogeneity; however, previous parameter estimation procedures assume that the data is generated through a simple random sampling process. We propose using a semi-supervised mixture modeling approach to model the subpopulation structure which leverages the fact that we know the collection of samples came from the same source, yet an unknown subpopulation. A simulation study and a real data analysis based on famous glass datasets and a keystroke dynamic typing data set show that the proposed approach performs better than other approaches that have been used previously in practice.
Citations: 0
Spatially-correlated time series clustering using location-dependent Dirichlet process mixture model
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-22 DOI: 10.1002/sam.11649
Junsub Jung, Sungil Kim, Heeyoung Kim
The Dirichlet process mixture (DPM) model has been widely used as a Bayesian nonparametric model for clustering. However, the exchangeability assumption of the Dirichlet process is not valid for clustering spatially correlated time series, as these data are indexed spatially and temporally. When analyzing spatially correlated time series, correlations between observations at proximal times and locations must be appropriately considered. In this study, we propose a location-dependent DPM model by extending the traditional DPM model for clustering spatially correlated time series. We model the temporal pattern as an infinite mixture of Gaussian processes while considering spatial dependency using a location-dependent Dirichlet process prior over mixture components. This encourages the assignment of observations from proximal locations to the same cluster. By contrast, because mixture atoms for modeling temporal patterns are shared across space, observations with similar temporal patterns can still be grouped together even if they are located far apart. The proposed model also allows the number of clusters to be automatically determined in the clustering procedure. We validate the proposed model using simulated examples. Moreover, in a real case study, we cluster adjacent roads based on their traffic speed patterns, which changed as a result of a traffic accident that occurred in Seoul, South Korea.
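As a rough illustration of how a Dirichlet process prior lets the number of clusters emerge from the data rather than being fixed in advance, the sketch below samples cluster labels from the standard Chinese restaurant process. This is the textbook exchangeable construction, not the location-dependent prior proposed in the paper; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions chosen for illustration.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Sample cluster labels for n observations from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name); larger
    values make opening a new cluster more likely.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of observations currently in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels


labels = crp_assignments(n=50, alpha=1.0)
num_clusters = len(set(labels))  # not fixed in advance; determined by the draw
```

Because each observation joins a cluster with probability proportional to that cluster's size alone, this prior ignores where an observation is located, which is the exchangeability limitation the abstract points to.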
Citations: 0
Input-response space-filling designs incorporating response uncertainty
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-20 DOI: 10.1002/sam.11648
Xiankui Yang, Lu Lu, Christine M. Anderson-Cook
Traditionally, space-filling designs have focused on the characteristics of the design in the input space, ensuring uniform spread throughout the region. Input-response space-filling designs consider scenarios in which good spread throughout the range or region of the responses is also of interest. This paper acknowledges that there is typically uncertainty associated with the values of the response(s) and hence proposes a method, Input-Response Space-Filling Designs with Uncertainty (IRSFwU), to incorporate this into the design construction. The Pareto front of designs offers alternatives that balance input and response space filling, while prioritizing input combinations with lower associated response uncertainty. These lower uncertainty choices improve the chances of observing the desired response values. We describe the new approach, which uses an uncertainty-adjusted distance to measure response space filling and the Pareto aggregate point exchange algorithm to populate the set of promising designs, and illustrate the method with three examples of different input and response relationships and dimensions.
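To make the idea of a Pareto front of designs concrete, here is a minimal sketch of a generic non-dominated filter over candidate designs scored on two criteria. It is not the Pareto aggregate point exchange algorithm from the paper; the function name `pareto_front` and the two scores per candidate (a hypothetical input-space fill score and a hypothetical response-space fill score, both larger-is-better) are assumptions for illustration.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point q dominates p if q is at least as good on every criterion
    and strictly better on at least one (larger is better here).
    """
    front = []
    for p in points:
        dominated = any(
            all(q_i >= p_i for q_i, p_i in zip(q, p))
            and any(q_i > p_i for q_i, p_i in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (input_fill, response_fill) scores for four candidate designs.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.2, 0.9)]
front = pareto_front(candidates)
# (0.5, 0.5) is dropped because (0.6, 0.6) is better on both criteria.
```

The remaining designs represent different trade-offs between the two criteria, so none can be preferred without stating how much weight each criterion should get.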
Citations: 0