
Latest articles from Statistical Analysis and Data Mining: The ASA Data Science Journal

Randomized algorithms for tensor response regression
Pub Date : 2022-11-21 DOI: 10.1002/sam.11603
Zhe Cheng, Xiangjian Xu, Zihao Song, Weihua Zhao
In this paper, we consider estimation algorithms for the regression of a tensor response on vector covariates. Based on tensor projection theory and randomized algorithms for tensor decomposition, we propose three new algorithms, named SHOLRR, RHOLRR and RSHOLRR, under the low-rank Tucker decomposition, and provide theoretical analyses for two randomized algorithms. To explore the nonlinear relationship between the tensor response and the vector covariates, we develop the KRSHOLRR algorithm, which combines the kernel trick with the RSHOLRR algorithm. The proposed algorithms achieve high estimation accuracy and are fast to compute, especially for higher-order tensor responses. Through extensive analyses of synthetic data and applications to two real datasets, we demonstrate that the proposed algorithms outperform the state of the art.
Citations: 0
Local support vector machine based dimension reduction
Pub Date : 2022-10-17 DOI: 10.1002/sam.11600
Linxi Li, Qin Wang, Chenlu Ke
Motivated by several recent works that bring support vector machines into sufficient dimension reduction research, we propose a local support vector machine based dimension reduction approach. The proposal handles continuous and binary responses, and linear and nonlinear dimension reduction, in a unified framework. Localization also helps relax the stringent probabilistic assumptions required by global methods. Numerical experiments and a real data application demonstrate the efficacy of the proposed approach.
Citations: 1
Frequentist model averaging for zero-inflated Poisson regression models
Pub Date : 2022-10-05 DOI: 10.1002/sam.11598
Jianhong Zhou, Alan T. K. Wan, Dalei Yu
This paper considers frequentist model averaging for estimating the unknown parameters of the zero-inflated Poisson regression model. Our proposed weight choice procedure is based on minimizing an unbiased estimator of a conditional quadratic loss function. We prove that the resulting model average estimator is asymptotically optimal, and simulation studies show that it improves finite sample properties over the two commonly used information-based model selection estimators and their model average counterparts. The proposed method is illustrated by a real data example.
Citations: 0
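To make the zero-inflated Poisson entry above more concrete, here is a minimal sketch of the standard zero-inflated Poisson probability mass function, its log-likelihood, and a generic weighted model average. The function names (`zip_pmf`, `zip_loglik`, `model_average`) are hypothetical, the parameters are fixed scalars rather than regression functions of covariates, and the weights are supplied by hand: the paper's weight choice criterion is not reproduced here.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson: extra zeros with probability pi,
    otherwise a Poisson(lam) draw."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_loglik(counts, lam, pi):
    """Log-likelihood of independent counts under fixed (lam, pi)."""
    return sum(math.log(zip_pmf(y, lam, pi)) for y in counts)


def model_average(estimates, weights):
    """Weighted average of candidate-model estimates; weights must lie on the simplex.
    How the weights are chosen is the substance of the paper and is NOT shown here."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(w * e for w, e in zip(weights, estimates))
```

In a regression setting, `lam` and `pi` would typically be linked to covariates (for example through log and logit links), and each candidate model would contribute one estimate to `model_average`.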
Feature screening of ultrahigh dimensional longitudinal data based on the C-statistic
Pub Date : 2022-09-26 DOI: 10.1002/sam.11597
Peng Lai, Qing Di, Zhezi Shen, Yanqiu Zhou
This paper considers feature screening for ultrahigh dimensional semiparametric linear models with longitudinal data. The C-statistic, which measures the rank concordance between predictors and outcomes, is generalized to longitudinal data. On the basis of the C-statistic and score equation theory, we propose a feature screening method named LCSIS. Because it builds on a smoothing technique and the score equations, the proposed screening procedure is easy to compute and satisfies feature screening consistency. Furthermore, Monte Carlo simulation studies and a real data application are conducted to examine the finite sample performance of the proposed procedure.
Citations: 0
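The rank concordance idea behind the C-statistic in the entry above can be sketched for the simple cross-sectional case. This is an illustration only, not the longitudinal LCSIS procedure: the names `c_statistic` and `screen_features` are hypothetical, ties in the predictor are counted as one half by assumption, and ranking features by the distance of their C-statistic from 0.5 is a common marginal screening heuristic rather than the authors' score-equation method.

```python
from itertools import combinations


def c_statistic(x, y):
    """Share of usable pairs whose predictor ordering agrees with the outcome ordering.
    Pairs tied on the outcome are skipped; pairs tied on the predictor count as 0.5."""
    concordant = 0.0
    usable = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if yi == yj:
            continue  # no outcome ordering to agree or disagree with
        usable += 1
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            concordant += 1.0
        elif sign == 0:
            concordant += 0.5
    if usable == 0:
        raise ValueError("all outcomes are tied")
    return concordant / usable


def screen_features(columns, y, keep):
    """Rank features by |C - 0.5| and return the names of the top `keep`."""
    scores = {name: abs(c_statistic(values, y) - 0.5) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]
```

A value near 1 or near 0 indicates strong positive or negative rank association with the outcome, while a value near 0.5 indicates none, which is why the screening score is the distance from 0.5.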
Nonparametric clustering of RNA-sequencing data
Pub Date : 2022-09-23 DOI: 10.1002/sam.11638
Gabriel L. Lozano, Nadia M. Atallah, M. Levine
Identification of clusters of co-expressed genes in transcriptomic data is a difficult task. Most algorithms used for this purpose can be classified into two broad categories: distance-based or model-based approaches. Distance-based approaches typically utilize a distance function between pairs of data objects and group similar objects together into clusters. Model-based approaches use the mixture-modeling framework. Compared to distance-based approaches, model-based approaches offer better interpretability because each cluster can be explicitly characterized in terms of the proposed model. However, these models present a particular difficulty in identifying a correct multivariate distribution that a mixture can be based upon. In this manuscript, we first review some of the approaches used to select a distribution for the needed mixture model. Then, we propose avoiding this problem altogether by using a nonparametric MSL (Maximum Smoothed Likelihood) algorithm. This algorithm was proposed earlier in the statistical literature but has not been, to the best of our knowledge, applied to transcriptomics data. The salient feature of this approach is that it avoids explicit specification of distributions of individual biological samples altogether, thus making the task of a practitioner easier. When used on a real dataset, the algorithm produces a large number of biologically meaningful clusters and compares favorably to the two other mixture-based algorithms commonly used for RNA-seq data clustering. Our code is publicly available on GitHub at https://github.com/Matematikoi/non_parametric_clustering.
Citations: 0
Machine learning and neural network based model predictions of soybean export shares from US Gulf to China
Pub Date : 2022-09-07 DOI: 10.1002/sam.11595
Shantanu Awasthi, I. Sengupta, W. Wilson, Prithviraj Lakkakula
In this paper, we propose a general model for soybean export market share dynamics and provide several theoretical analyses of a special case of the general model. We implement machine learning and neural network algorithms to train, analyze, and predict US Gulf soybean market shares to China (the target variable) using weekly time series data consisting of several features between January 6, 2012 and January 3, 2020. Overall, the results indicate that US Gulf soybean market shares to China are volatile and can be effectively explained (predicted) using a set of logical input variables. Several variables show significant influence in predicting soybean market shares: shipments due at US Gulf port in 10 days, the cost of transporting soybean shipments via barge at Mid-Mississippi, soybean exports loaded at US Gulf port in the past 7 days, and binary variables.
Citations: 2
An auxiliary Part-of-Speech tagger for blog and microblog cyber-slang
Pub Date : 2022-09-06 DOI: 10.1002/sam.11596
Silvia Golia, Paola Zola
The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. State-of-the-art Part-of-Speech (POS) taggers are often unable to assign a meaningful POS tag to all the words in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns the POS tag to a given token based on information derived from the sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is that it does not need information about the tokens themselves, since it relies only on the sequences of existing POS tags. The tagger is called auxiliary because it requires an initial POS tagging procedure, which might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed of Twitter messages, and on a corpus of manually labeled Web 2.0 data.
Citations: 0
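The core idea of the auxiliary tagger entry above, predicting a missing tag from the tags around it without looking at the token, can be illustrated with a deliberately simple stand-in. The sketch below is not the authors' Bayesian network: it only uses one preceding and one following tag, picks the most frequent tag seen in that context, and all names (`fit_context_counts`, `predict_tag`, the `<s>` and `</s>` boundary markers, the `X` fallback tag) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_context_counts(tagged_sentences):
    """Count how often each tag occurs between a given (previous, next) tag pair."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        padded = ["<s>"] + list(tags) + ["</s>"]  # sentence boundary markers
        for prev_tag, cur_tag, next_tag in zip(padded, padded[1:], padded[2:]):
            counts[(prev_tag, next_tag)][cur_tag] += 1
    return counts


def predict_tag(counts, prev_tag, next_tag, default="X"):
    """Most frequent tag observed in this context, or `default` if the context is unseen."""
    options = counts.get((prev_tag, next_tag))
    if not options:
        return default
    return options.most_common(1)[0][0]


# Toy training data: sequences of POS tags only, no tokens.
training = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "VERB"],
]
context_counts = fit_context_counts(training)
guess = predict_tag(context_counts, "DET", "VERB")
```

A token left untagged by the initial tagging pass would be filled in with `predict_tag` from its neighbours' tags; a longer context window or a probabilistic model would replace the raw counts in a more faithful version.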
Out-of-bag stability estimation for k-means clustering
Pub Date : 2022-08-03 DOI: 10.1002/sam.11593
Tianmou Liu, Han Yu, R. Blair
Clustering data is a challenging problem in unsupervised learning where there is no gold standard. Results depend on several factors, such as the selection of a clustering method, measures of dissimilarity, parameters, and the determination of the number of reliable groupings. Stability has become a valuable surrogate for performance and robustness that can give an investigator insight into the quality of a clustering and guidance on subsequent cluster prioritization. This work develops a framework for stability measurements that is based on resampling and out-of-bag (OB) estimation. Bootstrapping methods for cluster stability can be prone to overfitting, in a setting that is analogous to poor delineation of test and training sets in supervised learning. Stability that relies on OB items from a resampling overcomes these issues and does not depend on a reference clustering for comparisons. Furthermore, OB stability can provide estimates at the level of the item, the cluster, and as an overall summary, which has good interpretive value. This framework is extended to develop stability estimates for determining the number of clusters (model selection) through contrasts between stability estimates on clustered data and stability estimates of clustered reference data with no signal. These contrasts form stability profiles that can be used to identify the largest differences in stability and do not require a direct threshold on stability values, which tend to be data specific. These approaches can be implemented using the R package bootcluster, which is available on the Comprehensive R Archive Network.
Citations: 0
Distributed dimension reduction with nearly oracle rate
Pub Date : 2022-08-03 DOI: 10.1002/sam.11592
Zhengtian Zhu, Liping Zhu
We consider sufficient dimension reduction for heterogeneous massive data. We show that, even in the presence of heterogeneity and nonlinear dependence, the minimizers of convex loss functions of linear regression fall into the central subspace at the population level. We suggest a distributed algorithm to perform sufficient dimension reduction, where the convex loss functions are approximated with surrogate quadratic losses. This allows dimension reduction to be performed in a unified least squares framework and makes it easy to transmit the gradients in our distributed algorithm. The minimizers of these surrogate quadratic losses attain a nearly oracle rate after a finite number of iterations. We conduct simulations and an application to demonstrate the effectiveness of our proposed distributed algorithm for heterogeneous massive data.
Citations: 1
A novel Bayesian method for variable selection and estimation in binary quantile regression
Pub Date : 2022-07-23 DOI: 10.1002/sam.11591
Mai Dao, Min Wang, Souparno Ghosh
In this paper, we develop a Bayesian hierarchical model and an associated computation strategy for simultaneously conducting parameter estimation and variable selection in binary quantile regression. We specify the customary asymmetric Laplace distribution for the error term and assign quantile-dependent priors to the regression coefficients, together with a binary vector that identifies the model configuration. Using the normal-exponential mixture representation of the asymmetric Laplace distribution, we develop a novel three-stage computational scheme: an expectation-maximization algorithm, then a Gibbs sampler, then an importance re-weighting step that draws nearly independent Markov chain Monte Carlo samples from the full posterior distributions of the unknown parameters. Simulation studies are conducted to compare the performance of the proposed Bayesian method with that of several existing methods in the literature. Finally, two real-data applications are provided for illustrative purposes.
Citations: 0
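The normal-exponential mixture representation mentioned in the entry above can be checked numerically. The sketch below assumes the commonly used unit-scale parametrization, in which an asymmetric Laplace error at quantile level tau is written as theta * z + psi * sqrt(z) * u with z exponential with mean 1 and u standard normal; the names `check_loss` and `sample_ald` are hypothetical, and none of the paper's three-stage sampler is reproduced.

```python
import math
import random


def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def sample_ald(tau, rng):
    """One draw from a unit-scale asymmetric Laplace distribution via the
    normal-exponential mixture (assumed parametrization, see lead-in)."""
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    psi = math.sqrt(2.0 / (tau * (1.0 - tau)))
    z = rng.expovariate(1.0)   # exponential mixing variable with mean 1
    u = rng.gauss(0.0, 1.0)    # standard normal
    return theta * z + psi * math.sqrt(z) * u


# The tau-th quantile of the error should be zero, so roughly a share tau
# of the draws should fall at or below zero.
rng = random.Random(0)
tau = 0.25
draws = [sample_ald(tau, rng) for _ in range(50000)]
share_nonpositive = sum(d <= 0 for d in draws) / len(draws)
```

Conditional on `z`, the error is normal, which is what makes Gibbs-type updates convenient in Bayesian quantile regression; in the binary case this error would enter a latent continuous response that is only observed through its sign.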