
Statistical Analysis and Data Mining: Latest Articles

Reduced Rank Ridge Regression and Its Kernel Extensions.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2011-12-01 | Epub Date: 2011-10-07 | DOI: 10.1002/sam.10138
Ashin Mukherjee, Ji Zhu

In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the predictor variables or the coefficient matrix being of lower rank. To accommodate both, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with the reduced rank constraint on the coefficient matrix to come up with a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.
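As a rough sketch of how a ridge penalty and a rank constraint can be combined, the snippet below solves the multivariate ridge problem and then truncates the fit to rank r by projecting the fitted values onto their top-r right singular subspace. This is illustrative code under my own simplifications, not the authors' implementation, and all names and data are hypothetical.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam=1.0, rank=2):
    """Ridge fit followed by a rank-r truncation: solve the multivariate
    ridge problem, then project the fitted values onto their top-r right
    singular subspace. (Illustrative sketch, not the authors' code.)"""
    p = X.shape[1]
    # Ridge solution: (X'X + lam*I)^{-1} X'Y
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    # SVD of the fitted-value matrix gives the projection directions
    _, _, Vt = np.linalg.svd(X @ B_ridge, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]       # rank-r projection in response space
    return B_ridge @ P                # coefficient matrix of rank <= r

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
B_true = rng.normal(size=(5, 1)) @ rng.normal(size=(1, 4))   # rank-1 truth
Y = X @ B_true + 0.1 * rng.normal(size=(50, 4))
B_hat = reduced_rank_ridge(X, Y, lam=0.5, rank=1)
print(np.linalg.matrix_rank(B_hat))  # 1
```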

{"title":"Reduced Rank Ridge Regression and Its Kernel Extensions.","authors":"Ashin Mukherjee,&nbsp;Ji Zhu","doi":"10.1002/sam.10138","DOIUrl":"https://doi.org/10.1002/sam.10138","url":null,"abstract":"<p><p>In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the prediction variables or the coefficient matrix being lower rank. To accommodate both, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with the reduced rank constraint on the coefficient matrix to come up with a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"612-622"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10138","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30919516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 53
Clustering Based on Periodicity in High-Throughput Time Course Data.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2011-12-01 | DOI: 10.1002/sam.10137
Anna J Blackstock, Amita K Manatunga, Youngja Park, Dean P Jones, Tianwei Yu

Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. Methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the parameter estimate should be accounted for in the clustering procedure.
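The periodic-regression step can be illustrated by fitting a sinusoid at each candidate period and keeping the period with the smallest residual sum of squares. The sketch below is a simplified stand-in for that step only (it omits the mixture-model clustering), and the data are hypothetical.

```python
import numpy as np

def estimate_period(t, y, periods):
    """Grid-search periodic regression: for each candidate period T, fit
    y ~ a + b*cos(2*pi*t/T) + c*sin(2*pi*t/T) by least squares and return
    the T with the smallest residual sum of squares."""
    best_T, best_rss = None, np.inf
    for T in periods:
        D = np.column_stack([np.ones_like(t),
                             np.cos(2 * np.pi * t / T),
                             np.sin(2 * np.pi * t / T)])
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)
        rss = np.sum((y - D @ coef) ** 2)
        if rss < best_rss:
            best_T, best_rss = T, rss
    return best_T

rng = np.random.default_rng(1)
t = np.linspace(0, 24, 200)                       # hours
y = 2.0 * np.cos(2 * np.pi * t / 12.0) + 0.1 * rng.normal(size=t.size)
T_hat = estimate_period(t, y, periods=np.arange(6, 25))
print(T_hat)  # 12
```

In the paper, the per-peak period estimates (and their estimation variance) would then feed a mixture model to form clusters.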

{"title":"Clustering Based on Periodicity in High-Throughput Time Course Data.","authors":"Anna J Blackstock,&nbsp;Amita K Manatunga,&nbsp;Youngja Park,&nbsp;Dean P Jones,&nbsp;Tianwei Yu","doi":"10.1002/sam.10137","DOIUrl":"https://doi.org/10.1002/sam.10137","url":null,"abstract":"<p><p>Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. Methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the parameter estimate should be accounted for in the clustering procedure.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"579-589"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10137","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31503030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2011-12-01 | DOI: 10.1002/sam.10141
Shuo Chen, F DuBois Bowman

Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.
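The central idea of weighting scanning sessions by a temporal trend before classification can be sketched as follows. This toy version fixes the trend weights and uses a plain gradient-descent logistic regression as a stand-in classifier, instead of the paper's jointly optimized SVC with an augmented kernel; the weights and data are hypothetical.

```python
import numpy as np

def combine_sessions(X_sessions, w):
    """Collapse longitudinal data (sessions x subjects x features) into one
    feature matrix using temporal-trend weights w, one per session."""
    return np.tensordot(w, X_sessions, axes=1)

def fit_logistic(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (stand-in classifier)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def accuracy(X, y, w):
    return float(np.mean((X @ w > 0) == y))

rng = np.random.default_rng(2)
n, d, sessions = 120, 4, 3
y = rng.integers(0, 2, size=n).astype(float)
# class signal strengthens across sessions: a temporal trend
X_sessions = np.stack([rng.normal(size=(n, d)) + (s + 1) * np.outer(y - 0.5, np.ones(d))
                       for s in range(sessions)])
w_trend = np.array([0.2, 0.3, 0.5])    # hypothetical fixed trend weights
X = combine_sessions(X_sessions, w_trend)
coef = fit_logistic(X, y)
print(accuracy(X, y, coef))
```

In the paper, the trend weights are estimated jointly with the separating hyperplane rather than fixed in advance.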

{"title":"A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data.","authors":"Shuo Chen, F DuBois Bowman","doi":"10.1002/sam.10141","DOIUrl":"10.1002/sam.10141","url":null,"abstract":"<p><p>Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's disease Neuroimaging Initiative. 
The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"604-611"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189187/pdf/nihms-629358.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32742225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Space-efficient tracking of persistent items in a massive data stream
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2011-07-11 | DOI: 10.1145/2002259.2002294
Bibudh Lahiri, S. Tirthapura, J. Chandrashekar
Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur "regularly" in the stream. In contrast with heavy-hitters, persistent items do not necessarily contribute significantly to the volume of a stream, and may escape detection by traditional volume-based anomaly detectors. We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset show that in typical cases, the algorithm achieves a physical space compression of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume-based.
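The exact (large-space) notion of persistence that the paper starts from can be stated in a few lines: an item is persistent if it appears in at least an alpha fraction of the time windows. The sketch below implements only this exact baseline on a toy stream; the paper's contribution is a small-space approximation of it (via hash-based sampling), which this snippet does not reproduce.

```python
from collections import defaultdict

def persistent_items(stream, n_windows, alpha):
    """Exact-baseline persistence: an item is persistent if it appears in
    at least alpha * n_windows distinct time windows. This needs space
    linear in the number of distinct items -- the infeasibility the paper's
    small-space algorithm is designed to avoid."""
    windows = defaultdict(set)   # item -> set of window ids it appeared in
    for window_id, item in stream:
        windows[item].add(window_id)
    return {item for item, seen in windows.items() if len(seen) >= alpha * n_windows}

# 'a' appears once in every window (persistent but low volume);
# 'b' is bursty: many occurrences, but all inside a single window
stream = [(w, 'a') for w in range(10)] + [(0, 'b')] * 50
persistent = persistent_items(stream, n_windows=10, alpha=0.8)
print(persistent)  # {'a'}
```

Note how a volume-based heavy-hitter detector would flag 'b' and miss 'a'; persistence inverts that.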
{"title":"Space-efficient tracking of persistent items in a massive data stream","authors":"Bibudh Lahiri, S. Tirthapura, J. Chandrashekar","doi":"10.1145/2002259.2002294","DOIUrl":"https://doi.org/10.1145/2002259.2002294","url":null,"abstract":"Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur \"regularly\" in the stream. In contrast with heavy-hitters, persistent items do not necessarily contribute significantly to the volume of a stream, and may escape detection by traditional volume-based anomaly detectors.\u0000 We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset shows that in typical cases, the algorithm achieves a physical space compression of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume based.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"45 1 1","pages":"70-92"},"PeriodicalIF":1.3,"publicationDate":"2011-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77904624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2011-06-01 | DOI: 10.1002/sam.10110
Yulan Liang, Arpad Kelemen

Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum number of SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with an embedded entropy algorithm that handles redundancy when selecting the SNPs with the best disease-prediction performance. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulated data sets and two real disease data sets. Results show that, on average, our proposed method outperforms the well-known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic-regression-based SNP selection for disease classification.
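A toy version of sequential selection with an entropy check might look like the following: at each step, add the feature that most reduces residual error, skipping zero-entropy (uninformative) candidates. An ordinary least-squares fit stands in for the paper's support vector regression, and the data and thresholds are hypothetical.

```python
import numpy as np

def entropy(x):
    """Empirical Shannon entropy (bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def greedy_select(X, y, k):
    """Sequential selection sketch: grow the feature set greedily by
    residual-error reduction, filtering out zero-entropy candidates.
    A simplified stand-in for sequential SVR with embedded entropy."""
    selected = []
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected or entropy(X[:, j]) < 1e-9:
                continue
            cols = X[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ coef) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        selected.append(best)
    return selected

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(80, 10)).astype(float)   # SNP genotypes coded 0/1/2
X[:, 7] = 1.0                                         # constant column: zero entropy, skipped
y = 2 * X[:, 2] - X[:, 5] + 0.1 * rng.normal(size=80)
selected = greedy_select(X, y, k=2)
print(selected)  # [2, 5]
```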

{"title":"Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification.","authors":"Yulan Liang,&nbsp;Arpad Kelemen","doi":"10.1002/sam.10110","DOIUrl":"https://doi.org/10.1002/sam.10110","url":null,"abstract":"<p><p>Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with embedded entropy algorithm to deal with the redundancy for the selection of the SNPs that have best prediction performance of diseases. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that on the average, our proposed method outperforms the well known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression based SNP selections for disease classification.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 3","pages":"301-312"},"PeriodicalIF":1.3,"publicationDate":"2011-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10110","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29930336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A Machine-Learning Approach to Detecting Unknown Bacterial Serovars.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2010-10-01 | DOI: 10.1002/sam.10085
Ferit Akova, Murat Dundar, V Jo Davisson, E Daniel Hirleman, Arun K Bhunia, J Paul Robinson, Bartek Rajwa
Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light‐scattering sensor recently developed for real‐time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and more importantly the effect of their high mutation rate would not allow for a practical and manageable training. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unknown bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28‐class bacteria dataset and also on the benchmark 26‐class letter recognition dataset for further validation. The proposed approach is compared against state‐of‐the‐art involving density‐based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 289‐301, 2010
{"title":"A Machine-Learning Approach to Detecting Unknown Bacterial Serovars.","authors":"Ferit Akova,&nbsp;Murat Dundar,&nbsp;V Jo Davisson,&nbsp;E Daniel Hirleman,&nbsp;Arun K Bhunia,&nbsp;J Paul Robinson,&nbsp;Bartek Rajwa","doi":"10.1002/sam.10085","DOIUrl":"https://doi.org/10.1002/sam.10085","url":null,"abstract":"Technologies for rapid detection of bacterial pathogens are crucial for securing the food supply. A light‐scattering sensor recently developed for real‐time identification of multiple colonies has shown great promise for distinguishing bacteria cultures. The classification approach currently used with this system relies on supervised learning. For accurate classification of bacterial pathogens, the training library should be exhaustive, i.e., should consist of samples of all possible pathogens. Yet, the sheer number of existing bacterial serovars and more importantly the effect of their high mutation rate would not allow for a practical and manageable training. In this study, we propose a Bayesian approach to learning with a nonexhaustive training dataset for automated detection of unknown bacterial serovars, i.e., serovars for which no samples exist in the training library. The main contribution of our work is the Wishart conjugate priors defined over class distributions. This allows us to employ the prior information obtained from known classes to make inferences about unknown classes as well. By this means, we identify new classes of informational value and dynamically update the training dataset with these classes to make it increasingly more representative of the sample population. This results in a classifier with improved predictive performance for future samples. We evaluated our approach on a 28‐class bacteria dataset and also on the benchmark 26‐class letter recognition dataset for further validation. 
The proposed approach is compared against state‐of‐the‐art involving density‐based approaches and support vector domain description, as well as a recently introduced Bayesian approach based on simulated classes. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 289‐301, 2010","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 5","pages":"289-301"},"PeriodicalIF":1.3,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10085","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30319662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Model selection procedure for high-dimensional data.
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2010-10-01 | DOI: 10.1002/sam.10088
Yongli Zhang, Xiaotong Shen

For high-dimensional regression, the number of predictors may greatly exceed the sample size, but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and the infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound for selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms backward variable selection in terms of price forecasting accuracy.
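The overall recipe, computing a nested model path and then scoring each model on the path with an information criterion, can be sketched as below. Greedy forward selection stands in for LAR, and the penalty c*k*log(p) is a generic high-dimensional surrogate, not the paper's exact RIC(c) formula; all constants and data are hypothetical.

```python
import numpy as np

def forward_path(X, y, max_k):
    """Greedy forward-selection path, a lightweight stand-in for the LAR
    path: returns the nested sequence of (selected set, residual RSS)."""
    selected, path = [], []
    for _ in range(max_k):
        best, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = X[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = np.sum((y - cols @ coef) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        selected = selected + [best]
        path.append((list(selected), best_rss))
    return path

def select_model(path, n, p, c=4.0):
    """Score each model on the path with n*log(RSS/n) + c*k*log(p); the
    log(p) term mimics criteria that adapt to a high-dimensional model
    space (the paper's exact RIC(c) penalty differs)."""
    scores = [n * np.log(rss / n) + c * len(S) * np.log(p) for S, rss in path]
    return path[int(np.argmin(scores))][0]

rng = np.random.default_rng(4)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 2 * X[:, 1] + 0.5 * rng.normal(size=n)
best_model = select_model(forward_path(X, y, max_k=4), n, p)
print(sorted(best_model))
```

Here the criterion recovers the two truly active predictors from the path without an exhaustive search over all 2^20 subsets.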

{"title":"Model selection procedure for high-dimensional data.","authors":"Yongli Zhang,&nbsp;Xiaotong Shen","doi":"10.1002/sam.10088","DOIUrl":"https://doi.org/10.1002/sam.10088","url":null,"abstract":"<p><p>For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RIC(c), which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 5","pages":"350-358"},"PeriodicalIF":1.3,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10088","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29500256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Discriminative frequent subgraph mining with optimality guarantees
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2010-10-01 | DOI: 10.1002/SAM.V3:5
Marisa Thoma, Hong Cheng, A. Gretton, Jiawei Han, H. Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan, Karsten M. Borgwardt
The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010
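The greedy use of a submodular separation criterion can be illustrated on binary subgraph-indicator features: repeatedly add the feature that most reduces the number of between-class example pairs left unseparated. This is a simplified correspondence-style count of my own, not the exact CORK criterion or its gSpan integration, and the toy data are hypothetical.

```python
from itertools import combinations

def unseparated_pairs(features, selected, labels):
    """Count (pos, neg) example pairs that agree on every selected binary
    feature -- the pairs the current feature set fails to separate."""
    bad = 0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] != labels[j] and all(features[i][f] == features[j][f] for f in selected):
            bad += 1
    return bad

def greedy_cork(features, labels, k):
    """Greedily add the feature giving the largest drop in unseparated
    pairs; with a submodular objective, greedy selection is near-optimal."""
    selected = []
    for _ in range(k):
        best = min((f for f in range(len(features[0])) if f not in selected),
                   key=lambda f: unseparated_pairs(features, selected + [f], labels))
        selected.append(best)
    return selected

# rows: subgraph-indicator vectors for six graphs; labels: graph classes
features = [[1, 0, 1], [1, 0, 0], [1, 1, 1],
            [0, 0, 1], [0, 0, 0], [0, 1, 1]]
labels = [1, 1, 1, 0, 0, 0]
picked = greedy_cork(features, labels, k=1)
print(picked)  # [0]
```

Feature 0 separates every between-class pair here, so the greedy step finds it immediately.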
{"title":"Discriminative frequent subgraph mining with optimality guarantees","authors":"Marisa Thoma, Hong Cheng, A. Gretton, Jiawei Han, H. Kriegel, Alex Smola, Le Song, Philip S. Yu, Xifeng Yan, Karsten M. Borgwardt","doi":"10.1002/SAM.V3:5","DOIUrl":"https://doi.org/10.1002/SAM.V3:5","url":null,"abstract":"The goal of frequent subgraph mining is to detect subgraphs that frequently occur in a dataset of graphs. In classification settings, one is often interested in discovering discriminative frequent subgraphs, whose presence or absence is indicative of the class membership of a graph. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality function criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 302-318, 2010","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 1","pages":"302-318"},"PeriodicalIF":1.3,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"51496964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Large-scale regression-based pattern discovery: The example of screening the WHO global drug safety database
IF 1.3 | Mathematics (CAS Tier 4) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2010-08-01 | DOI: 10.1002/SAM.V3:4
O. Caster, G. N. Norén, D. Madigan, A. Bate
Most measures of interestingness for patterns of co-occurring events are based on data projections onto contingency tables for the events of primary interest. As an alternative, this article presents the first implementation of shrinkage logistic regression for large-scale pattern discovery, with an evaluation of its usefulness in real-world binary transaction data. Regression accounts for the impact of other covariates that may confound or otherwise distort associations. The application considered is international adverse drug reaction (ADR) surveillance, in which large collections of reports on suspected ADRs are screened for interesting reporting patterns worthy of clinical follow-up. Our results show that regression-based pattern discovery does offer practical advantages. Specifically it can eliminate false positives and false negatives due to other covariates. Furthermore, it identifies some established drug safety issues earlier than a measure based on contingency tables. While regression offers clear conceptual advantages, our results suggest that methods based on contingency tables will continue to play a key role in ADR surveillance, for two reasons: the failure of regression to identify some established drug safety concerns as early as the currently used measures, and the relative lack of transparency of the procedure to estimate the regression coefficients. This suggests shrinkage regression should be used in parallel to existing measures of interestingness in ADR surveillance and other large-scale pattern discovery applications. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 197-208, 2010
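The advantage of regression over contingency-table measures, adjusting for co-reported drugs, can be shown with a minimal L2-shrunk logistic regression on simulated reports: one drug truly drives the reaction, a second is merely co-reported with it. The shrinkage prior, data, and names here are simplified assumptions, not the paper's exact model.

```python
import numpy as np

def shrinkage_logistic(X, y, lam=1.0, lr=0.1, steps=2000):
    """L2-shrunk logistic regression fit by gradient descent -- a minimal
    sketch of regression-based pattern discovery (the paper's specific
    shrinkage prior differs)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        prob = 1 / (1 + np.exp(-X @ w))
        w -= lr * (X.T @ (prob - y) + lam * w) / n
    return w

rng = np.random.default_rng(5)
n = 2000
drug_a = rng.integers(0, 2, n)               # truly linked to the reaction
drug_b = drug_a | rng.integers(0, 2, n)      # frequently co-reported with A
X = np.column_stack([drug_a, drug_b]).astype(float)
y = (rng.random(n) < 0.05 + 0.4 * drug_a).astype(float)
w = shrinkage_logistic(X, y)
print(w[0] > w[1])  # True: regression credits drug A, not its co-reported partner
```

A marginal (contingency-table) measure would flag both drugs here, since drug B is also associated with the reaction through its overlap with drug A.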
Citations: 35
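As a toy illustration of the shrinkage idea, the sketch below fits an L2-penalised ("shrinkage") logistic regression by plain gradient descent. The drug/covariate indicators and report data are invented; this is a minimal sketch of the regression family involved, not the screening procedure evaluated in the article.

```python
import math

def fit_shrinkage_logistic(X, y, lam=0.1, lr=0.1, steps=2000):
    """L2-penalised logistic regression via gradient descent.  The penalty
    shrinks weakly supported coefficients toward zero, which is the point
    when screening huge report databases for spurious associations."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        gw, gb = [lam * wj for wj in w], 0.0  # penalty gradient; intercept unpenalised
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(p):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gwj / n for wj, gwj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

# Invented toy reports: column 0 = suspected drug, column 1 = a co-reported
# covariate balanced across classes; y = 1 if the reaction was reported.
X = [[1, 1], [1, 0], [1, 1], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_shrinkage_logistic(X, y)
p_drug = 1.0 / (1.0 + math.exp(-(b + w[0])))  # predicted risk when the drug is present
```

Because the covariate occurs equally often in both classes, its coefficient is shrunk to roughly zero while the drug coefficient stays clearly positive — the regression-based adjustment for confounding that contingency-table measures cannot perform.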
Multicategory Composite Least Squares Classifiers.
IF 1.3 Zone 4 (Mathematics) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2010-08-01 DOI: 10.1002/sam.10081
Seo Young Park, Yufeng Liu, Dacheng Liu, Paul Scholl

Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary to have sound theoretical properties for the multicategory classifiers. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large scale problems, especially for problems with large number of classes. Furthermore, the SVM cannot produce class probability estimation directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with large number of classes, asymptotic consistency, ability to handle high dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.

Citations: 4
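The flavour of a squared-loss multicategory classifier can be sketched with plain one-vs-rest ridge least squares: regress each class indicator on the features and predict the argmax score. This is a deliberately simplified relative of the paper's composite squared-loss CLS classifier, not its method, and the 2-D toy data are invented.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ls_multiclass(X, y, n_classes, lam=1e-3):
    """One-vs-rest ridge least squares: for each class c, regress the
    indicator 1{y == c} on the bias-augmented features (closed form)."""
    Xb = [[1.0] + list(xi) for xi in X]
    p = len(Xb[0])
    gram = [[sum(r[a] * r[b] for r in Xb) + (lam if a == b else 0.0)
             for b in range(p)] for a in range(p)]
    W = []
    for c in range(n_classes):
        t = [1.0 if yi == c else 0.0 for yi in y]
        rhs = [sum(r[a] * ti for r, ti in zip(Xb, t)) for a in range(p)]
        W.append(solve(gram, rhs))
    return W

def predict(W, xi):
    """Assign the class whose linear score is largest."""
    xb = [1.0] + list(xi)
    scores = [sum(wa * xa for wa, xa in zip(w, xb)) for w in W]
    return max(range(len(scores)), key=scores.__getitem__)

# Invented toy data: three well-separated 2-D clusters.
X = [(0.0, 0.0), (0.2, 0.1), (2.0, 0.0), (2.1, 0.2), (0.0, 2.0), (0.1, 2.1)]
y = [0, 0, 1, 1, 2, 2]
W = fit_ls_multiclass(X, y, 3)
```

Because each class is fitted by a closed-form linear solve against a shared Gram matrix, the cost of adding classes is one extra right-hand side — the efficiency-with-many-classes point the abstract emphasises.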
Journal
Statistical Analysis and Data Mining
Copyright © 2023 Book学术 All rights reserved.