Reduced Rank Ridge Regression and Its Kernel Extensions
Ashin Mukherjee, Ji Zhu
Statistical Analysis and Data Mining 4(6): 612-622 | Pub Date: 2011-12-01 | Epub Date: 2011-10-07 | DOI: 10.1002/sam.10138
In multivariate linear regression, the response matrix is often assumed to be intrinsically of lower rank. This could be due to the correlation structure among the predictor variables or because the coefficient matrix itself is of lower rank. To accommodate both possibilities, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with a reduced rank constraint on the coefficient matrix, which leads to a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) setting is also developed.
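The construction lends itself to a short linear-algebra sketch. Below is a minimal NumPy illustration, under our own assumptions, of one standard recipe for this kind of problem: compute the ridge solution, then reduce it to rank r by projecting onto the leading right singular vectors of the ridge-fitted response matrix. The function name and toy data are ours, and the paper's exact estimator and its kernel version may differ in detail.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, rank):
    """Hedged sketch: ridge estimate of the coefficient matrix, then a
    projection of that estimate onto the top-`rank` right singular
    vectors of the ridge-fitted responses."""
    n, p = X.shape
    # Ridge estimate of the full (p x q) coefficient matrix.
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    # SVD of the fitted values; keep the leading right singular vectors.
    _, _, Vt = np.linalg.svd(X @ B_ridge, full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]     # projection onto a rank-`rank` subspace
    return B_ridge @ P              # rank-constrained ridge coefficients

# Toy check: 50 samples, 10 predictors, 6 responses driven by a rank-2 signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
B_true = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 6))
Y = X @ B_true + 0.1 * rng.standard_normal((50, 6))
B_hat = reduced_rank_ridge(X, Y, lam=1.0, rank=2)
print(np.linalg.matrix_rank(B_hat))  # prints 2
```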
{"title":"Reduced Rank Ridge Regression and Its Kernel Extensions.","authors":"Ashin Mukherjee, Ji Zhu","doi":"10.1002/sam.10138","DOIUrl":"https://doi.org/10.1002/sam.10138","url":null,"abstract":"<p><p>In multivariate linear regression, it is often assumed that the response matrix is intrinsically of lower rank. This could be because of the correlation structure among the prediction variables or the coefficient matrix being lower rank. To accommodate both, we propose a reduced rank ridge regression for multivariate linear regression. Specifically, we combine the ridge penalty with the reduced rank constraint on the coefficient matrix to come up with a computationally straightforward algorithm. Numerical studies indicate that the proposed method consistently outperforms relevant competitors. A novel extension of the proposed method to the reproducing kernel Hilbert space (RKHS) set-up is also developed.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"612-622"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10138","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"30919516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering Based on Periodicity in High-Throughput Time Course Data
Anna J Blackstock, Amita K Manatunga, Youngja Park, Dean P Jones, Tianwei Yu
Statistical Analysis and Data Mining 4(6): 579-589 | Pub Date: 2011-12-01 | DOI: 10.1002/sam.10137
Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been applied to studies of the metabolite composition of biological fluids and tissues. Because metabolite levels change over time, a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on their periodic behavior. Periodic regression is used to estimate the period parameter for each individual NMR peak; a mixture model is then used to cluster the peaks, taking into account the variability of the regression parameter estimates. The method is applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to parameter estimation should be accounted for in the clustering procedure.
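As a concrete, hedged sketch of the pipeline the abstract describes, the snippet below fits a one-harmonic periodic (cosinor) regression to each peak with SciPy and then clusters the resulting period estimates with a Gaussian mixture. The model form, starting values, and function names are our assumptions; note also a simplification: the paper's mixture explicitly accounts for the per-peak estimation variance, whereas sklearn's GaussianMixture below clusters the point estimates only.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.mixture import GaussianMixture

def cosinor(t, mesor, amp, period, phase):
    # One-harmonic periodic regression model: baseline plus a sinusoid.
    return mesor + amp * np.cos(2 * np.pi * t / period + phase)

def fit_period(t, y, candidate_periods=(6.0, 12.0, 24.0)):
    """Fit the periodic regression for one peak from several starting
    periods; return the period estimate of the best fit (by RSS) and
    its standard error from the asymptotic covariance."""
    best = None
    for T0 in candidate_periods:
        try:
            popt, pcov = curve_fit(cosinor, t, y,
                                   p0=[y.mean(), y.std(), T0, 0.0], maxfev=5000)
        except RuntimeError:
            continue  # no convergence from this start
        rss = np.sum((y - cosinor(t, *popt)) ** 2)
        if best is None or rss < best[0]:
            best = (rss, popt[2], np.sqrt(pcov[2, 2]))
    return best[1], best[2]

# Simulated peaks over 24 hours, then clustering of the period estimates.
rng = np.random.default_rng(1)
t = np.linspace(0, 24, 48)
true_periods = [6, 6.2, 12, 12.3, 24, 23.5]
peaks = [cosinor(t, 0, 1, T, 0) + 0.2 * rng.standard_normal(t.size)
         for T in true_periods]
period_hat = np.array([fit_period(t, y)[0] for y in peaks]).reshape(-1, 1)
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(period_hat)
print(labels)  # peaks with similar periods land in the same cluster
```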
{"title":"Clustering Based on Periodicity in High-Throughput Time Course Data.","authors":"Anna J Blackstock, Amita K Manatunga, Youngja Park, Dean P Jones, Tianwei Yu","doi":"10.1002/sam.10137","DOIUrl":"https://doi.org/10.1002/sam.10137","url":null,"abstract":"<p><p>Nuclear magnetic resonance (NMR) spectroscopy, traditionally used in analytical chemistry, has recently been introduced to studies of metabolite composition of biological fluids and tissues. Metabolite levels change over time, and providing a tool for better extraction of NMR peaks exhibiting periodic behavior is of interest. We propose a method in which NMR peaks are clustered based on periodic behavior. Periodic regression is used to obtain estimates of the parameter corresponding to period for individual NMR peaks. A mixture model is then used to develop clusters of peaks, taking into account the variability of the regression parameter estimates. Methods are applied to NMR data collected from human blood plasma over a 24-hour period. Simulation studies show that the extra variance component due to the estimation of the parameter estimate should be accounted for in the clustering procedure.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"579-589"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10137","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"31503030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data
Shuo Chen, F DuBois Bowman
Statistical Analysis and Data Mining 4(6): 604-611 | Pub Date: 2011-12-01 | DOI: 10.1002/sam.10141
Recent technological advances have made it possible for many studies to collect high-dimensional data (HDD) longitudinally, for example, images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or response to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g., baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means of combining the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of the proposed methodology using a simulation study and a data example from the Alzheimer's Disease Neuroimaging Initiative. The results indicate that our method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.
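The paper estimates the hyperplane and trend parameters jointly via quadratic programming over an augmented kernel; as a rough, hedged stand-in that makes the idea concrete with off-the-shelf tools, the sketch below treats the temporal trend as a convex weighting of scanning sessions and picks the weights by cross-validated SVM accuracy on the weighted combination. The grid search and all names are ours, not the paper's algorithm.

```python
import numpy as np
from itertools import product
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def combine_sessions(X_sessions, w):
    # Collapse the longitudinal dimension: weighted sum across sessions.
    return sum(wt * Xt for wt, Xt in zip(w, X_sessions))

def fit_longitudinal_svc(X_sessions, y, grid=np.linspace(0.0, 1.0, 5)):
    """Grid-search convex session weights (a crude surrogate for the
    jointly estimated temporal trend parameters), scoring each candidate
    by cross-validated accuracy of a linear SVM."""
    best_w, best_score = None, -np.inf
    for w in product(grid, repeat=len(X_sessions)):
        if not np.isclose(sum(w), 1.0):   # keep only convex combinations
            continue
        Z = combine_sessions(X_sessions, w)
        score = cross_val_score(LinearSVC(dual=False), Z, y, cv=5).mean()
        if score > best_score:
            best_w, best_score = w, score
    clf = LinearSVC(dual=False).fit(combine_sessions(X_sessions, best_w), y)
    return clf, best_w

# Toy data where the class signal strengthens over the three sessions.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 60)
X_sessions = [rng.standard_normal((60, 30)) + 0.5 * (s + 1) * y[:, None]
              for s in range(3)]
clf, w = fit_longitudinal_svc(X_sessions, y)
print(w)  # later sessions should receive larger weights
```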
{"title":"A Novel Support Vector Classifier for Longitudinal High-dimensional Data and Its Application to Neuroimaging Data.","authors":"Shuo Chen, F DuBois Bowman","doi":"10.1002/sam.10141","DOIUrl":"10.1002/sam.10141","url":null,"abstract":"<p><p>Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 6","pages":"604-611"},"PeriodicalIF":1.3,"publicationDate":"2011-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189187/pdf/nihms-629358.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"32742225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Space-efficient tracking of persistent items in a massive data stream
Bibudh Lahiri, S. Tirthapura, J. Chandrashekar
Pages 70-92 | Pub Date: 2011-07-11 | DOI: 10.1145/2002259.2002294
Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream: items that occur "regularly" in the stream. In contrast with heavy hitters, persistent items do not necessarily contribute significantly to the volume of the stream and may escape detection by traditional volume-based anomaly detectors. We first show that any online algorithm that tracks persistent items exactly must use a large workspace and is therefore infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset show that in typical cases the algorithm achieves a physical space compression of 5x-7x while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal rather than volume-based.
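The abstract does not spell out the algorithm, but a standard small-space device for this problem, and one plausible reading of the paper's approach, is hash-based sampling of (item, window) pairs: an item enters the tracking table only when a hash of the pair falls below a sampling rate p, after which every new window it appears in is counted. A sketch under that assumption follows; the 1/p - 1 correction in the estimator is our own rough adjustment for windows missed before tracking began.

```python
import hashlib

def _hash01(item, window):
    # Deterministic hash of the (item, window) pair into [0, 1).
    digest = hashlib.sha1(f"{item}|{window}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class PersistentItemTracker:
    """Hedged sketch of hash-sampled persistence tracking; the published
    algorithm and its estimator may differ in detail."""
    def __init__(self, p=0.25):
        self.p = p         # sampling rate: smaller p -> smaller table
        self.table = {}    # item -> number of windows counted so far
        self.cur = None    # current window id (windows assumed in order)
        self.seen = set()  # items already counted in the current window

    def observe(self, item, window):
        if window != self.cur:
            self.cur, self.seen = window, set()
        if item in self.seen:
            return                      # count each window at most once
        self.seen.add(item)
        if item in self.table:
            self.table[item] += 1
        elif _hash01(item, window) < self.p:
            self.table[item] = 1        # begin tracking from this window

    def persistent(self, threshold):
        # Estimated persistence = count + expected windows missed pre-tracking.
        return [item for item, c in self.table.items()
                if c + 1 / self.p - 1 >= threshold]

trk = PersistentItemTracker(p=0.5)
for w in range(10):
    trk.observe("10.0.0.1", w)      # appears in every window: persistent
    if w % 5 == 0:
        trk.observe("10.0.0.2", w)  # appears in only two windows
print(trk.persistent(threshold=8))  # likely reports only "10.0.0.1"
```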
{"title":"Space-efficient tracking of persistent items in a massive data stream","authors":"Bibudh Lahiri, S. Tirthapura, J. Chandrashekar","doi":"10.1145/2002259.2002294","DOIUrl":"https://doi.org/10.1145/2002259.2002294","url":null,"abstract":"Motivated by scenarios in network anomaly detection, we consider the problem of detecting persistent items in a data stream, which are items that occur \"regularly\" in the stream. In contrast with heavy-hitters, persistent items do not necessarily contribute significantly to the volume of a stream, and may escape detection by traditional volume-based anomaly detectors.\u0000 We first show that any online algorithm that tracks persistent items exactly must necessarily use a large workspace, and is infeasible to run on a traffic monitoring node. In light of this lower bound, we introduce an approximate formulation of the problem and present a small-space algorithm to approximately track persistent items over a large data stream. Our experiments on a real traffic dataset shows that in typical cases, the algorithm achieves a physical space compression of 5x-7x, while incurring very few false positives (< 1%) and false negatives (< 4%). To our knowledge, this is the first systematic study of the problem of detecting persistent items in a data stream, and our work can help detect anomalies that are temporal, rather than volume based.","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"45 1 1","pages":"70-92"},"PeriodicalIF":1.3,"publicationDate":"2011-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77904624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification
Yulan Liang, Arpad Kelemen
Statistical Analysis and Data Mining 4(3): 301-312 | Pub Date: 2011-06-01 | DOI: 10.1002/sam.10110
Comprehensive evaluation of common genetic variations, through association of SNP structure with common diseases on a genome-wide scale, is currently an active area of human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimal set of SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with an embedded entropy algorithm that handles redundancy when selecting the SNPs with the best disease prediction performance. We implemented the proposed method for both SNP selection and disease classification and applied it to simulated data sets and two real disease data sets. Results show that, on average, our method outperforms the well-known approaches of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression-based SNP selection for disease classification.
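As a hedged reconstruction of the two ingredients the abstract names, the sketch below does forward selection scored by cross-validated SVR, with an entropy-based redundancy filter that skips any candidate SNP sharing too much mutual information with SNPs already selected. The threshold, scoring, and names are our assumptions, not the paper's exact sequential algorithm.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mutual_info_score

def sequential_svr_select(X, y, n_snps=5, mi_cap=0.3):
    """Forward SNP selection: at each step add the SNP that most improves
    cross-validated SVR performance, skipping candidates whose mutual
    information with any selected SNP exceeds mi_cap (redundancy filter).
    Assumes genotypes are coded discretely, e.g. 0/1/2."""
    selected = []
    while len(selected) < n_snps:
        best_j, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            if any(mutual_info_score(X[:, j], X[:, k]) > mi_cap
                   for k in selected):
                continue  # too redundant with an already-selected SNP
            cols = selected + [j]
            score = cross_val_score(SVR(kernel="linear"),
                                    X[:, cols], y, cv=3).mean()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            break  # every remaining candidate was filtered out
        selected.append(best_j)
    return selected
```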
{"title":"Sequential Support Vector Regression with Embedded Entropy for SNP Selection and Disease Classification.","authors":"Yulan Liang, Arpad Kelemen","doi":"10.1002/sam.10110","DOIUrl":"https://doi.org/10.1002/sam.10110","url":null,"abstract":"<p><p>Comprehensive evaluation of common genetic variations through association of SNP structure with common diseases on the genome-wide scale is currently a hot area in human genome research. For less costly and faster diagnostics, advanced computational approaches are needed to select the minimum SNPs with the highest prediction accuracy for common complex diseases. In this paper, we present a sequential support vector regression model with embedded entropy algorithm to deal with the redundancy for the selection of the SNPs that have best prediction performance of diseases. We implemented our proposed method for both SNP selection and disease classification, and applied it to simulation data sets and two real disease data sets. Results show that on the average, our proposed method outperforms the well known methods of Support Vector Machine Recursive Feature Elimination, logistic regression, CART, and logic regression based SNP selections for disease classification.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"4 3","pages":"301-312"},"PeriodicalIF":1.3,"publicationDate":"2011-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10110","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29930336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model selection procedure for high-dimensional data
Yongli Zhang, Xiaotong Shen
Statistical Analysis and Data Mining 3(5): 350-358 | Pub Date: 2010-10-01 | DOI: 10.1002/sam.10088
For high-dimensional regression, the number of predictors may greatly exceed the sample size, yet only a small fraction of them are related to the response. Variable selection is therefore inevitable, and consistent model selection is the primary concern. However, conventional consistent model selection criteria such as BIC may be inadequate due to their nonadaptivity to the model space and the infeasibility of exhaustive search. To address these two issues, we establish a lower bound on the probability that an information criterion selects the smallest true model, based on which we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with the adaptivity of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one, provided the smallest true model is selected by LAR. The proposed method is applied to real data from the power market, where it outperforms backward variable selection in terms of price forecasting accuracy.
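To make the LAR + information criterion combination concrete, here is a hedged sketch: walk the LAR path, refit least squares on each nested active set, and keep the model minimizing a RIC-style criterion. We use the classical risk inflation penalty c·k·log(p) as a stand-in; the paper's RIC(c) is adaptive to the model space and differs in form.

```python
import numpy as np
from sklearn.linear_model import lars_path

def lar_ric_select(X, y, c=2.0):
    """Score each model on the LAR path with a RIC-style criterion
    n*log(RSS/n) + c*k*log(p) and return the best active set. The exact
    RIC(c) penalty from the paper is adaptive and differs from this."""
    n, p = X.shape
    _, _, coefs = lars_path(X, y, method="lar")  # coefs: (p, n_steps)
    best_crit, best_support = np.inf, None
    for step in range(coefs.shape[1]):
        support = np.flatnonzero(coefs[:, step])
        k = support.size
        if k == 0:
            rss = np.sum((y - y.mean()) ** 2)    # intercept-only model
        else:
            beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
            rss = np.sum((y - X[:, support] @ beta) ** 2)
        crit = n * np.log(rss / n) + c * k * np.log(p)
        if crit < best_crit:
            best_crit, best_support = crit, support
    return best_support
```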
{"title":"Model selection procedure for high-dimensional data.","authors":"Yongli Zhang, Xiaotong Shen","doi":"10.1002/sam.10088","DOIUrl":"https://doi.org/10.1002/sam.10088","url":null,"abstract":"<p><p>For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RIC(c), which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 5","pages":"350-358"},"PeriodicalIF":1.3,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10088","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29500256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multicategory Composite Least Squares Classifiers
Seo Young Park, Yufeng Liu, Dacheng Liu, Paul Scholl
Statistical Analysis and Data Mining 3(4): 272-286 | Pub Date: 2010-08-01 | DOI: 10.1002/sam.10081
Classification is a very useful statistical tool for information extraction; in particular, multicategory classification arises in many applications. Although binary classification problems are heavily studied, extensions to the multicategory case have received much less attention. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that can handle problems with high dimensions and a large number of classes, and it is necessary for such classifiers to have sound theoretical properties. Several different versions of simultaneous multicategory Support Vector Machines (SVMs) exist in the literature. However, the computation of the SVM can be difficult for large-scale problems, especially problems with a large number of classes, and the SVM cannot produce class probability estimates directly. In this article, we propose a novel, efficient multicategory composite least squares (CLS) classifier that utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with a large number of classes, asymptotic consistency, the ability to handle high-dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate the competitive performance of the proposed approach.
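The abstract's composite squared loss is specific to the paper, but the least-squares flavor is easy to illustrate. Below is a hedged sketch of a plain one-vs-all least squares classifier on class indicator targets, with a naive normalization step for class probabilities; the paper's composite loss couples the class functions and has its own probability formula.

```python
import numpy as np

class LeastSquaresClassifier:
    """Generic squared-loss multicategory classifier (not the paper's CLS
    estimator): regress class indicator targets on the features, predict
    the argmax, and normalize clipped scores into rough probabilities."""
    def fit(self, X, y, ridge=1e-6):
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        Z = np.hstack([np.ones((X.shape[0], 1)), X])   # add an intercept
        T = np.eye(len(self.classes_))[y_idx]          # one-hot targets
        # A tiny ridge term keeps the normal equations well posed.
        self.W_ = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]),
                                  Z.T @ T)
        return self

    def predict_proba(self, X):
        F = np.hstack([np.ones((X.shape[0], 1)), X]) @ self.W_
        F = np.clip(F, 1e-9, None)       # raw scores can fall below zero
        return F / F.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```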
{"title":"Multicategory Composite Least Squares Classifiers.","authors":"Seo Young Park, Yufeng Liu, Dacheng Liu, Paul Scholl","doi":"10.1002/sam.10081","DOIUrl":"https://doi.org/10.1002/sam.10081","url":null,"abstract":"<p><p>Classification is a very useful statistical tool for information extraction. In particular, multicategory classification is commonly seen in various applications. Although binary classification problems are heavily studied, extensions to the multicategory case are much less so. In view of the increased complexity and volume of modern statistical problems, it is desirable to have multicategory classifiers that are able to handle problems with high dimensions and with a large number of classes. Moreover, it is necessary to have sound theoretical properties for the multicategory classifiers. In the literature, there exist several different versions of simultaneous multicategory Support Vector Machines (SVMs). However, the computation of the SVM can be difficult for large scale problems, especially for problems with large number of classes. Furthermore, the SVM cannot produce class probability estimation directly. In this article, we propose a novel efficient multicategory composite least squares classifier (CLS classifier), which utilizes a new composite squared loss function. The proposed CLS classifier has several important merits: efficient computation for problems with large number of classes, asymptotic consistency, ability to handle high dimensional data, and simple conditional class probability estimation. Our simulated and real examples demonstrate competitive performance of the proposed approach.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 4","pages":"272-286"},"PeriodicalIF":1.3,"publicationDate":"2010-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10081","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"29584342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}