The support vector machine (SVM) is one of the most prevalent classification techniques due to its excellent performance. The standard binary SVM has been well studied. However, the many multicategory classification problems that arise in the real world deserve equal attention. In this paper, focusing on the computationally efficient multicategory angle‐based SVM model, we first study the statistical properties of model coefficient estimation. To address the new challenges posed by the widespread presence of distributed data, we further develop a distributed smoothed estimator for the multicategory SVM and establish its theoretical guarantees. The derived asymptotic properties show that our distributed smoothed estimator can achieve the same statistical efficiency as the global estimator. Numerical studies demonstrate the highly competitive performance of the proposed distributed smoothed method.
{"title":"Statistical inference and distributed implementation for linear multicategory SVM","authors":"Gaoming Sun, Xiaozhou Wang, Yibo Yan, Riquan Zhang","doi":"10.1002/sta4.611","DOIUrl":"https://doi.org/10.1002/sta4.611","url":null,"abstract":"Support vector machine (SVM) is one of the most prevalent classification techniques due to its excellent performance. The standard binary SVM has been well‐studied. However, a large number of multicategory classification problems in the real world are equally worth attention. In this paper, focusing on the computationally efficient multicategory angle‐based SVM model, we first study the statistical properties of model coefficient estimation. Notice that the new challenges posed by the widespread presence of distributed data, this paper further develops a distributed smoothed estimation for the multicategory SVM and establishes its theoretical guarantees. Through the derived asymptotic properties, it can be seen that our distributed smoothed estimation can achieve the same statistical efficiency as the global estimation. Numerical studies are performed to demonstrate the highly competitive performance of our proposed distributed smoothed method.","PeriodicalId":56159,"journal":{"name":"Stat","volume":"28 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78336503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivated by a genome‐wide association study on the glomerular filtration rate, we develop a new robust test for longitudinal data to detect the effects of biomarkers in high‐dimensional quantile regression, in the presence of prespecified control variables. The test is based on the sum of score‐type statistics deduced from conditional quantile regression. The test statistic is constructed in a working‐independent manner, but the calibration reflects the intrinsic within‐subject correlation. Therefore, the test takes advantage of the longitudinal structure of the data and provides more information than tests based on only one measurement per subject. Asymptotic properties of the proposed test statistic are established under both the null and local alternative hypotheses. Simulation studies show that the proposed test controls the family‐wise error rate well while providing competitive power. The proposed method is applied to the motivating glomerular filtration rate data to test the overall significance of a large number of candidate single‐nucleotide polymorphisms that are possibly associated with Type 1 diabetes, conditioning on the patients' demographics.
{"title":"Score‐based test in high‐dimensional quantile regression for longitudinal data with application to a glomerular filtration rate data","authors":"Yinfeng Wang, H. Wang, Yanlin Tang","doi":"10.1002/sta4.610","DOIUrl":"https://doi.org/10.1002/sta4.610","url":null,"abstract":"Motivated by a genome‐wide association study on the glomerular filtration rate, we develop a new robust test for longitudinal data to detect the effects of biomarkers in high‐dimensional quantile regression, in the presence of prespecified control variables. The test is based on the sum of score‐type statistics deduced from conditional quantile regression. The test statistic is constructed in a working‐independent manner, but the calibration reflects the intrinsic within‐subject correlation. Therefore, the test takes advantage of the feature of longitudinal data and provides more information than those based on only one measurement for each subject. Asymptotic properties of the proposed test statistic are established under both the null and local alternative hypotheses. Simulation studies show that the proposed test can control the family‐wise error rate well, while providing competitive power. The proposed method is applied to the motivating glomerular filtration rate data to test the overall significance of a large number of candidate single‐nucleotide polymorphisms that are possibly associated with the Type 1 diabetes, conditioning on the patients' demographics.","PeriodicalId":56159,"journal":{"name":"Stat","volume":"6 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81655787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For sensitivity analysis with stochastic counterfactuals, we introduce a methodology to characterize uncertainty in causal inference from natural experiments. Our sensitivity parameters are standardized measures of variation in propensity and prognosis probabilities, and one minus their geometric mean is an intuitive measure of randomness in the data‐generating process. Within our latent propensity‐prognosis model, we show how to compute, from contingency table data, a threshold of sufficient randomness for causal inference. If the actual randomness of the data‐generating process is greater than this threshold, then causal inference is warranted. We demonstrate our methodology with two example applications.
{"title":"An asymptotic threshold of sufficient randomness for causal inference","authors":"B. Knaeble, B. Osting, P. Tshiaba","doi":"10.1002/sta4.609","DOIUrl":"https://doi.org/10.1002/sta4.609","url":null,"abstract":"For sensitivity analysis with stochastic counterfactuals, we introduce a methodology to characterize uncertainty in causal inference from natural experiments. Our sensitivity parameters are standardized measures of variation in propensity and prognosis probabilities, and one minus their geometric mean is an intuitive measure of randomness in the data generating process. Within our latent propensity‐prognosis model, we show how to compute, from contingency table data, a threshold, , of sufficient randomness for causal inference. If the actual randomness of the data generating process is greater than this threshold, then causal inference is warranted. We demonstrate our methodology with two example applications.","PeriodicalId":56159,"journal":{"name":"Stat","volume":"1 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90173354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
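The randomness measure described in the abstract above (one minus the geometric mean of the two sensitivity parameters) can be illustrated with a short calculation. This is only a sketch: the function and argument names are hypothetical, the parameters are assumed to lie in [0, 1], and the threshold is taken as a given input because the abstract does not say how it is computed from the contingency table.

```python
import math


def randomness(propensity_variation, prognosis_variation):
    """One minus the geometric mean of the two sensitivity parameters.

    Assumption: both standardized parameters lie in [0, 1].
    """
    for value in (propensity_variation, prognosis_variation):
        if not 0.0 <= value <= 1.0:
            raise ValueError("sensitivity parameters are assumed to lie in [0, 1]")
    return 1.0 - math.sqrt(propensity_variation * prognosis_variation)


def causal_inference_warranted(actual_randomness, threshold):
    """Decision rule from the abstract: randomness must exceed the threshold."""
    return actual_randomness > threshold


# Illustrative values only; they do not come from the paper.
r = randomness(0.25, 0.16)
print(round(r, 3))
print(causal_inference_warranted(r, threshold=0.6))
```

With these made-up inputs the geometric mean is 0.2, so the randomness is 0.8, which exceeds the illustrative threshold of 0.6.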
Deep neural network (DNN) models have achieved state‐of‐the‐art predictive accuracy in a wide range of applications. However, it remains a challenging task to accurately quantify the uncertainty in DNN predictions, especially those of continuous outcomes. To this end, we propose the Bayesian deep noise neural network (B‐DeepNoise), which generalizes standard Bayesian DNNs by extending the random noise variable from the output layer to all hidden layers. Our model is capable of approximating highly complex predictive density functions and fully learning the possible random variation in the outcome variables. For posterior computation, we provide a closed‐form Gibbs sampling algorithm that circumvents tuning‐intensive Metropolis–Hastings methods. We establish a recursive representation of the predictive density and perform theoretical analysis on the predictive variance. Through extensive experiments, we demonstrate the superiority of B‐DeepNoise over existing methods in terms of density estimation and uncertainty quantification accuracy. A neuroimaging application is included to show our model's usefulness in scientific studies.
{"title":"Density regression and uncertainty quantification with Bayesian deep noise neural networks","authors":"Daiwei Zhang, Tianci Liu, Jian Kang","doi":"10.1002/sta4.604","DOIUrl":"https://doi.org/10.1002/sta4.604","url":null,"abstract":"Deep neural network (DNN) models have achieved state‐of‐the‐art predictive accuracy in a wide range of applications. However, it remains a challenging task to accurately quantify the uncertainty in DNN predictions, especially those of continuous outcomes. To this end, we propose the Bayesian deep noise neural network (B‐DeepNoise), which generalizes standard Bayesian DNNs by extending the random noise variable from the output layer to all hidden layers. Our model is capable of approximating highly complex predictive density functions and fully learn the possible random variation in the outcome variables. For posterior computation, we provide a closed‐form Gibbs sampling algorithm that circumvents tuning‐intensive Metropolis–Hastings methods. We establish a recursive representation of the predictive density and perform theoretical analysis on the predictive variance. Through extensive experiments, we demonstrate the superiority of B‐DeepNoise over existing methods in terms of density estimation and uncertainty quantification accuracy. A neuroimaging application is included to show our model's usefulness in scientific studies.","PeriodicalId":56159,"journal":{"name":"Stat","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136020291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
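The idea in the B‐DeepNoise abstract above of moving the random noise variable from the output layer into every hidden layer can be sketched with a toy forward sampler. The sketch makes several assumptions that the abstract does not confirm: the noise is Gaussian, it is added to each layer's pre-activation, the activation is tanh, and the weights are held fixed (the paper places a posterior over them and samples it with a Gibbs algorithm, which is not shown here). All names are hypothetical.

```python
import math
import random


def forward_sample(x, layers, hidden_noise_sd, output_noise_sd, rng):
    """Draw one output from a toy network with Gaussian noise at every layer.

    layers is a list of (weights, biases) pairs; weights is a list of rows.
    Setting hidden_noise_sd to 0 gives the usual output-noise-only model.
    """
    h = list(x)
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        sd = output_noise_sd if i == last else hidden_noise_sd
        # Affine map plus layer-specific Gaussian noise (assumed form).
        z = [
            sum(w * v for w, v in zip(row, h)) + b + rng.gauss(0.0, sd)
            for row, b in zip(weights, biases)
        ]
        # Hidden layers apply tanh; the output layer is left linear.
        h = z if i == last else [math.tanh(v) for v in z]
    return h


# Illustrative 2-2-1 network with made-up weights.
layers = [
    ([[0.5, -0.3], [0.8, 0.1]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.2]),
]
rng = random.Random(0)
draws = [forward_sample([1.0, 2.0], layers, 0.5, 0.1, rng)[0] for _ in range(5)]
print(draws)
```

Because the hidden-layer noise passes through the nonlinearity, the resulting output distribution need not be Gaussian even though every noise term is, which is the intuition behind the richer predictive densities the abstract mentions.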
Although there is an extensive literature on feature selection for the Cox model, none of the existing approaches can control the false discovery rate (FDR) unless the sample size tends to infinity. In addition, there is no formal power analysis of the knockoffs framework for survival data in the literature. To address these issues, we propose a novel controlled feature selection approach using knockoffs for the Cox model. We establish that the proposed method achieves FDR control in finite samples regardless of the number of covariates. Moreover, under mild regularity conditions, we show that the power of our method tends to one as the sample size tends to infinity. To the best of our knowledge, this is the first formal theoretical result on the power of the knockoffs procedure in the survival setting. Simulation studies confirm that our method has appealing finite-sample performance, with the desired FDR control and high power. We further demonstrate the performance of our method through a real data example.
{"title":"CoxKnockoff: Controlled feature selection for the Cox model using knockoffs","authors":"Daoji Li, Jinzhao Yu, Hui Zhao","doi":"10.1002/sta4.607","DOIUrl":"https://doi.org/10.1002/sta4.607","url":null,"abstract":"Although there is a huge literature on feature selection for the Cox model, none of the existing approaches can control the false discovery rate (FDR) unless the sample size tends to infinity. In addition, there is no formal power analysis of the knockoffs framework for survival data in the literature. To address those issues, in this paper, we propose a novel controlled feature selection approach using knockoffs for the Cox model. We establish that the proposed method enjoys the FDR control in finite samples regardless of the number of covariates. Moreover, under mild regularity conditions, we also show that the power of our method is asymptotically one as sample size tends to infinity. To the best of our knowledge, this is the first formal theoretical result on the power for the knockoffs procedure in the survival setting. Simulation studies confirm that our method has appealing finite-sample performance with desired FDR control and high power. We further demonstrate the performance of our method through a real data example.","PeriodicalId":56159,"journal":{"name":"Stat","volume":"25 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86667170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
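For readers unfamiliar with knockoffs, the selection step that gives finite-sample FDR control in the general knockoff filter can be sketched as follows. This is the standard knockoff+ threshold, shown here as an illustration; the abstract above does not state how CoxKnockoff constructs its knockoff variables or its feature statistics, so the statistics `w` are simply taken as given, and the function names are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t with (1 + #{w_j <= -t}) / max(1, #{w_j >= t}) <= q.

    w holds one knockoff statistic per feature: large positive values
    favour the original feature, negative values favour its knockoff.
    Returns infinity when no threshold meets the target level q.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        estimated_false = 1 + sum(1 for x in w if x <= -t)
        selected = max(1, sum(1 for x in w if x >= t))
        if estimated_false / selected <= q:
            return t
    return float("inf")


def knockoff_select(w, q):
    """Indices of the features selected at target FDR level q."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Made-up statistics for ten features, target FDR level 0.2.
w = [5.0, 4.0, 3.5, 3.0, -0.5, 2.5, 0.2, -0.1, 2.0, 1.5]
print(knockoff_plus_threshold(w, 0.2))
print(knockoff_select(w, 0.2))
```

The extra 1 in the numerator is what makes the estimate of false discoveries conservative enough for exact finite-sample control; with very few features, as in the second test case below, it can also mean that nothing is selected.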
{"title":"A trinomial difference autoregressive model and its applications","authors":"Huaping Chen, Jiayue Zhang, Fukang Zhu","doi":"10.1002/sta4.596","DOIUrl":"https://doi.org/10.1002/sta4.596","url":null,"abstract":"","PeriodicalId":56159,"journal":{"name":"Stat","volume":"22 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82416971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dirichlet process mixture models using matrix‐generalized half‐t distribution","authors":"Sanghyun Lee, C. Kim","doi":"10.1002/sta4.599","DOIUrl":"https://doi.org/10.1002/sta4.599","url":null,"abstract":"","PeriodicalId":56159,"journal":{"name":"Stat","volume":"56 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75091018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In block designs, the responses of plots are potentially influenced by the treatments of neighbouring plots and the surrounding environment. Many researchers place two guard plots next to the edge plots and apply certain treatments to them to control these environmental effects. Thus, a design is presented as a collection of treatment sequences. For the estimation of total effects, existing results consider circular designs, whose constraints are unnecessary in common applications. In this paper, we construct optimal or highly efficient non-circular designs under interference models. We observe that the optimal non-circular designs for the total effects outperform the optimal circular designs in many instances. In fact, a design containing a circular sequence cannot be optimal for