High dimensional controlled variable selection with model-X knockoffs in the AFT model (Computational Statistics)
Pub Date: 2023-12-09 | DOI: 10.1007/s00180-023-01426-5
Baihua He, Di Xia, Yingli Pan
Interpretability and stability are two important characteristics required when applying high dimensional data in statistics. Although the former has been addressed by many existing prediction methods to some extent, the latter, in the sense of controlling the fraction of wrongly discovered features, is still largely underdeveloped. Under the accelerated failure time (AFT) model, this paper introduces a controlled variable selection method within the general framework of Model-X knockoffs to tackle high dimensional data. We provide theoretical justification for asymptotic false discovery rate (FDR) control. The proposed method is attractive because it tightly controls the FDR while preserving predictive power. Several simulation examples are conducted to assess the finite-sample performance and the desired interpretability and stability. A real data example from an acute myeloid leukemia study is analyzed to demonstrate the utility of the proposed method in practice.
{"title":"High dimensional controlled variable selection with model-X knockoffs in the AFT model","authors":"Baihua He, Di Xia, Yingli Pan","doi":"10.1007/s00180-023-01426-5","DOIUrl":"https://doi.org/10.1007/s00180-023-01426-5","url":null,"abstract":"<p>Interpretability and stability are two important characteristics required for the application of high dimensional data in statistics. Although the former has been favored by many existing forecasting methods to some extent, the latter in the sense of controlling the fraction of wrongly discovered features is still largely underdeveloped. Under the accelerated failure time model, this paper introduces a controlled variable selection method with the general framework of Model-X knockoffs to tackle high dimensional data. We provide theoretical justifications on the asymptotic false discovery rate (FDR) control. The proposed method has attracted significant interest due to its strong control of the FDR while preserving predictive power. Several simulation examples are conducted to assess the finite sample performance with desired interpretability and stability. A real data example from Acute Myeloid Leukemia study is analyzed to demonstrate the utility of the proposed method in practice.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"23 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138563591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimension reduction and visualization of multiple time series data: a symbolic data analysis approach (Computational Statistics)
Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01440-7
Emily Chia-Yu Su, Han-Ming Wu
Exploratory analysis and visualization of multiple time series data are essential for discovering the underlying dynamics of a series before attempting modeling and forecasting. This study extends two dimension reduction methods - principal component analysis (PCA) and sliced inverse regression (SIR) - to multiple time series data. This is achieved through the innovative path point approach, a new addition to the symbolic data analysis framework. By transforming multiple time series data into time-dependent intervals marked by starting and ending values, each series is geometrically represented as successive directed segments with unique path points. These path points serve as the foundation of our novel representation approach. PCA and SIR are then applied to the data table formed by the coordinates of these path points, enabling visualization of temporal trajectories of objects within a reduced-dimensional subspace. Empirical studies encompassing simulations, microarray time series data from a yeast cell cycle, and financial data confirm the effectiveness of our path point approach in revealing the structure and behavior of objects within a 2D factorial plane. Comparative analyses with existing methods, such as the applied vector approach for PCA and SIR on time-dependent interval data, further underscore the strength and versatility of our path point representation in the realm of time series data.
{"title":"Dimension reduction and visualization of multiple time series data: a symbolic data analysis approach","authors":"Emily Chia-Yu Su, Han-Ming Wu","doi":"10.1007/s00180-023-01440-7","DOIUrl":"https://doi.org/10.1007/s00180-023-01440-7","url":null,"abstract":"<p>Exploratory analysis and visualization of multiple time series data are essential for discovering the underlying dynamics of a series before attempting modeling and forecasting. This study extends two dimension reduction methods - principal component analysis (PCA) and sliced inverse regression (SIR) - to multiple time series data. This is achieved through the innovative path point approach, a new addition to the symbolic data analysis framework. By transforming multiple time series data into time-dependent intervals marked by starting and ending values, each series is geometrically represented as successive directed segments with unique path points. These path points serve as the foundation of our novel representation approach. PCA and SIR are then applied to the data table formed by the coordinates of these path points, enabling visualization of temporal trajectories of objects within a reduced-dimensional subspace. Empirical studies encompassing simulations, microarray time series data from a yeast cell cycle, and financial data confirm the effectiveness of our path point approach in revealing the structure and behavior of objects within a 2D factorial plane. Comparative analyses with existing methods, such as the applied vector approach for PCA and SIR on time-dependent interval data, further underscore the strength and versatility of our path point representation in the realm of time series data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"93 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138548069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An expectation maximization algorithm for the hidden Markov models with multiparameter Student-t observations (Computational Statistics)
Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01432-7
Emna Ghorbel, Mahdi Louati
Hidden Markov models are a class of probabilistic graphical models used to describe the evolution of a sequence of unknown variables from a set of observed variables. They are statistical models introduced by Baum and Petrie (JMA 101:789–810) and belong to the class of latent variable models. Initially developed and applied in the context of speech recognition, they have attracted much attention in many fields of application. The central objective of this work is an extension of these models. More precisely, we define multiparameter hidden Markov models, using multiple observation processes and the Riesz distribution on the space of symmetric matrices as a natural extension of the gamma distribution. Some basic related properties are discussed, and marginal and posterior distributions are derived. We use the forward-backward dynamic programming algorithm and the classical expectation-maximization algorithm to estimate the global set of parameters. The performance of these estimators is then assessed on simulated data with a Matlab program, which allows us to evaluate the quality of the proposed estimators through the mean squared errors between the true and the estimated values.
{"title":"An expectation maximization algorithm for the hidden markov models with multiparameter student-t observations","authors":"Emna Ghorbel, Mahdi Louati","doi":"10.1007/s00180-023-01432-7","DOIUrl":"https://doi.org/10.1007/s00180-023-01432-7","url":null,"abstract":"<p>Hidden Markov models are a class of probabilistic graphical models used to describe the evolution of a sequence of unknown variables from a set of observed variables. They are statistical models introduced by Baum and Petrie in Baum (JMA 101:789–810) and belong to the class of latent variable models. Initially developed and applied in the context of speech recognition, they have attracted much attention in many fields of application. The central objective of this research work is upon an extension of these models. More accurately, we define multiparameter hidden Markov models, using multiple observation processes and the Riesz distribution on the space of symmetric matrices as a natural extension of the gamma one. Some basic related properties are discussed and marginal and posterior distributions are derived. We conduct the Forward-Backward dynamic programming algorithm and the classical Expectation Maximization algorithm to estimate the global set of parameters. Using simulated data, the performance of these estimators is conveniently achieved by the Matlab program. This allows us to assess the quality of the proposed estimators by means of the mean square errors between the true and the estimated values.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":" 8","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138493829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions (Computational Statistics)
Pub Date: 2023-12-03 | DOI: 10.1007/s00180-023-01439-0
Sean Yiu
In clinical trials of longitudinal continuous outcomes, reference-based imputation (RBI) has commonly been applied to handle missing outcome data in settings where the estimand incorporates the effects of intercurrent events, e.g. treatment discontinuation. RBI was originally developed in the multiple imputation framework; however, conditional mean imputation (CMI) combined with the jackknife estimator of the standard error has recently been proposed as a way to obtain deterministic treatment effect estimates and correct frequentist inference. For both multiple imputation and CMI, a mixed model for repeated measures (MMRM) is often used as the imputation model, but it can be computationally intensive to fit to many data sets (e.g. the jackknife samples) and can lead to convergence issues for complex MMRM models with many parameters. Therefore, a step-wise approach based on sequential linear regression (SLR) of the outcomes at each visit was developed for the imputation model in the multiple imputation framework, but similar developments in the CMI framework are lacking. In this article, we fill this gap in the literature by proposing an SLR approach to implement RBI in the CMI framework and justify its validity using theoretical results and simulations. We also illustrate our proposal in a real data application.
{"title":"Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions","authors":"Sean Yiu","doi":"10.1007/s00180-023-01439-0","DOIUrl":"https://doi.org/10.1007/s00180-023-01439-0","url":null,"abstract":"<p>In clinical trials of longitudinal continuous outcomes, reference based imputation (RBI) has commonly been applied to handle missing outcome data in settings where the estimand incorporates the effects of intercurrent events, e.g. treatment discontinuation. RBI was originally developed in the multiple imputation framework, however recently conditional mean imputation (CMI) combined with the jackknife estimator of the standard error was proposed as a way to obtain deterministic treatment effect estimates and correct frequentist inference. For both multiple and CMI, a mixed model for repeated measures (MMRM) is often used for the imputation model, but this can be computationally intensive to fit to multiple data sets (e.g. the jackknife samples) and lead to convergence issues with complex MMRM models with many parameters. Therefore, a step-wise approach based on sequential linear regression (SLR) of the outcomes at each visit was developed for the imputation model in the multiple imputation framework, but similar developments in the CMI framework are lacking. In this article, we fill this gap in the literature by proposing a SLR approach to implement RBI in the CMI framework, and justify its validity using theoretical results and simulations. We also illustrate our proposal on a real data application.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":" 9","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138493828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pair programming with ChatGPT for sampling and estimation of copulas (Computational Statistics)
Pub Date: 2023-12-01 | DOI: 10.1007/s00180-023-01437-2
Jan Górecki
Without a human writing a single line of code, an example Monte Carlo simulation-based application for stochastic dependence modeling with copulas is developed through pair programming involving a human partner and a large language model (LLM) fine-tuned for conversations. This process encompasses interacting with ChatGPT using both natural language and mathematical formalism. Under the careful supervision of a human expert, this interaction facilitated the creation of functioning code in MATLAB, Python, and R. The code performs a variety of tasks including sampling from a given copula model, evaluating the model's density, conducting maximum likelihood estimation, optimizing for parallel computing on CPUs and GPUs, and visualizing the computed results. In contrast to other emerging studies that assess the accuracy of LLMs like ChatGPT on tasks from a selected area, this work instead investigates how to achieve a successful solution of a standard statistical task through collaboration between a human expert and artificial intelligence (AI). In particular, through careful prompt engineering, we separate successful solutions generated by ChatGPT from unsuccessful ones, resulting in a comprehensive list of related pros and cons. It is demonstrated that if the typical pitfalls are avoided, we can substantially benefit from collaborating with an AI partner. For example, we show that if ChatGPT is not able to provide a correct solution due to a lack of or incorrect knowledge, the human expert can feed it the correct knowledge, e.g., in the form of mathematical theorems and formulas, and have it apply the gained knowledge in order to provide a correct solution. Such an ability presents an attractive opportunity to obtain a programmed solution even for users with rather limited knowledge of programming techniques.
{"title":"Pair programming with ChatGPT for sampling and estimation of copulas","authors":"Jan Górecki","doi":"10.1007/s00180-023-01437-2","DOIUrl":"https://doi.org/10.1007/s00180-023-01437-2","url":null,"abstract":"<p>Without writing a single line of code by a human, an example Monte Carlo simulation-based application for stochastic dependence modeling with copulas is developed through pair programming involving a human partner and a large language model (LLM) fine-tuned for conversations. This process encompasses interacting with ChatGPT using both natural language and mathematical formalism. Under the careful supervision of a human expert, this interaction facilitated the creation of functioning code in MATLAB, Python, and <span>R</span>. The code performs a variety of tasks including sampling from a given copula model, evaluating the model’s density, conducting maximum likelihood estimation, optimizing for parallel computing on CPUs and GPUs, and visualizing the computed results. In contrast to other emerging studies that assess the accuracy of LLMs like ChatGPT on tasks from a selected area, this work rather investigates ways how to achieve a successful solution of a standard statistical task in a collaboration of a human expert and artificial intelligence (AI). Particularly, through careful prompt engineering, we separate successful solutions generated by ChatGPT from unsuccessful ones, resulting in a comprehensive list of related pros and cons. It is demonstrated that if the typical pitfalls are avoided, we can substantially benefit from collaborating with an AI partner. For example, we show that if ChatGPT is not able to provide a correct solution due to a lack of or incorrect knowledge, the human-expert can feed it with the correct knowledge, e.g., in the form of mathematical theorems and formulas, and make it to apply the gained knowledge in order to provide a correct solution. Such ability presents an attractive opportunity to achieve a programmed solution even for users with rather limited knowledge of programming techniques.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"26 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wavelet-based Bayesian approximate kernel method for high-dimensional data analysis (Computational Statistics)
Pub Date: 2023-11-26 | DOI: 10.1007/s00180-023-01438-1
Wenxing Guo, Xueying Zhang, Bei Jiang, Linglong Kong, Yaozhong Hu
Kernel methods are often used for nonlinear regression and classification in statistics and machine learning because they are computationally cheap and accurate. Wavelet kernel functions based on wavelet analysis can efficiently approximate arbitrary nonlinear functions. In this article, we construct a novel wavelet kernel function in terms of random wavelet bases and define a linear vector space that captures nonlinear structures in reproducing kernel Hilbert spaces (RKHS). Based on the wavelet transform, the data are mapped into a low-dimensional randomized feature space, converting the kernel function into operations of a linear machine. We then propose a new Bayesian approximate kernel model with the random wavelet expansion and use the Gibbs sampler to compute the model's parameters. Finally, simulation studies and analyses of two real data sets are carried out to demonstrate that the proposed method displays good stability and prediction performance compared with several existing methods.
{"title":"Wavelet-based Bayesian approximate kernel method for high-dimensional data analysis","authors":"Wenxing Guo, Xueying Zhang, Bei Jiang, Linglong Kong, Yaozhong Hu","doi":"10.1007/s00180-023-01438-1","DOIUrl":"https://doi.org/10.1007/s00180-023-01438-1","url":null,"abstract":"<p>Kernel methods are often used for nonlinear regression and classification in statistics and machine learning because they are computationally cheap and accurate. The wavelet kernel functions based on wavelet analysis can efficiently approximate any nonlinear functions. In this article, we construct a novel wavelet kernel function in terms of random wavelet bases and define a linear vector space that captures nonlinear structures in reproducing kernel Hilbert spaces (RKHS). Based on the wavelet transform, the data are mapped into a low-dimensional randomized feature space and convert kernel function into operations of a linear machine. We then propose a new Bayesian approximate kernel model with the random wavelet expansion and use the Gibbs sampler to compute the model’s parameters. Finally, some simulation studies and two real datasets analyses are carried out to demonstrate that the proposed method displays good stability, prediction performance compared to some other existing methods.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"49 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference F-type test (Computational Statistics)
Pub Date: 2023-11-24 | DOI: 10.1007/s00180-023-01433-6
Tianming Zhu, Pengfei Wang, Jin-Ting Zhang
The problem of testing the equality of mean vectors for high-dimensional data has been intensively investigated in the literature. However, most existing tests impose strong assumptions on the underlying group covariance matrices, which may not be satisfied or can hardly be checked in practice. In this article, an F-type test for two-sample Behrens–Fisher problems for high-dimensional data is proposed and studied. When the two samples are normally distributed and the null hypothesis holds, the proposed F-type test statistic is shown to be an F-type mixture, a ratio of two independent $\chi^2$-type mixtures. Under some regularity conditions and the null hypothesis, it is shown that the proposed F-type test statistic and the above F-type mixture have the same normal and non-normal limits. It is therefore justified to approximate the null distribution of the proposed F-type test statistic by that of the F-type mixture, resulting in the so-called normal reference F-type test. Since the F-type mixture is a ratio of two independent $\chi^2$-type mixtures, we apply the Welch–Satterthwaite $\chi^2$-approximation to the distributions of the numerator and the denominator of the F-type mixture, respectively, resulting in an approximating F-distribution whose degrees of freedom can be consistently estimated from the data. The asymptotic power of the proposed F-type test is established. Two simulation studies show that, in terms of size control, the proposed F-type test outperforms two existing competitors. The good performance of the proposed F-type test is also illustrated by a COVID-19 data example.
{"title":"Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference F-type test","authors":"Tianming Zhu, Pengfei Wang, Jin-Ting Zhang","doi":"10.1007/s00180-023-01433-6","DOIUrl":"https://doi.org/10.1007/s00180-023-01433-6","url":null,"abstract":"<p>The problem of testing the equality of mean vectors for high-dimensional data has been intensively investigated in the literature. However, most of the existing tests impose strong assumptions on the underlying group covariance matrices which may not be satisfied or hardly be checked in practice. In this article, an <i>F</i>-type test for two-sample Behrens–Fisher problems for high-dimensional data is proposed and studied. When the two samples are normally distributed and when the null hypothesis is valid, the proposed <i>F</i>-type test statistic is shown to be an <i>F</i>-type mixture, a ratio of two independent <span>(chi ^2)</span>-type mixtures. Under some regularity conditions and the null hypothesis, it is shown that the proposed <i>F</i>-type test statistic and the above <i>F</i>-type mixture have the same normal and non-normal limits. It is then justified to approximate the null distribution of the proposed <i>F</i>-type test statistic by that of the <i>F</i>-type mixture, resulting in the so-called normal reference <i>F</i>-type test. Since the <i>F</i>-type mixture is a ratio of two independent <span>(chi ^2)</span>-type mixtures, we employ the Welch–Satterthwaite <span>(chi ^2)</span>-approximation to the distributions of the numerator and the denominator of the <i>F</i>-type mixture respectively, resulting in an approximation <i>F</i>-distribution whose degrees of freedom can be consistently estimated from the data. The asymptotic power of the proposed <i>F</i>-type test is established. Two simulation studies are conducted and they show that in terms of size control, the proposed <i>F</i>-type test outperforms two existing competitors. The good performance of the proposed <i>F</i>-type test is also illustrated by a COVID-19 data example.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"18 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new bandwidth selection method for nonparametric modal regression based on generalized hyperbolic distributions (Computational Statistics)
Pub Date: 2023-11-18 | DOI: 10.1007/s00180-023-01435-4
Hongpeng Yuan, Sijia Xiang, Weixin Yao
As a complement to standard mean and quantile regression, nonparametric modal regression has been broadly applied in various fields. By focusing on the most likely conditional value of Y given x, nonparametric modal regression is resistant to outliers and some forms of measurement error, and its prediction intervals are shorter when the data are skewed. However, bandwidth selection is critical and very challenging, since the traditional least-squares-based cross-validation method cannot be applied. We propose to select the bandwidth by combining the asymptotic globally optimal bandwidth with the flexible generalized hyperbolic (GH) distribution as the error distribution. Unlike the plug-in method, the new method does not require preliminary parameters to be chosen in advance, is easy to compute with any statistical software, and is computationally efficient compared with the existing kernel density estimator (KDE) based method. Numerical studies show that the GH-based bandwidth performs better than existing bandwidth selectors in terms of higher coverage probabilities. Real data applications also illustrate the superior performance of the new bandwidth.
{"title":"A new bandwidth selection method for nonparametric modal regression based on generalized hyperbolic distributions","authors":"Hongpeng Yuan, Sijia Xiang, Weixin Yao","doi":"10.1007/s00180-023-01435-4","DOIUrl":"https://doi.org/10.1007/s00180-023-01435-4","url":null,"abstract":"<p>As a complement to standard mean and quantile regression, nonparametric modal regression has been broadly applied in various fields. By focusing on the most likely conditional value of Y given x, the nonparametric modal regression is shown to be resistant to outliers and some forms of measurement error, and the prediction intervals are shorter when data is skewed. However, the bandwidth selection is critical but very challenging, since the traditional least-squares based cross-validation method cannot be applied. We propose to select the bandwidth by applying the asymptotic global optimal bandwidth and the flexible generalized hyperbolic (GH) distribution as the distribution of the error. Unlike the plug-in method, the new method does not require preliminary parameters to be chosen in advance, is easy to compute by any statistical software, and is computationally efficient compared to the existing kernel density estimator (KDE) based method. Numerical studies show that the GH based bandwidth performs better than existing bandwidth selector, in terms of higher coverage probabilities. Real data applications also illustrate the superior performance of the new bandwidth.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"22 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous subgroup identification and variable selection for high dimensional data (Computational Statistics)
Pub Date: 2023-11-17 | DOI: 10.1007/s00180-023-01436-3
Huicong Yu, Jiaqi Wu, Weiping Zhang
The high dimensionality of genetic data poses many challenges for subgroup identification, both computationally and theoretically. This paper proposes a double-penalized regression model for subgroup analysis and variable selection for heterogeneous high-dimensional data. The proposed approach can automatically identify the underlying subgroups, recover the sparsity, and simultaneously estimate all regression coefficients without prior knowledge of the grouping structure or the sparsity pattern within the variables. We optimize the objective function using the alternating direction method of multipliers with a proximal gradient algorithm and demonstrate the convergence of the proposed procedure. We show that the proposed estimator enjoys the oracle property. Simulation studies demonstrate the effectiveness of the novel method with finite samples, and a real data example is provided for illustration.
{"title":"Simultaneous subgroup identification and variable selection for high dimensional data","authors":"Huicong Yu, Jiaqi Wu, Weiping Zhang","doi":"10.1007/s00180-023-01436-3","DOIUrl":"https://doi.org/10.1007/s00180-023-01436-3","url":null,"abstract":"<p>The high dimensionality of genetic data poses many challenges for subgroup identification, both computationally and theoretically. This paper proposes a double-penalized regression model for subgroup analysis and variable selection for heterogeneous high-dimensional data. The proposed approach can automatically identify the underlying subgroups, recover the sparsity, and simultaneously estimate all regression coefficients without prior knowledge of grouping structure or sparsity construction within variables. We optimize the objective function using the alternating direction method of multipliers with a proximal gradient algorithm and demonstrate the convergence of the proposed procedure. We show that the proposed estimator enjoys the oracle property. Simulation studies demonstrate the effectiveness of the novel method with finite samples, and a real data example is provided for illustration.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"47 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138516645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}