
Computational Statistics: Latest Articles

High dimensional controlled variable selection with model-X knockoffs in the AFT model
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-09 | DOI: 10.1007/s00180-023-01426-5
Baihua He, Di Xia, Yingli Pan

Interpretability and stability are two important characteristics required for the application of high dimensional data in statistics. Although the former has been favored by many existing forecasting methods to some extent, the latter, in the sense of controlling the fraction of wrongly discovered features, is still largely underdeveloped. Under the accelerated failure time (AFT) model, this paper introduces a controlled variable selection method within the general framework of Model-X knockoffs to tackle high dimensional data. We provide theoretical justification for asymptotic false discovery rate (FDR) control. The proposed method has attracted significant interest due to its strong control of the FDR while preserving predictive power. Several simulation examples are conducted to assess the finite sample performance with the desired interpretability and stability. A real data example from an Acute Myeloid Leukemia study is analyzed to demonstrate the utility of the proposed method in practice.
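
The selection step common to knockoff procedures can be sketched as follows. This is a minimal illustration of the generic knockoff+ threshold, assuming the feature statistics W_j (large positive values suggest a real signal, and signs are symmetric for null features) have already been computed. How those statistics are built under the AFT model is not stated in the abstract, and the function name `knockoff_plus_select` is hypothetical.

```python
import numpy as np

def knockoff_plus_select(W, q=0.1):
    """Select features with the knockoff+ threshold at target FDR level q.

    W : knockoff feature statistics, one per feature (assumed precomputed).
    Returns (selected_indices, threshold); the threshold is inf if nothing
    can be selected at level q.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Estimated false discovery proportion at threshold t (the "+1" is the
        # knockoff+ correction).
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t), t
    return np.array([], dtype=int), np.inf

# Illustrative statistics (made up for this example, not from the paper).
W = np.array([5.0, 4.0, 3.0, -1.0, 2.0, 0.5, -0.3])
selected, tau = knockoff_plus_select(W, q=0.5)
print(selected, tau)
```

The loop returns the smallest threshold whose estimated false discovery proportion is at most q, so a stricter q yields a higher threshold or no selection at all.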

Citations: 0
Dimension reduction and visualization of multiple time series data: a symbolic data analysis approach
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01440-7
Emily Chia-Yu Su, Han-Ming Wu

Exploratory analysis and visualization of multiple time series data are essential for discovering the underlying dynamics of a series before attempting modeling and forecasting. This study extends two dimension reduction methods - principal component analysis (PCA) and sliced inverse regression (SIR) - to multiple time series data. This is achieved through the innovative path point approach, a new addition to the symbolic data analysis framework. By transforming multiple time series data into time-dependent intervals marked by starting and ending values, each series is geometrically represented as successive directed segments with unique path points. These path points serve as the foundation of our novel representation approach. PCA and SIR are then applied to the data table formed by the coordinates of these path points, enabling visualization of temporal trajectories of objects within a reduced-dimensional subspace. Empirical studies encompassing simulations, microarray time series data from a yeast cell cycle, and financial data confirm the effectiveness of our path point approach in revealing the structure and behavior of objects within a 2D factorial plane. Comparative analyses with existing methods, such as the applied vector approach for PCA and SIR on time-dependent interval data, further underscore the strength and versatility of our path point representation in the realm of time series data.

Citations: 0
An expectation maximization algorithm for the hidden Markov models with multiparameter Student-t observations
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01432-7
Emna Ghorbel, Mahdi Louati

Hidden Markov models are a class of probabilistic graphical models used to describe the evolution of a sequence of unknown variables from a set of observed variables. They are statistical models introduced by Baum and Petrie in Baum (JMA 101:789–810) and belong to the class of latent variable models. Initially developed and applied in the context of speech recognition, they have attracted much attention in many fields of application. The central objective of this research work is an extension of these models. More precisely, we define multiparameter hidden Markov models, using multiple observation processes and the Riesz distribution on the space of symmetric matrices as a natural extension of the gamma distribution. Some basic related properties are discussed, and marginal and posterior distributions are derived. We apply the Forward-Backward dynamic programming algorithm and the classical Expectation Maximization algorithm to estimate the global set of parameters. Using simulated data, the performance of these estimators is evaluated with a Matlab program. This allows us to assess the quality of the proposed estimators by means of the mean square errors between the true and the estimated values.
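
The Forward-Backward recursion mentioned above can be illustrated on a much simpler model. The sketch below assumes a hidden Markov model with discrete emissions, not the Riesz observation model of the paper, and uses the standard scaled recursions; the function name and the parameter values are hypothetical.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """Scaled forward-backward pass for a discrete-emission HMM.

    pi  : (K,) initial state probabilities
    A   : (K, K) transition matrix, A[i, j] = P(z_t = j | z_{t-1} = i)
    B   : (K, M) emission matrix, B[k, m] = P(x_t = m | z_t = k)
    obs : sequence of observed symbols in {0, ..., M-1}
    Returns (gamma, loglik): posterior state probabilities, shape (T, K),
    and the log-likelihood of the observation sequence.
    """
    pi, A, B = (np.asarray(a, dtype=float) for a in (pi, A, B))
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    c = np.zeros(T)  # scaling constants, c[t] = P(x_t | x_1..x_{t-1})

    # Forward pass with normalisation at every step.
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # Backward pass, reusing the same scaling constants.
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta  # each row sums to one
    return gamma, np.log(c).sum()

# Toy two-state example (values chosen only for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
gamma, loglik = forward_backward(pi, A, B, [0, 1])
print(gamma, np.exp(loglik))
```

In an Expectation Maximization loop, the posteriors `gamma` (together with pairwise state posteriors, omitted here) would supply the expected sufficient statistics for the parameter updates.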

Citations: 0
Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-03 | DOI: 10.1007/s00180-023-01439-0
Sean Yiu

In clinical trials of longitudinal continuous outcomes, reference-based imputation (RBI) has commonly been applied to handle missing outcome data in settings where the estimand incorporates the effects of intercurrent events, e.g. treatment discontinuation. RBI was originally developed in the multiple imputation framework; however, conditional mean imputation (CMI) combined with the jackknife estimator of the standard error was recently proposed as a way to obtain deterministic treatment effect estimates and correct frequentist inference. For both multiple imputation and CMI, a mixed model for repeated measures (MMRM) is often used as the imputation model, but this can be computationally intensive to fit to multiple data sets (e.g. the jackknife samples) and can lead to convergence issues for complex MMRMs with many parameters. Therefore, a step-wise approach based on sequential linear regression (SLR) of the outcomes at each visit was developed for the imputation model in the multiple imputation framework, but similar developments in the CMI framework are lacking. In this article, we fill this gap in the literature by proposing an SLR approach to implement RBI in the CMI framework, and justify its validity using theoretical results and simulations. We also illustrate our proposal on a real data application.

Citations: 0
Pair programming with ChatGPT for sampling and estimation of copulas
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-01 | DOI: 10.1007/s00180-023-01437-2
Jan Górecki

Without a human writing a single line of code, an example Monte Carlo simulation-based application for stochastic dependence modeling with copulas is developed through pair programming involving a human partner and a large language model (LLM) fine-tuned for conversations. This process encompasses interacting with ChatGPT using both natural language and mathematical formalism. Under the careful supervision of a human expert, this interaction facilitated the creation of functioning code in MATLAB, Python, and R. The code performs a variety of tasks including sampling from a given copula model, evaluating the model’s density, conducting maximum likelihood estimation, optimizing for parallel computing on CPUs and GPUs, and visualizing the computed results. In contrast to other emerging studies that assess the accuracy of LLMs like ChatGPT on tasks from a selected area, this work instead investigates how to achieve a successful solution of a standard statistical task through collaboration between a human expert and artificial intelligence (AI). In particular, through careful prompt engineering, we separate successful solutions generated by ChatGPT from unsuccessful ones, resulting in a comprehensive list of related pros and cons. It is demonstrated that if the typical pitfalls are avoided, we can substantially benefit from collaborating with an AI partner. For example, we show that if ChatGPT is not able to provide a correct solution due to missing or incorrect knowledge, the human expert can feed it the correct knowledge, e.g., in the form of mathematical theorems and formulas, and make it apply the gained knowledge in order to provide a correct solution. Such an ability presents an attractive opportunity to achieve a programmed solution even for users with rather limited knowledge of programming techniques.
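
The first two tasks listed above, sampling from a copula and evaluating its density, can be sketched for one concrete family. The abstract does not name the copula model, so the Clayton copula is used here purely as an assumption; the sampler follows the Marshall-Olkin frailty construction, and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def sample_clayton(n, theta, dim=2, seed=None):
    """Draw n samples from a dim-variate Clayton copula with parameter theta > 0.

    Marshall-Olkin construction: with V ~ Gamma(1/theta, 1) and independent
    E_i ~ Exp(1), the vector U_i = (1 + E_i / V) ** (-1 / theta) has the
    Clayton copula as its joint distribution.
    """
    rng = np.random.default_rng(seed)
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))
    e = rng.exponential(scale=1.0, size=(n, dim))
    return (1.0 + e / v) ** (-1.0 / theta)

def clayton_density(u, v, theta):
    """Density of the bivariate Clayton copula at (u, v) in (0, 1)^2."""
    return ((1.0 + theta) * (u * v) ** (-theta - 1.0)
            * (u ** -theta + v ** -theta - 1.0) ** (-2.0 - 1.0 / theta))

# Sample and evaluate the log-likelihood at the true parameter.
U = sample_clayton(1000, theta=2.0, seed=0)
loglik = np.sum(np.log(clayton_density(U[:, 0], U[:, 1], 2.0)))
print(U[:3], loglik)
```

Maximum likelihood estimation would then maximise this log-likelihood over theta with a one-dimensional optimiser. For the Clayton family, Kendall's tau equals theta / (theta + 2), which gives a quick sanity check on the sampler.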

Citations: 0
Wavelet-based Bayesian approximate kernel method for high-dimensional data analysis
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-11-26 | DOI: 10.1007/s00180-023-01438-1
Wenxing Guo, Xueying Zhang, Bei Jiang, Linglong Kong, Yaozhong Hu

Kernel methods are often used for nonlinear regression and classification in statistics and machine learning because they are computationally cheap and accurate. The wavelet kernel functions based on wavelet analysis can efficiently approximate any nonlinear function. In this article, we construct a novel wavelet kernel function in terms of random wavelet bases and define a linear vector space that captures nonlinear structures in reproducing kernel Hilbert spaces (RKHS). Based on the wavelet transform, the data are mapped into a low-dimensional randomized feature space, and the kernel function is converted into operations of a linear machine. We then propose a new Bayesian approximate kernel model with the random wavelet expansion and use the Gibbs sampler to compute the model’s parameters. Finally, some simulation studies and two real data analyses are carried out to demonstrate that the proposed method displays good stability and prediction performance compared to some other existing methods.

Citations: 0
Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference F-type test
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-11-24 | DOI: 10.1007/s00180-023-01433-6
Tianming Zhu, Pengfei Wang, Jin-Ting Zhang

The problem of testing the equality of mean vectors for high-dimensional data has been intensively investigated in the literature. However, most of the existing tests impose strong assumptions on the underlying group covariance matrices which may not be satisfied or can hardly be checked in practice. In this article, an F-type test for two-sample Behrens–Fisher problems for high-dimensional data is proposed and studied. When the two samples are normally distributed and the null hypothesis is valid, the proposed F-type test statistic is shown to be an F-type mixture, a ratio of two independent \(\chi^2\)-type mixtures. Under some regularity conditions and the null hypothesis, it is shown that the proposed F-type test statistic and the above F-type mixture have the same normal and non-normal limits. It is then justified to approximate the null distribution of the proposed F-type test statistic by that of the F-type mixture, resulting in the so-called normal reference F-type test. Since the F-type mixture is a ratio of two independent \(\chi^2\)-type mixtures, we apply the Welch–Satterthwaite \(\chi^2\)-approximation to the distributions of the numerator and the denominator of the F-type mixture respectively, resulting in an approximate F-distribution whose degrees of freedom can be consistently estimated from the data. The asymptotic power of the proposed F-type test is established. Two simulation studies are conducted and they show that in terms of size control, the proposed F-type test outperforms two existing competitors. The good performance of the proposed F-type test is also illustrated by a COVID-19 data example.
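
The Welch–Satterthwaite step described above amounts to matching two moments. As a minimal sketch, assume a chi-square-type mixture T = sum_r lambda_r * A_r with independent A_r ~ chi-square(1) and known weights lambda_r; in the paper the corresponding quantities are estimated from data, which is not reproduced here. Matching the mean and variance of T to those of beta * chi-square(d) gives beta = sum(lambda^2) / sum(lambda) and d = sum(lambda)^2 / sum(lambda^2). The function names are hypothetical.

```python
import numpy as np

def ws_chi2_approx(lambdas):
    """Welch-Satterthwaite approximation of sum_r lambda_r * chi2_1 by beta * chi2_d.

    Matches E[T] = sum(lambda) = beta * d and
    Var[T] = 2 * sum(lambda^2) = 2 * beta^2 * d.
    Returns (beta, d).
    """
    lam = np.asarray(lambdas, dtype=float)
    s1, s2 = lam.sum(), np.sum(lam ** 2)
    return s2 / s1, s1 ** 2 / s2

def f_type_approx(lam_num, lam_den):
    """Approximate a ratio of two independent mixtures by scale * F(d1, d2).

    Returns (scale, d1, d2), where scale = (beta1 * d1) / (beta2 * d2).
    """
    b1, d1 = ws_chi2_approx(lam_num)
    b2, d2 = ws_chi2_approx(lam_den)
    return (b1 * d1) / (b2 * d2), d1, d2

# Equal weights recover an exact chi-square: four unit weights give beta = 1, d = 4.
print(ws_chi2_approx([1.0, 1.0, 1.0, 1.0]))
print(f_type_approx([2.0, 1.0, 1.0], [1.0, 1.0]))
```

Unequal weights give a non-integer d smaller than the number of terms, which is why the resulting F-distribution generally has fractional degrees of freedom.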

Citations: 0
A new bandwidth selection method for nonparametric modal regression based on generalized hyperbolic distributions
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-11-18 | DOI: 10.1007/s00180-023-01435-4
Hongpeng Yuan, Sijia Xiang, Weixin Yao

As a complement to standard mean and quantile regression, nonparametric modal regression has been broadly applied in various fields. By focusing on the most likely conditional value of Y given x, nonparametric modal regression is shown to be resistant to outliers and some forms of measurement error, and its prediction intervals are shorter when the data are skewed. However, bandwidth selection is critical but very challenging, since the traditional least-squares based cross-validation method cannot be applied. We propose to select the bandwidth by applying the asymptotic global optimal bandwidth with the flexible generalized hyperbolic (GH) distribution as the distribution of the error. Unlike the plug-in method, the new method does not require preliminary parameters to be chosen in advance, is easy to compute in any statistical software, and is computationally efficient compared to the existing kernel density estimator (KDE) based method. Numerical studies show that the GH based bandwidth performs better than existing bandwidth selectors in terms of higher coverage probabilities. Real data applications also illustrate the superior performance of the new bandwidth.

引用次数: 0
Simultaneous subgroup identification and variable selection for high dimensional data
IF 1.3 Zone 4 Mathematics Q3 STATISTICS & PROBABILITY Pub Date: 2023-11-17 DOI: 10.1007/s00180-023-01436-3
Huicong Yu, Jiaqi Wu, Weiping Zhang

The high dimensionality of genetic data poses many challenges for subgroup identification, both computationally and theoretically. This paper proposes a double-penalized regression model for subgroup analysis and variable selection for heterogeneous high-dimensional data. The proposed approach can automatically identify the underlying subgroups, recover the sparsity, and simultaneously estimate all regression coefficients without prior knowledge of grouping structure or sparsity construction within variables. We optimize the objective function using the alternating direction method of multipliers with a proximal gradient algorithm and demonstrate the convergence of the proposed procedure. We show that the proposed estimator enjoys the oracle property. Simulation studies demonstrate the effectiveness of the novel method with finite samples, and a real data example is provided for illustration.

Nonparametric estimation of expected shortfall for α-mixing financial losses
Zone 4 Mathematics Q3 STATISTICS & PROBABILITY Pub Date: 2023-11-14 DOI: 10.1007/s00180-023-01434-5
Xuejun Wang, Yi Wu, Wei Wang