
Computational Statistics: Latest Articles

Structured dictionary learning of rating migration matrices for credit risk modeling
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-01-10 | DOI: 10.1007/s00180-023-01449-y

Abstract

Rating migration matrices are central to assessing credit risk. Modeling and predicting these matrices is therefore of great importance for risk managers in any financial institution. As a challenger to the usual parametric modeling approaches, we propose a new structured dictionary learning model with auto-regressive regularization that meets key expectations and constraints: a small amount of data, fast evolution of these matrices over time, and economic interpretability of the calibrated model. To show the model's applicability, we present a numerical test on both synthetic and real data and a comparison with the widely used parametric Gaussian Copula model: our new approach based on dictionary learning significantly outperforms the Gaussian Copula model.
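For readers unfamiliar with the object being modeled, the following is a minimal sketch of how a rating migration matrix can be estimated from observed one-period rating transitions by simple row-normalized counting. It is not the dictionary learning model of the paper; the function name `migration_matrix`, the rating labels, and the convention of leaving unobserved ratings unchanged are all assumptions made for illustration.

```python
from collections import Counter


def migration_matrix(transitions, ratings):
    """Estimate a one-period migration matrix by counting.

    transitions: iterable of (rating_at_start, rating_at_end) pairs.
    ratings: ordered list of rating labels (rows and columns).
    Entry [i][j] is the observed fraction of obligors that moved
    from ratings[i] to ratings[j].
    """
    counts = Counter(transitions)
    matrix = []
    for start in ratings:
        row_total = sum(counts[(start, end)] for end in ratings)
        if row_total == 0:
            # Assumption: with no observations, keep the rating unchanged.
            row = [1.0 if start == end else 0.0 for end in ratings]
        else:
            row = [counts[(start, end)] / row_total for end in ratings]
        matrix.append(row)
    return matrix


# Hypothetical toy data: three ratings, "D" standing for default.
example_ratings = ["A", "B", "D"]
example_transitions = [
    ("A", "A"), ("A", "A"), ("A", "B"), ("A", "A"),
    ("B", "B"), ("B", "D"),
]
example_matrix = migration_matrix(example_transitions, example_ratings)
for label, row in zip(example_ratings, example_matrix):
    print(label, row)
```

Each row sums to one, so the matrix can be read as a set of transition probabilities. With few obligors per rating, such counts are noisy, which is the small-data difficulty the abstract refers to.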

Citations: 0
A latent variable approach for modeling recall-based time-to-event data with Weibull distribution
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-01-03 | DOI: 10.1007/s00180-023-01444-3

Abstract

The ability of individuals to recall events is influenced by the time interval between the monitoring time and the occurrence of the event. In this article, we introduce a non-recall probability function that incorporates this information into our modeling framework. We model the time-to-event using the Weibull distribution and adopt a latent variable approach to handle situations where recall is not possible. In the classical framework, we obtain point estimators using the expectation-maximization algorithm and construct the observed Fisher information matrix using the missing information principle. Within the Bayesian paradigm, we derive point estimators under a suitable choice of priors and calculate highest posterior density intervals using Markov Chain Monte Carlo samples. To assess the performance of the proposed estimators, we conduct an extensive simulation study. Additionally, we use age at menarche and breastfeeding datasets as examples to illustrate the effectiveness of the proposed methodology.

Citations: 0
Testing for linearity in scalar-on-function regression with responses missing at random
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-01-03 | DOI: 10.1007/s00180-023-01445-2
Manuel Febrero-Bande, Pedro Galeano, Eduardo García-Portugués, Wenceslao González-Manteiga

A goodness-of-fit test for the Functional Linear Model with Scalar Response (FLMSR) with responses Missing at Random (MAR) is proposed in this paper. The test statistic relies on a marked empirical process indexed by the projected functional covariate and its distribution under the null hypothesis is calibrated using a wild bootstrap procedure. The computation and performance of the test rely on having an accurate estimator of the functional slope of the FLMSR when the sample has MAR responses. Three estimation methods based on the Functional Principal Components (FPCs) of the covariate are considered. First, the simplified method estimates the functional slope by simply discarding observations with missing responses. Second, the imputed method estimates the functional slope by imputing the missing responses using the simplified estimator. Third, the inverse probability weighted method incorporates the missing response generation mechanism when imputing. Furthermore, both cross-validation and LASSO regression are used to select the FPCs used by each estimator. Several Monte Carlo experiments are conducted to analyze the behavior of the testing procedure in combination with the functional slope estimators. Results indicate that estimators performing missing-response imputation achieve the highest power. The testing procedure is applied to check for linear dependence between the average number of sunny days per year and the mean curve of daily temperatures at weather stations in Spain.

Citations: 0
Estimation and prediction with data quality indexes in linear regressions
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-20 | DOI: 10.1007/s00180-023-01441-6

Abstract

Although many statistical applications brush the question of data quality aside, it is a fundamental concern inherent to external data collection. In this paper, data quality relates to the confidence one can have in the covariate values in a regression framework. More precisely, we study how to integrate the data quality information given by an $(n \times p)$ matrix, with $n$ the number of individuals and $p$ the number of explanatory variables. In this view, we suggest a latent variable model that drives the generation of the covariate values, and introduce a new algorithm that takes all this information into account for prediction. Our approach provides unbiased estimators of the regression coefficients and allows predictions to be adapted to a given quality pattern. The usefulness of our procedure is illustrated through simulations and real-life applications.

Citations: 0
An extended Langevinized ensemble Kalman filter for non-Gaussian dynamic systems
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-14 | DOI: 10.1007/s00180-023-01443-4
Peiyi Zhang, Tianning Dong, Faming Liang

State estimation for large-scale non-Gaussian dynamic systems remains an unresolved issue, given the nonscalability of the existing particle filter algorithms. To address this issue, this paper extends the Langevinized ensemble Kalman filter (LEnKF) algorithm to non-Gaussian dynamic systems by introducing a latent Gaussian measurement variable to the dynamic system. The extended LEnKF algorithm can converge to the right filtering distribution as the number of stages becomes large, while inheriting the scalability of the LEnKF algorithm with respect to the sample size and state dimension. The performance of the extended LEnKF algorithm is illustrated by dynamic network embedding and dynamic Poisson spatial models.

Citations: 0
An effective method for identifying clusters of robot strengths
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-11 | DOI: 10.1007/s00180-023-01442-5
Jen-Chieh Teng, Chin-Tsang Chiang, Alvin Lim

In the analysis of qualification stage data from FIRST Robotics Competition (FRC) championships, the ratio (1.67–1.68) of the number of observations (110–114 matches) to the number of parameters (66–68 robots) in each division has been found to be quite small for the most commonly used winning margin power rating (WMPR) model. This usually leads to imprecise estimates and inaccurate predictions in such three-on-three matches that FRC tournaments are composed of. With the recognition of a clustering feature in estimated robot strengths, a more flexible model with latent clusters of robots was proposed to alleviate overparameterization of the WMPR model. Since its structure can be regarded as a dimension reduction of the parameter space in the WMPR model, the identification of clusters of robot strengths is naturally transformed into a model selection problem. Instead of comparing a huge number of competing models ($7.76\times 10^{67}$ to $3.66\times 10^{70}$), we develop an effective method to estimate the number of clusters, clusters of robots and robot strengths in the format of qualification stage data from the FRC championships. The new method consists of two parts: (i) a combination of hierarchical and non-hierarchical classifications to determine candidate models; and (ii) variant goodness-of-fit criteria to select optimal models. In contrast to existing hierarchical classification, each step of our proposed non-hierarchical classification is based on estimated robot strengths from a candidate model in the preceding non-hierarchical classification step. A great advantage of the proposed methodology is its ability to consider the possibility of reassigning robots to other clusters. To reduce overestimation of the number of clusters by the mean squared prediction error criteria, corresponding Bayesian information criteria are further established as alternatives for model selection. With a coherent assembly of these essential elements, a systematic procedure is presented to perform the estimation of parameters. In addition, we propose two indices to measure the nested relation between clusters from any two models and monotonic association between robot strengths from any two models. Data from the 2018 and 2019 FRC championships and a simulation study are also used to illustrate the applicability and superiority of our proposed methodology.
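The observation-to-parameter ratio and the effect of clustering can be made concrete with a small sketch. It assumes the usual reading of a winning margin power rating model, in which the expected winning margin of a three-on-three match is the sum of the red alliance's robot strengths minus the sum of the blue alliance's. It also assumes that the quoted ratios pair the lower ends (110 matches, 66 robots) and the upper ends (114 matches, 68 robots) of the ranges. All function and variable names are hypothetical, and no fitting procedure from the paper is reproduced.

```python
def obs_to_param_ratio(n_matches, n_robots):
    """Number of observed margins per free strength parameter."""
    return n_matches / n_robots


def predicted_margin(strength, red, blue):
    """Expected winning margin (red minus blue) under the assumed model."""
    return sum(strength[r] for r in red) - sum(strength[b] for b in blue)


def expand_clusters(cluster_of, cluster_strength):
    """Give every robot the strength of its cluster.

    With K clusters, only K strengths are free instead of one per robot,
    which is the dimension reduction described in the abstract.
    """
    return {robot: cluster_strength[c] for robot, c in cluster_of.items()}


# Ratios quoted in the abstract (assumed pairing of range endpoints).
print(round(obs_to_param_ratio(110, 66), 2))
print(round(obs_to_param_ratio(114, 68), 2))

# Hypothetical toy match: six robots, two latent clusters.
toy_cluster_of = {"r1": "strong", "r2": "weak", "r3": "weak",
                  "b1": "weak", "b2": "weak", "b3": "weak"}
toy_cluster_strength = {"strong": 12.0, "weak": 2.0}
toy_strength = expand_clusters(toy_cluster_of, toy_cluster_strength)
toy_margin = predicted_margin(toy_strength,
                              red=["r1", "r2", "r3"],
                              blue=["b1", "b2", "b3"])
print(toy_margin)
```

In the toy match, six robot-level parameters collapse to two cluster-level ones, so the same number of matches informs far fewer unknowns.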

Citations: 0
High dimensional controlled variable selection with model-X knockoffs in the AFT model
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-09 | DOI: 10.1007/s00180-023-01426-5
Baihua He, Di Xia, Yingli Pan

Interpretability and stability are two important characteristics required for the application of high dimensional data in statistics. Although the former has been favored by many existing forecasting methods to some extent, the latter, in the sense of controlling the fraction of wrongly discovered features, is still largely underdeveloped. Under the accelerated failure time model, this paper introduces a controlled variable selection method within the general framework of Model-X knockoffs to tackle high dimensional data. We provide theoretical justifications for the asymptotic false discovery rate (FDR) control. The proposed method has attracted significant interest due to its strong control of the FDR while preserving predictive power. Several simulation examples are conducted to assess the finite sample performance with desired interpretability and stability. A real data example from an Acute Myeloid Leukemia study is analyzed to demonstrate the utility of the proposed method in practice.
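The selection step of a knockoff procedure can be sketched as follows. The sketch assumes that a feature importance statistic W_j has already been computed for each variable, with large positive values suggesting a genuine signal and a sign that is symmetric under the null. How W_j is built for the accelerated failure time model is specific to the paper and is not reproduced here. The data-dependent threshold below is the standard knockoff+ rule from the general knockoff literature; the function names are hypothetical.

```python
import math


def knockoff_threshold(W, q, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q.

    offset=1 gives the knockoff+ rule; offset=0 gives the less
    conservative variant. Returns math.inf if no t qualifies.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_negative = sum(1 for w in W if w <= -t)
        n_positive = sum(1 for w in W if w >= t)
        if (offset + n_negative) / max(1, n_positive) <= q:
            return t
    return math.inf


def knockoff_select(W, q, offset=1):
    """Indices of the variables selected at target FDR level q."""
    tau = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= tau]


# Hypothetical statistics for seven candidate variables.
toy_W = [5.0, 4.0, 3.0, -1.0, 2.0, 0.5, -0.3]
print(knockoff_threshold(toy_W, q=0.5))
print(knockoff_select(toy_W, q=0.5))
print(knockoff_select(toy_W, q=0.2))
```

The count of large negative statistics serves as an estimate of the number of false positives among the large positive ones, which is why the ratio is compared with the target level q. With only seven variables, a strict level such as q = 0.2 selects nothing in this toy example.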

Citations: 0
Dimension reduction and visualization of multiple time series data: a symbolic data analysis approach
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01440-7
Emily Chia-Yu Su, Han-Ming Wu

Exploratory analysis and visualization of multiple time series data are essential for discovering the underlying dynamics of a series before attempting modeling and forecasting. This study extends two dimension reduction methods - principal component analysis (PCA) and sliced inverse regression (SIR) - to multiple time series data. This is achieved through the innovative path point approach, a new addition to the symbolic data analysis framework. By transforming multiple time series data into time-dependent intervals marked by starting and ending values, each series is geometrically represented as successive directed segments with unique path points. These path points serve as the foundation of our novel representation approach. PCA and SIR are then applied to the data table formed by the coordinates of these path points, enabling visualization of temporal trajectories of objects within a reduced-dimensional subspace. Empirical studies encompassing simulations, microarray time series data from a yeast cell cycle, and financial data confirm the effectiveness of our path point approach in revealing the structure and behavior of objects within a 2D factorial plane. Comparative analyses with existing methods, such as the applied vector approach for PCA and SIR on time-dependent interval data, further underscore the strength and versatility of our path point representation in the realm of time series data.

Citations: 0
An expectation maximization algorithm for the hidden Markov models with multiparameter Student-t observations
IF 1.3 | Zone 4 (Mathematics) | Q3 STATISTICS & PROBABILITY | Pub Date: 2023-12-06 | DOI: 10.1007/s00180-023-01432-7
Emna Ghorbel, Mahdi Louati

Hidden Markov models are a class of probabilistic graphical models used to describe the evolution of a sequence of unknown variables from a set of observed variables. They are statistical models introduced by Baum and Petrie in Baum (JMA 101:789–810) and belong to the class of latent variable models. Initially developed and applied in the context of speech recognition, they have since attracted attention in many fields of application. The central objective of this work is an extension of these models. More precisely, we define multiparameter hidden Markov models, using multiple observation processes and the Riesz distribution on the space of symmetric matrices as a natural extension of the gamma distribution. Some basic related properties are discussed, and marginal and posterior distributions are derived. We use the Forward-Backward dynamic programming algorithm and the classical Expectation Maximization algorithm to estimate the global set of parameters. Using simulated data, the performance of these estimators is evaluated with a Matlab program, which allows us to assess the quality of the proposed estimators by means of the mean square errors between the true and the estimated values.
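The forward pass of the Forward-Backward algorithm can be illustrated on a deliberately simple case. The sketch below assumes a hidden Markov model with a finite set of observation symbols, not the multiparameter Riesz or Student-t observation model of the paper, and checks the forward recursion against a brute-force sum over all hidden paths. All names and the toy parameters are hypothetical.

```python
from itertools import product


def forward_likelihood(init, trans, emit, obs):
    """Likelihood of an observation sequence via the forward recursion.

    init[i]: probability of starting in hidden state i.
    trans[i][j]: probability of moving from state i to state j.
    emit[i][o]: probability of emitting symbol o in state i.
    No rescaling is applied, so long sequences may underflow.
    """
    n = len(init)
    alpha = [init[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def brute_force_likelihood(init, trans, emit, obs):
    """Same likelihood, summing explicitly over every hidden path."""
    n = len(init)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


# Hypothetical two-state model with two observation symbols (0 and 1).
toy_init = [0.6, 0.4]
toy_trans = [[0.7, 0.3], [0.2, 0.8]]
toy_emit = [[0.9, 0.1], [0.3, 0.7]]
toy_obs = [0, 1, 1, 0]

fast = forward_likelihood(toy_init, toy_trans, toy_emit, toy_obs)
slow = brute_force_likelihood(toy_init, toy_trans, toy_emit, toy_obs)
print(fast, slow)
```

The recursion costs a number of operations proportional to the sequence length times the square of the number of states, whereas the brute-force sum grows exponentially with the sequence length. The quantities computed in the forward pass are also what the expectation step of an EM algorithm for such models builds on.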

Citations: 0
Sequential linear regression for conditional mean imputation of longitudinal continuous outcomes under reference-based assumptions
IF 1.3 Zone 4 Mathematics Q3 STATISTICS & PROBABILITY Pub Date: 2023-12-03 DOI: 10.1007/s00180-023-01439-0
Sean Yiu

In clinical trials of longitudinal continuous outcomes, reference-based imputation (RBI) has commonly been applied to handle missing outcome data in settings where the estimand incorporates the effects of intercurrent events, e.g. treatment discontinuation. RBI was originally developed in the multiple imputation framework; recently, however, conditional mean imputation (CMI) combined with the jackknife estimator of the standard error was proposed as a way to obtain deterministic treatment effect estimates and correct frequentist inference. For both multiple imputation and CMI, a mixed model for repeated measures (MMRM) is often used as the imputation model, but this can be computationally intensive to fit to multiple data sets (e.g. the jackknife samples) and can lead to convergence issues for complex MMRMs with many parameters. Therefore, a step-wise approach based on sequential linear regression (SLR) of the outcomes at each visit was developed for the imputation model in the multiple imputation framework, but similar developments in the CMI framework are lacking. In this article, we fill this gap in the literature by proposing an SLR approach to implement RBI in the CMI framework, and we justify its validity using theoretical results and simulations. We also illustrate our proposal with a real data application.
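The jackknife standard error mentioned in the abstract can be illustrated with a short Python sketch. This covers only the generic leave-one-out jackknife, not the paper's SLR or RBI procedure; the function name `jackknife_se` and the `estimator` argument are hypothetical, and in the paper's setting the estimator would presumably be the treatment effect recomputed on each jackknife sample after imputation.

```python
import math
from typing import Callable, List, Sequence


def jackknife_se(
    data: Sequence[float],
    estimator: Callable[[List[float]], float],
) -> float:
    """Leave-one-out jackknife standard error of `estimator` (illustrative)."""
    n = len(data)
    if n < 2:
        raise ValueError("the jackknife needs at least two observations")

    # Re-compute the estimate with each observation removed in turn.
    leave_one_out = [
        estimator(list(data[:i]) + list(data[i + 1:])) for i in range(n)
    ]
    center = sum(leave_one_out) / n

    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


def sample_mean(values: List[float]) -> float:
    """Simple estimator used for the usage example below."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(jackknife_se([1.0, 2.0, 3.0, 4.0, 5.0], sample_mean))
```

Each call to `estimator` refits on a reduced data set, which is why a cheap imputation model matters when the estimator itself involves model fitting, as the abstract points out for MMRM.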

Citations: 0