Pub Date: 2025-08-10 | DOI: 10.1016/j.csda.2025.108266
Mingyue Du, Ricong Zeng
The estimation of the semiparametric probit model is discussed for the situation where one observes interval-censored failure time data arising from case-cohort studies. The probit model has recently attracted attention for regression analysis of failure time data, partly due to the popularity of the normal distribution and its similarity to linear models. Although some methods have been developed in the literature for its estimation, no established approach seems to exist for case-cohort interval-censored data. To address this, a pseudo-maximum likelihood method is proposed, and an EM algorithm is developed for its implementation. The resulting estimators of the regression parameters are shown to be consistent and asymptotically normal. To assess the empirical performance of the proposed method, a simulation study is conducted and indicates that it works well in practical situations. In addition, the method is applied to a set of real data arising from an AIDS clinical trial that motivated this study.
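As a hedged illustration of the model class (not the authors' estimator): under a semiparametric probit specification, P(T ≤ t | x) = Φ(h(t) + x'β) with h a monotone nuisance function, so an interval-censored observation (L, R] contributes Φ(h(R) + x'β) − Φ(h(L) + x'β) to the likelihood. A minimal sketch with illustrative h values:

```python
# Minimal sketch (not the paper's implementation): log-likelihood of a probit
# model under interval censoring, with h(.) evaluated at the interval endpoints.
import math

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_censored_loglik(beta, data, h):
    """data: list of (x, L, R); L=None means left-censored, R=None right-censored.
    h: dict mapping observation times to values of the monotone h(.)."""
    ll = 0.0
    for x, L, R in data:
        eta = sum(b * xi for b, xi in zip(beta, x))
        FL = 0.0 if L is None else Phi(h[L] + eta)
        FR = 1.0 if R is None else Phi(h[R] + eta)
        ll += math.log(FR - FL)
    return ll

data = [((1.0,), None, 1.0),   # event in (0, 1]
        ((0.0,), 1.0, 2.0),    # event in (1, 2]
        ((1.0,), 2.0, None)]   # right-censored beyond 2
h = {1.0: -0.5, 2.0: 0.5}      # illustrative values of h(.) at the endpoints
print(round(interval_censored_loglik([0.3], data, h), 4))
```

In the actual pseudo-maximum likelihood approach, h would be estimated jointly with β via the EM algorithm; here both are fixed purely for illustration.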
"Estimation of semiparametric probit model based on case-cohort interval-censored failure time data." Computational Statistics & Data Analysis, vol. 213, Article 108266.
Pub Date: 2025-08-06 | DOI: 10.1016/j.csda.2025.108255
Uche Mbaka , James Owen Ramsay , Michelle Carey
Functional data analysis frequently involves estimating a smooth covariance function based on observed data. This estimation is essential for understanding interactions among functions and constitutes a fundamental aspect of numerous advanced methodologies, including functional principal component analysis. Two approaches for estimating smooth covariance functions in the presence of measurement errors are introduced. The first method employs a low-rank approximation of the covariance matrix, while the second ensures positive definiteness via a Cholesky decomposition. Both approaches employ penalized regression to produce smooth covariance estimates and have been validated through comprehensive simulation studies. The practical application of these methods is demonstrated through the examination of average weekly milk yields in dairy cows as well as egg-laying patterns of Mediterranean fruit flies.
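A toy sketch of the low-rank idea (pure Python, not the paper's penalized estimator): the sample covariance of noise-corrupted curves has a diagonal inflated by the measurement-error variance, and a truncated eigendecomposition, here rank 1 via power iteration, yields a positive semi-definite approximation that discards most of that noise.

```python
# Illustration only: rank-1 covariance approximation for noisy functional data.
import math, random
random.seed(1)

m, n, sigma = 8, 400, 0.3            # grid points, curves, noise sd
grid = [j / (m - 1) for j in range(m)]
shape = [math.sin(math.pi * t) for t in grid]   # one smooth mode of variation
X = [[random.gauss(0, 1) * s + random.gauss(0, sigma) for s in shape]
     for _ in range(n)]

mean = [sum(x[j] for x in X) / n for j in range(m)]
C = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in X) / (n - 1)
      for k in range(m)] for j in range(m)]     # raw sample covariance

# power iteration for the leading eigenpair of C
v = [1.0] * m
for _ in range(200):
    w = [sum(C[j][k] * v[k] for k in range(m)) for j in range(m)]
    nrm = math.sqrt(sum(wi * wi for wi in w))
    v = [wi / nrm for wi in w]
lam = sum(v[j] * sum(C[j][k] * v[k] for k in range(m)) for j in range(m))
C1 = [[lam * v[j] * v[k] for k in range(m)] for j in range(m)]  # rank-1, PSD

# the rank-1 reconstruction has a smaller trace: the white-noise part is dropped
print(round(sum(C[j][j] for j in range(m)), 3),
      round(sum(C1[j][j] for j in range(m)), 3))
```

The paper's methods additionally smooth the estimate via penalized regression; this sketch only shows why separating the low-rank signal from the noisy diagonal is useful.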
"Estimating a smooth covariance for functional data." Computational Statistics & Data Analysis, vol. 213, Article 108255.
Pub Date: 2025-08-05 | DOI: 10.1016/j.csda.2025.108256
Hyungwoo Kim , Seung Jun Shin
The receiver operating characteristic (ROC) curve is a popular tool for evaluating a binary classifier under the imbalanced scenarios frequently encountered in practice. A practical approach to constructing a linear binary classifier is presented by simultaneously optimizing the area under the ROC curve (AUC) and selecting informative variables in high dimensions. In particular, the smoothly clipped absolute deviation (SCAD) penalty is employed, and its oracle property is established, which enables the development of a consistent BIC-type information criterion that greatly facilitates the tuning procedure. Both simulated and real data analyses demonstrate the promising performance of the proposed method in terms of AUC optimization and variable selection.
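Two ingredients of such a method can be sketched directly (illustration only, not the paper's algorithm): the empirical AUC of a linear score x'β is the fraction of (case, control) pairs ranked correctly, and the SCAD penalty of Fan and Li applies a folded-quadratic shrinkage to each coefficient.

```python
# Empirical AUC of a linear score, plus the SCAD penalty function.
def auc(beta, pos, neg):
    score = lambda x: sum(b * xi for b, xi in zip(beta, x))
    pairs = [(score(p), score(q)) for p in pos for q in neg]
    return sum(1.0 if sp > sq else 0.5 if sp == sq else 0.0
               for sp, sq in pairs) / len(pairs)

def scad(b, lam, a=3.7):
    """SCAD penalty for one coefficient: linear near 0, quadratic blend,
    then constant, so large coefficients are not over-shrunk."""
    b = abs(b)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

pos = [(2.0, 0.1), (1.5, -0.2)]   # cases
neg = [(0.5, 0.3), (0.2, -0.1)]   # controls
print(auc((1.0, 0.0), pos, neg))  # x1 alone separates the classes perfectly
print(round(scad(0.05, 0.1), 4), round(scad(5.0, 0.1), 4))
```

Maximizing the (smoothed) AUC while penalizing β with SCAD is what drives the variable selection; the smoothing and optimization details are the paper's contribution and are not reproduced here.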
"Variable selection in AUC-optimizing classification." Computational Statistics & Data Analysis, vol. 213, Article 108256.
Pub Date: 2025-07-29 | DOI: 10.1016/j.csda.2025.108254
Quan Vu , Francis K.C. Hui , Samuel Muller , A.H. Welsh
When fitting generalized linear mixed models, choosing the random effects distribution is an important decision. As random effects are unobserved, misspecification of their distribution is a real possibility. Thus, the consequences of random effects misspecification for point prediction and prediction inference of random effects in generalized linear mixed models need to be investigated. A combination of theory, simulation, and a real application is used to explore the effect of using the common normality assumption for the random effects distribution when the correct specification is a mixture of normal distributions, focusing on the impacts on point prediction, mean squared prediction errors, and prediction intervals. Results show that the level of shrinkage for the predicted random effects can differ greatly under the two random effects distributions, and so is susceptible to misspecification. Also, the unconditional mean squared prediction errors for the random effects are almost always larger under the misspecified normal random effects distribution, while results for the mean squared prediction errors conditional on the random effects are more complicated but remain generally larger under the misspecified distribution (especially when the true random effect is close to the mean of one of the component distributions in the true mixture distribution). Results for prediction intervals indicate that the overall coverage probability is, in contrast, not greatly impacted by misspecification. It is concluded that misspecifying the random effects distribution can affect prediction of random effects, and greater caution is recommended when adopting the normality assumption in generalized linear mixed models.
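The shrinkage phenomenon can be made concrete in a much simpler setting than the GLMMs studied here (a linear random-intercept model, shown purely for intuition): the normality-based predictor shrinks the cluster mean toward zero by σ_b² / (σ_b² + σ_e²/n) regardless of whether the random effects really are normal, so when the truth is a well-separated normal mixture the same factor is applied everywhere.

```python
# Toy illustration: normality-based shrinkage prediction when the true random
# intercepts follow a two-component normal mixture.
import random, math
random.seed(7)

sigma_e, n_per, n_clusters = 1.0, 5, 4000

def draw_b():  # truth: equal-weight mixture of N(-2, 0.5^2) and N(2, 0.5^2)
    return random.gauss(-2.0, 0.5) if random.random() < 0.5 else random.gauss(2.0, 0.5)

sigma_b2 = 0.5**2 + 2.0**2           # mixture variance: within + between components
shrink = sigma_b2 / (sigma_b2 + sigma_e**2 / n_per)

mspe = 0.0
for _ in range(n_clusters):
    b = draw_b()
    ybar = b + random.gauss(0, sigma_e / math.sqrt(n_per))  # cluster mean deviation
    mspe += (shrink * ybar - b) ** 2  # normal-theory predictor vs true effect
print(round(shrink, 3), round(mspe / n_clusters, 3))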
"Random effects misspecification and its consequences for prediction in generalized linear mixed models." Computational Statistics & Data Analysis, vol. 213, Article 108254.
Pub Date: 2025-07-23 | DOI: 10.1016/j.csda.2025.108253
Dayi Li , Ziang Zhang
Approximate Bayesian inference based on Laplace approximation and quadrature has become increasingly popular for its efficiency in fitting latent Gaussian models (LGM). However, many useful models can only be fitted as LGMs if some conditioning parameters are fixed. Such models are termed conditional LGMs, with examples including change-point detection, non-linear regression, and many others. Existing methods for fitting conditional LGMs rely on grid search or sampling-based approaches to explore the posterior density of the conditioning parameters; both require a large number of evaluations of the unnormalized posterior density of the conditioning parameters. Since each evaluation requires fitting a separate LGM, these methods become computationally prohibitive beyond simple scenarios. In this work, the Bayesian Optimization Sequential Surrogate (BOSS) algorithm is introduced, which combines Bayesian optimization with approximate Bayesian inference methods to significantly reduce the computational resources required for fitting conditional LGMs. With orders of magnitude fewer evaluations than those required by the existing methods, BOSS efficiently generates sequential design points that capture the majority of the posterior mass of the conditioning parameters and subsequently yields an accurate surrogate posterior distribution that can be easily normalized. The efficiency, accuracy, and practical utility of BOSS are demonstrated through extensive simulation studies and real-world applications in epidemiology, environmental sciences, and astrophysics.
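The surrogate idea can be illustrated in a drastically simplified form (fixed design points and linear interpolation instead of Bayesian-optimization-chosen points and a Gaussian-process surrogate; everything below is an assumption for illustration): evaluate an expensive unnormalized log-posterior at a handful of points, interpolate it, then normalize the surrogate cheaply on a fine grid.

```python
# Simplified surrogate-posterior sketch: few expensive evaluations, cheap
# interpolation, then normalization of the surrogate density.
import math

def expensive_log_post(theta):   # stand-in for "fit one LGM per evaluation"
    return -0.5 * (theta - 1.0) ** 2 / 0.25    # unnormalized N(1, 0.5^2)

design = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]  # only 7 expensive calls
vals = [expensive_log_post(t) for t in design]

def surrogate(theta):            # piecewise-linear in the log density
    for (t0, v0), (t1, v1) in zip(zip(design, vals), zip(design[1:], vals[1:])):
        if t0 <= theta <= t1:
            w = (theta - t0) / (t1 - t0)
            return (1 - w) * v0 + w * v1
    return float("-inf")         # outside the explored region

grid = [-1.0 + 4.0 * i / 800 for i in range(801)]
dens = [math.exp(surrogate(t)) for t in grid]
h = grid[1] - grid[0]
Z = sum(dens) * h                # cheap Riemann normalization of the surrogate
post_mean = sum(t * d for t, d in zip(grid, dens)) * h / Z
print(round(post_mean, 3))
```

BOSS replaces the fixed design with sequentially chosen Bayesian-optimization points, so the expensive evaluations concentrate where the posterior mass actually is.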
"Bayesian optimization sequential surrogate (BOSS) algorithm: Fast Bayesian inference for a broad class of Bayesian hierarchical models." Computational Statistics & Data Analysis, vol. 213, Article 108253.
Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108252
Bogui Li , Jianbao Chen
To study the space-time panel data that are ubiquitous in the real world, a fixed effects partially linear additive spatial autoregressive (SAR) model with space-time correlated disturbances is proposed. Compared to the linear panel model with space-time correlated disturbances, it can simultaneously capture substantial spatial dependence of the response, linear and nonlinear relationships between the response and regressors, and spatial and serial correlations of disturbances, while avoiding the “curse of dimensionality” of nonparametric regression. By using B-splines to fit the additive components and constructing linear and quadratic moment conditions that incorporate information in the disturbances, the generalized method of moments (GMM) estimators of the unknown parameters and additive components are obtained. Under certain regularity assumptions, it is proved that the GMM estimators are consistent and asymptotically normal. Furthermore, the asymptotically efficient best GMM estimators under normality are derived. Monte Carlo simulation and empirical analysis illustrate that the developed estimation method has good finite sample performance and application prospects.
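The B-spline approximation of an additive component can be sketched with the standard Cox-de Boor recursion (illustration only; the paper's GMM machinery is not reproduced): the component g(t) is modeled as a linear combination Σᵢ θᵢ Bᵢ(t) of basis functions, which sum to one on the interior of the knot range.

```python
# Cox-de Boor recursion for B-spline basis functions.
def bspline_basis(i, p, t, knots):
    """Value of the i-th degree-p B-spline at t for a given knot vector."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + p] != knots[i]:
        left = ((t - knots[i]) / (knots[i + p] - knots[i])
                * bspline_basis(i, p - 1, t, knots))
    right = 0.0
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, t, knots))
    return left + right

# clamped cubic basis on [0, 1] with one interior knot at 0.5
knots = [0, 0, 0, 0, 0.5, 1, 1, 1, 1]
nbasis = len(knots) - 4                # 5 cubic basis functions
vals = [bspline_basis(i, 3, 0.3, knots) for i in range(nbasis)]
print([round(v, 4) for v in vals], round(sum(vals), 4))  # partition of unity
```

In the model, each additive component contributes one such basis expansion, and the spline coefficients enter the GMM moment conditions alongside the parametric part.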
"GMM estimation of fixed effects partially linear additive SAR model with space-time correlated disturbances." Computational Statistics & Data Analysis, vol. 213, Article 108252.
Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108251
Matteo Framba , Veronica Vinciotti , Ernst C. Wit
Parameter estimation of kinetic rates in stochastic quasi-reaction systems can be challenging, particularly when the time gap between consecutive measurements is large. Local linear approximation approaches account for the stochasticity in the system but fail to capture the intrinsically nonlinear nature of the mean dynamics of the process. Moreover, the mean dynamics of a quasi-reaction system can be described by a system of ODEs, which have an explicit solution only for simple unitary systems. An approximate analytical solution is derived for generic quasi-reaction systems via a first-order Taylor approximation of the hazard rate. This allows a nonlinear forward prediction of the future dynamics given the current state of the system. Predictions and corresponding observations are embedded in a nonlinear least-squares approach for parameter estimation. The performance of the algorithm is compared to existing methods via a simulation study. Besides the generality of the approach in the specification of the quasi-reaction system and the gains in computational efficiency, the results show an improvement in the kinetic rate estimation, particularly for data observed at large time intervals. Additionally, the availability of an explicit solution makes the method robust to stiffness, which is often present in biological systems. Application to Rhesus Macaque data illustrates the use of the method in the study of cell differentiation.
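The forward-prediction idea can be sketched on a single illustrative reaction (a dimerization 2Y → ∅ with hazard k·y(y−1)/2; a choice made here for illustration, not one of the paper's examples): linearizing the hazard at the current state turns the nonlinear mean ODE dy/dt = −2h(y) into a linear ODE dy/dt = a + b·y, whose explicit exponential solution remains usable across a large time gap, unlike a single local-linear (Euler) step.

```python
# Compare one-step local-linear prediction, the explicit Taylor-linearized
# solution, and a fine-step reference across a large time gap.
import math

k, y0, dt = 0.01, 100.0, 5.0           # rate, current state, large time gap

def hazard(y):      return k * y * (y - 1) / 2.0
def hazard_grad(y): return k * (2 * y - 1) / 2.0

# (a) local linear approximation: one Euler step across the whole gap
y_euler = y0 - 2.0 * hazard(y0) * dt

# (b) first-order Taylor expansion of the hazard at y0 gives dy/dt = a + b*y,
#     which has the explicit solution y(t) = (y0 + a/b) * exp(b*t) - a/b
b = -2.0 * hazard_grad(y0)
a = -2.0 * hazard(y0) - b * y0
y_taylor = (y0 + a / b) * math.exp(b * dt) - a / b

# (c) reference: fine-step integration of the exact nonlinear mean ODE
y_ref, steps = y0, 50000
for _ in range(steps):
    y_ref += (dt / steps) * (-2.0 * hazard(y_ref))

print(round(y_euler, 1), round(y_taylor, 1), round(y_ref, 1))
```

The one-step Euler prediction even goes negative here, while the linearized explicit solution stays on the right scale; embedding such predictions in nonlinear least squares is what the paper does for generic quasi-reaction systems.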
"Inferring the dynamics of quasi-reaction systems via nonlinear local mean-field approximations." Computational Statistics & Data Analysis, vol. 213, Article 108251.
Pub Date: 2025-07-21 | DOI: 10.1016/j.csda.2025.108250
Shih-Ting Huang , Graham A. Colditz , Shu Jiang
Multi-omics analysis offers unparalleled insights into the interlinked molecular interactions that govern the underlying biological processes. In the era of big data, driven by the emergence of high-throughput technologies, it is possible to gain a more comprehensive and detailed understanding of complex systems. Nevertheless, the challenges lie in developing methods to effectively integrate and analyze this wealth of data. This challenge is even more apparent when the type of -omics data (e.g., pathomics) lacks pixel-to-pixel or region-to-region correspondence across the population. A novel sample-specific cooperative learning framework is introduced, designed to adaptively manage diverse multi-omics data types, even when there is no direct correspondence between regions. The proposed framework is defined for both continuous and categorical outcomes, with theoretical guarantees based on finite samples. Model performance is demonstrated and compared with existing methods using real-world datasets involving proteomics and metabolomics, and radiomics and pathomics.
"Sample-specific cooperative learning integrating heterogeneous radiomics and pathomics data." Computational Statistics & Data Analysis, vol. 213, Article 108250.
Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108247
Michael Lau , Tamara Schikowski , Holger Schwender
Incorporating interaction effects is essential for accurately modeling complex underlying relationships in many applications. Often, not only strong predictive performance is desired, but also the interpretability of the resulting model. This need is evident in areas such as epidemiology, in which uncovering the interplay of biological mechanisms is critical for understanding complex diseases. Classical linear models, frequently used for constructing genetic risk scores, fail to capture interaction effects autonomously, while modern machine learning methods such as gradient boosting often produce black-box models that lack interpretability. Existing linear interaction models are largely limited to considering two-way interactions. To address these limitations, a novel statistical learning method, BITS (Boosting Interaction Tree Stumps), is introduced to construct linear models while autonomously detecting and incorporating interaction effects. BITS uses gradient boosting on interaction tree stumps, i.e., decision trees with a single split, where in BITS this split can possibly occur on an interaction term. A branch-and-bound approach is employed in BITS to discard weakly predictive terms. For high-dimensional data, a hybrid search strategy combining greedy and exhaustive approaches is proposed. Regularization techniques are integrated to prevent overfitting and the inclusion of spurious interaction effects. Simulation studies and real data applications demonstrate that BITS produces interpretable models with strong predictive performance. Moreover, in the simulation study, BITS primarily identifies truly influential terms.
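A highly simplified sketch of the stump-boosting idea (not the BITS implementation; no branch-and-bound search and no regularization): L2 boosting where each base learner is a single split on either a raw feature or a pairwise product, so an interaction can enter the model without being pre-specified.

```python
# Toy L2 boosting with interaction tree stumps on a pure-interaction signal.
import random
random.seed(3)

n = 500
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [2.0 * x[0] * x[1] + random.gauss(0, 0.1) for x in X]   # pure interaction

def terms(x):   # candidate split variables: main effects and pairwise products
    vals = {j: x[j] for j in range(3)}
    vals.update({(j, k): x[j] * x[k] for j in range(3) for k in range(j + 1, 3)})
    return vals

def fit_stump(X, r):
    """Best single split (term, threshold, leaf means) for residuals r."""
    best = None
    for key in terms(X[0]):
        z = [terms(x)[key] for x in X]
        for thr in (-0.5, 0.0, 0.5):
            lo = [ri for zi, ri in zip(z, r) if zi <= thr]
            hi = [ri for zi, ri in zip(z, r) if zi > thr]
            if not lo or not hi:
                continue
            ml, mh = sum(lo) / len(lo), sum(hi) / len(hi)
            sse = sum((ri - ml) ** 2 for ri in lo) + sum((ri - mh) ** 2 for ri in hi)
            if best is None or sse < best[0]:
                best = (sse, key, thr, ml, mh)
    return best

r = list(y)
picked = []
for _ in range(20):                    # 20 boosting rounds, learning rate 0.5
    _, key, thr, ml, mh = fit_stump(X, r)
    picked.append(key)
    for i, x in enumerate(X):
        r[i] -= 0.5 * (ml if terms(x)[key] <= thr else mh)
print(picked[0])                       # the x0*x1 interaction is selected first
```

Neither main effect correlates with y on its own here, so only a learner that can split on the product term detects the signal, which is the motivation for interaction tree stumps.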
"Boosting interaction tree stumps for modeling interactions." Computational Statistics & Data Analysis, vol. 213, Article 108247.
Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108248
Arthur Pewsey
The cardioid distribution, despite being one of the fundamental models for circular data, has received limited attention both methodologically and in terms of its implementation in R. To redress these shortcomings, published results on the model are summarized, corrected and extended, and the scope and limitations of the existing support for the model in R are identified. A thorough investigation into the performance of trigonometric-moment- and maximum-likelihood-based approaches to point and interval estimation of the model's location and concentration parameters is presented, and goodness-of-fit techniques are outlined. A suite of reliable R functions is provided for the model's practical application. The application of the proposed inferential methods and R functions is illustrated by an analysis of palaeocurrent cross-bed azimuths.
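The trigonometric moment estimators are simple enough to sketch (shown here in Python rather than the paper's R functions): for the cardioid density f(θ) = (1 + 2ρ cos(θ − μ)) / (2π) with |ρ| < 1/2, the first trigonometric moments are E[cos θ] = ρ cos μ and E[sin θ] = ρ sin μ, so μ and ρ are recovered from the sample means of cos θ and sin θ. Sampling below is by rejection from a uniform envelope.

```python
# Trigonometric moment estimation for the cardioid distribution.
import math, random
random.seed(11)

mu, rho = 1.0, 0.3

def rcardioid():
    while True:                        # rejection sampling; density max is 1 + 2*rho
        th = random.uniform(-math.pi, math.pi)
        if random.uniform(0, 1 + 2 * rho) <= 1 + 2 * rho * math.cos(th - mu):
            return th

sample = [rcardioid() for _ in range(20000)]
cbar = sum(math.cos(t) for t in sample) / len(sample)
sbar = sum(math.sin(t) for t in sample) / len(sample)
mu_hat = math.atan2(sbar, cbar)        # E cos = rho*cos(mu), E sin = rho*sin(mu)
rho_hat = math.hypot(cbar, sbar)
print(round(mu_hat, 2), round(rho_hat, 2))
```

Maximum likelihood estimation, interval estimation and the boundary behavior as ρ approaches 1/2 are where the paper's corrections and R functions come in; the moment estimators above are only the starting point.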
"On Jeffreys's cardioid distribution." Computational Statistics & Data Analysis, vol. 213, Article 108248.