Testing sufficiency for transfer learning
Ziqian Lin, Yuan Gao, Feifei Wang, Hansheng Wang
Computational Statistics & Data Analysis, Volume 203, Article 108075 (published 2024-10-16). DOI: 10.1016/j.csda.2024.108075
Abstract
Modern statistical analysis often involves high-dimensional models fitted to samples of limited size, which makes it difficult to estimate such models from target data alone. How to borrow information from a larger source dataset to obtain a more accurate estimate of the target model therefore becomes an interesting problem, and it leads to the useful idea of transfer learning. Various estimation methods have been developed for this purpose in recent years. In this work, we study transfer learning from a different perspective and consider the problem of testing for transfer learning sufficiency. We take transfer learning sufficiency as the null hypothesis: it refers to the situation in which, with the help of the source data, the useful information contained in the feature vectors of the target data can be sufficiently extracted for predicting the target response of interest. Rejection of the null hypothesis therefore implies that information useful for prediction remains in the feature vectors of the target data, which calls for further exploration. To this end, we develop a novel testing procedure together with a centered and standardized test statistic, whose asymptotic null distribution is derived analytically. Simulation studies demonstrate the finite-sample performance of the proposed method, and a deep-learning-related real data example is presented for illustration.
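To make the null hypothesis concrete, the sketch below shows one generic way such a sufficiency check could be organized in code: residuals of the target response from a transfer-learned predictor are projected back onto the target features, and a chi-square-type score statistic measures whether any predictive signal remains. This is a minimal illustration only, not the paper's procedure; the names `sufficiency_score_test` and `transfer_predict` are hypothetical, the working model is linear, and the sketch assumes n > p, unlike the high-dimensional regime that the paper's centered and standardized statistic is designed to handle.

```python
import numpy as np
from scipy import stats

def sufficiency_score_test(X_target, y_target, transfer_predict):
    """Illustrative chi-square-type score test for leftover predictive signal.

    X_target : (n, p) array of target feature vectors, with n > p assumed.
    y_target : (n,) array of target responses.
    transfer_predict : callable returning fitted values for X_target obtained
        with the help of the source data (a hypothetical stand-in for any
        transfer-learned predictor; not the estimator used in the paper).
    """
    n, p = X_target.shape
    # Residual signal left after the transfer-learned predictor is applied.
    resid = y_target - transfer_predict(X_target)
    # Center the target features and project the residuals onto them.
    Xc = X_target - X_target.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, resid, rcond=None)
    fitted = Xc @ beta
    # Noise variance estimated from what the target features cannot explain.
    sigma2 = np.sum((resid - fitted) ** 2) / (n - p - 1)
    # Large values suggest the target features still carry useful information,
    # i.e. evidence against the null hypothesis of transfer learning sufficiency.
    T = np.sum(fitted ** 2) / sigma2
    p_value = stats.chi2.sf(T, df=p)
    return T, p_value
```

The projection-based statistic above mimics a classical lack-of-fit check; the article's actual test statistic is constructed and standardized differently so that its asymptotic null distribution remains valid in the high-dimensional setting it targets.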
About the journal:
Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:
I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.
II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures.
[...]
III) Special Applications - [...]
IV) Annals of Statistical Data Science [...]