Efficient semiparametric estimation in two‐sample comparison via semisupervised learning

The Canadian Journal of Statistics. Pub Date: 2024-09-03. DOI: 10.1002/cjs.11813

Tao Tan, Shuyi Zhang, Yong Zhou

Citations: 0

Abstract

We develop a general semisupervised framework for statistical inference in the two‐sample comparison setting. Although the supervised Mann–Whitney statistic outperforms many estimators in the two‐sample problem for nonnormally distributed responses, it is excessively inefficient because it ignores large amounts of unlabelled information. To borrow strength from unlabelled data, we propose a class of efficient and adaptive estimators that use two‐step semiparametric imputation. The probabilistic index model is adopted primarily to achieve dimension reduction for multivariate covariates, and a follow‐up reweighting step balances the contributions of labelled and unlabelled data. The asymptotic properties of our estimator are derived with variance comparison through a phase diagram. Efficiency theory shows our estimators achieve the semiparametric variance lower bound if the probabilistic index model is correctly specified, and are more efficient than their supervised counterpart when the model is not degenerate. The asymptotic variance is estimated through a two‐step perturbation resampling procedure. To gauge the finite sample performance, we conducted extensive simulation studies which verify the adaptive nature of our methods with respect to model misspecification. To illustrate the merits of our proposed method, we analyze a dataset concerning homelessness in Los Angeles.
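As a point of reference for the supervised baseline named in the abstract, the Mann–Whitney statistic can be sketched as a plain estimate of the probabilistic index θ = P(X < Y) + 0.5·P(X = Y), computed from labelled responses only. This is a minimal illustration under stated assumptions: the function name `mann_whitney_index` and the toy samples `control` and `treated` are hypothetical, and the sketch does not reproduce the paper's two‐step semiparametric imputation, reweighting, or perturbation resampling.

```python
from itertools import product


def mann_whitney_index(x, y):
    """Estimate theta = P(X < Y) + 0.5 * P(X = Y) from two labelled samples.

    Hypothetical helper for illustration; it uses labelled responses only
    and ignores any unlabelled covariate information.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    total = 0.0
    # Compare every response in x with every response in y.
    for xi, yj in product(x, y):
        if xi < yj:
            total += 1.0
        elif xi == yj:
            total += 0.5  # ties count as half
    return total / (len(x) * len(y))


# Toy labelled responses (made up for illustration only).
control = [1.2, 3.4, 2.2, 5.0]
treated = [2.9, 4.1, 6.3]
theta_hat = mann_whitney_index(control, treated)
print(theta_hat)
```

A value near 0.5 suggests no systematic ordering between the two groups, while values near 0 or 1 suggest that one group tends to have larger responses. The estimators proposed in the paper aim to lower the variance of this kind of estimate by also using unlabelled observations.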
