Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping.

IF 5.2 · Zone 3, Computer Science · JCR Q1 (AUTOMATION & CONTROL SYSTEMS) · Journal of Machine Learning Research · Pub Date: 2022-01-01

Yichi Zhang, Molei Liu, Matey Neykov, Tianxi Cai


Abstract

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite their potential, EHR data are currently underutilized for discovery research, largely because they lack precise phenotype information. To overcome this difficulty, recent efforts have been devoted to developing supervised algorithms that accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, $p$, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label $Y$ and the feature set $X$ are observed) and a much larger, weakly-labeled dataset in which the feature set $X$ is accompanied only by a surrogate label $S$ that is available to all patients. Under a working prior assumption that $S$ is related to $X$ only through $Y$, and allowing this assumption to hold only approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
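The shrinkage idea in the abstract can be illustrated with a deliberately simplified sketch. This is not the PASS estimator itself, whose exact form the abstract does not give. The sketch assumes a prior direction obtained by least squares of $S$ on $X$ over the large weakly-labeled set, a logistic model for $Y$ fitted on the small labeled set, and a ridge penalty (an assumption chosen for simplicity) on the deviation from that direction. All function names, the toy data generator, and the tuning parameter `lam` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n, beta, rng):
    """Toy data in which S depends on X only through Y (the working prior holds)."""
    X = rng.normal(size=(n, beta.size))
    y = rng.binomial(1, sigmoid(X @ beta)).astype(float)
    s = 2.0 * y + rng.normal(size=n)  # noisy surrogate of the true label
    return X, y, s


def prior_direction(X_all, s_all):
    """Unit-norm least-squares slope of S on X from the weakly-labeled set."""
    Xc = X_all - X_all.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, s_all - s_all.mean(), rcond=None)
    return coef / np.linalg.norm(coef)


def fit_shrunk_logistic(X, y, direction, lam, n_iter=5000, lr=0.1):
    """Logistic fit with beta = rho * direction + delta.

    Only delta is penalized (lam * ||delta||^2), so a large lam pulls beta
    towards the prior direction while rho sets its scale freely.
    """
    n, p = X.shape
    b0, rho, delta = 0.0, 0.0, np.zeros(p)
    step = lr / (1.0 + 2.0 * lam)  # smaller steps keep descent stable for large lam
    for _ in range(n_iter):
        beta = rho * direction + delta
        resid = sigmoid(b0 + X @ beta) - y
        g_beta = X.T @ resid / n  # gradient of the mean log-loss w.r.t. beta
        b0 -= step * resid.mean()
        rho -= step * (direction @ g_beta)
        delta -= step * (g_beta + 2.0 * lam * delta)
    return b0, rho * direction + delta
```

In this sketch a small `lam` lets the labeled data override the prior direction, while a large `lam` trusts it almost entirely. Choosing `lam` from the data (for example by cross-validation on the labeled set) is one way to make the amount of shrinkage adaptive to the quality of the prior.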


Source journal: Journal of Machine Learning Research (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 18.80

Self-citation rate: 0.00%

Review time: 3 months

About the journal: The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.