Distributed non-disclosive validation of predictive models by a modified ROC-GLM.

Impact factor: 3.9 | Zone 3 (Medicine) | JCR Q1, Health Care Sciences & Services | BMC Medical Research Methodology | Pub Date: 2024-08-29 | DOI: 10.1186/s12874-024-02312-4
Daniel Schalk, Raphael Rehms, Verena S Hoffmann, Bernd Bischl, Ulrich Mansmann
Citations: 0

Abstract

Background: Distributed statistical analyses provide a promising approach for privacy protection when analyzing data distributed over several databases. Instead of operating directly on the data, the analyst receives anonymous summary statistics, which are combined into an aggregated result. Further, when developing a discrimination model (prognosis, diagnosis, etc.), it is key to evaluate the trained model with respect to its prognostic or predictive performance on new, independent data. For binary classification, discrimination is quantified by the receiver operating characteristic (ROC) curve, with its area under the curve (AUC) as an aggregate measure. We aim to calculate both, as well as basic indicators of calibration-in-the-large, for a binary classification task using a distributed and privacy-preserving approach.
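The idea of combining site-level summary statistics into one aggregated result can be sketched for calibration-in-the-large. This is a minimal illustration, not the paper's implementation: the `SiteSummary` layout and the function names are hypothetical, and the sketch omits the disclosure checks that a real deployment would apply before releasing any aggregate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site shares instead of patient-level rows (hypothetical layout)."""
    n: int                 # number of patients at the site
    sum_outcome: float     # number of observed events (sum of 0/1 outcomes)
    sum_predicted: float   # sum of predicted probabilities


def summarize_site(outcomes, predictions):
    """Runs at the site: reduce individual records to three numbers."""
    if len(outcomes) != len(predictions):
        raise ValueError("outcomes and predictions must have the same length")
    return SiteSummary(len(outcomes), float(sum(outcomes)), float(sum(predictions)))


def calibration_in_the_large(summaries):
    """Runs at the analyst: pool the summaries into overall means.

    Returns (observed event rate, mean predicted risk, observed - predicted).
    """
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations")
    observed = sum(s.sum_outcome for s in summaries) / n
    expected = sum(s.sum_predicted for s in summaries) / n
    return observed, expected, observed - expected


site_a = summarize_site([1, 0, 0, 1], [0.8, 0.2, 0.1, 0.7])
site_b = summarize_site([0, 1], [0.3, 0.6])
observed, expected, gap = calibration_in_the_large([site_a, site_b])
```

Because sums and counts add up across sites, the pooled means equal those computed on the combined data, and no individual record leaves a site.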

Methods: We employ DataSHIELD to carry out distributed analyses, and we use a newly developed algorithm to validate the prediction score by conducting a distributed and privacy-preserving ROC analysis. Calibration curves are constructed from mean values over sites. The ROC curve and its AUC are determined from a generalized linear model (GLM) approximation of the true ROC curve, the ROC-GLM, combined with ideas from differential privacy (DP). DP adds noise (quantified by the $\ell_2$ sensitivity $\Delta_2(\hat{f})$) to the data and enables a global handling of placement numbers. The impact of the DP parameters was studied by simulations.
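To make the GLM approximation of the ROC curve concrete, the sketch below assumes a probit link, which gives the familiar binormal form ROC(t) = Phi(a + b * Phi^{-1}(t)) with closed-form AUC Phi(a / sqrt(1 + b^2)). The abstract does not state the link function, so the probit choice, the parameter names `a` and `b`, and the function names are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def binormal_roc(t, a, b):
    """True positive rate at false positive rate t in (0, 1) under the binormal model."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(t))


def binormal_auc(a, b):
    """Closed-form area under the binormal ROC curve."""
    return _STD_NORMAL.cdf(a / sqrt(1.0 + b * b))


def numeric_auc(a, b, steps=2000):
    """Midpoint-rule integral of the ROC curve, used to cross-check the closed form."""
    h = 1.0 / steps
    return sum(binormal_roc((i + 0.5) * h, a, b) for i in range(steps)) * h


closed_form = binormal_auc(1.0, 1.0)
by_integration = numeric_auc(1.0, 1.0)
```

In this parameterization the whole curve is described by two coefficients, so only a small number of fitted values would need to be shared rather than individual scores.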

Results: In our simulation scenario, the true and distributed AUC measures differ by $\Delta\mathrm{AUC} < 0.01$, depending heavily on the choice of the differential privacy parameters. We recommend checking the accuracy of the distributed AUC estimator in specific simulation scenarios along with a reasonable choice of DP parameters, because its accuracy may be impaired by too much artificial noise added by DP.
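A check of this kind can be sketched as a small simulation: compute the empirical AUC on clean scores, add Gaussian noise to the scores, and compare. This is not the ROC-GLM procedure itself. The score distribution, the sample size, and the function names are hypothetical choices that only show how added noise perturbs an AUC estimate.

```python
import random


def empirical_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def delta_auc(sigma, n=400, seed=1):
    """Absolute AUC difference between clean scores and scores with N(0, sigma^2) noise."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    # Hypothetical scores: positives are shifted upward by one unit.
    scores = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]
    noisy = [s + rng.gauss(0.0, sigma) for s in scores]
    return abs(empirical_auc(scores, labels) - empirical_auc(noisy, labels))


for sigma in (0.0, 0.1, 0.5, 1.0):
    print(f"sigma={sigma:.1f}  delta_auc={delta_auc(sigma):.4f}")
```

Repeating such a run over several seeds and noise levels gives a rough picture of which DP settings keep the AUC difference within a chosen tolerance for a given scenario.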

Conclusions: The applicability of our algorithms depends on the $\ell_2$ sensitivity $\Delta_2(\hat{f})$ of the underlying statistical/predictive model. The simulations show that the approximation error is acceptable for the majority of simulated cases. For models with high $\Delta_2(\hat{f})$, the privacy parameters must be set correspondingly higher to ensure sufficient privacy protection, which affects the approximation error. This work shows that complex measures, such as the AUC, are applicable for validation in distributed setups while preserving an individual's privacy.
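The link between sensitivity, privacy parameters, and noise can be illustrated with the classical Gaussian mechanism, in which the noise standard deviation is sigma = Delta_2 * sqrt(2 ln(1.25 / delta)) / epsilon for 0 < epsilon < 1. Whether the paper calibrates its noise exactly this way is an assumption. The function names are hypothetical and the numbers are illustrative only.

```python
from math import log, sqrt


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise standard deviation of the classical (epsilon, delta) Gaussian mechanism."""
    if l2_sensitivity < 0.0:
        raise ValueError("sensitivity must be non-negative")
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the classical bound requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return l2_sensitivity * sqrt(2.0 * log(1.25 / delta)) / epsilon


def epsilon_for_sigma(l2_sensitivity, sigma, delta):
    """Epsilon implied by a fixed noise level; meaningful only if the result is below 1."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return l2_sensitivity * sqrt(2.0 * log(1.25 / delta)) / sigma


# Doubling the sensitivity doubles the noise needed at fixed (epsilon, delta).
low = gaussian_sigma(0.1, 0.5, 1e-5)
high = gaussian_sigma(0.2, 0.5, 1e-5)
```

Under this mechanism, a model with larger sensitivity needs either more noise at the same privacy parameters or a larger epsilon at the same noise level, which is one way to read the trade-off described in the conclusions.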

Source journal
BMC Medical Research Methodology (Medicine: Health Care)
CiteScore: 6.50
Self-citation rate: 2.50%
Articles published: 298
Review time: 3-8 weeks
Journal description: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles on methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.