Data adequacy bias impact in a data-blinded semi-supervised GAN for privacy-aware COVID-19 chest X-ray classification

Javier Pastorino, A. Biswas
DOI: 10.1145/3535508.3545560 (https://doi.org/10.1145/3535508.3545560)
Published in: Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Publication date: 2022-08-07

Abstract

Supervised machine learning models are, by definition, data-sighted: they require viewing all or most of the labeled training dataset. This paradigm presents two intertwined bottlenecks: the risk of exposing sensitive data samples to the third-party site where the machine learning engineers work, and the time-consuming, laborious, bias-prone nature of data annotation by personnel at the data source site. In this paper, we studied the learning impact of data adequacy as a source of bias in a data-blinded semi-supervised learning model for COVID-19 chest X-ray classification. Data-blindedness was applied to a semi-supervised generative adversarial network, which generates synthetic data based on only a few labeled data samples and concurrently learns to classify targets. We designed and developed a data-blind COVID-19 patient classifier that determines whether an individual is suffering from COVID-19 or another type of illness, with the ultimate goal of producing a system to assist in labeling large datasets. However, the availability of labels in the training data affected model performance, and when a new disease spreads, as COVID-19 did in 2019, access to labeled data may be limited. Here, we studied how bias in the per-class distribution of labeled samples affected classification performance for three models: a convolutional neural network based classifier (CNN), a semi-supervised GAN using the source data (SGAN), and our proposed data-blinded semi-supervised GAN (BSGAN). Data-blinding prevents machine learning engineers from directly accessing the source data during training, thereby ensuring data confidentiality. This was achieved by training the proposed model on synthetic data samples generated by a separate generative model. Our model achieved comparable performance, with a trade-off of 0.05 in AUC score between the privacy-aware model and the traditionally learned model, and it remained stable, following the same learning performance as the data distribution was changed.
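
The per-class bias in the labeled sample distribution described above can be illustrated with a minimal sketch. This is not the authors' code: the function name sample_labeled_subset, the class_fractions parameter, and the toy labels are all hypothetical, and the abstract does not state how the labeled subsets were actually drawn. The sketch only shows one plausible way to pick a fixed number of labeled samples with a chosen share per class, so that a classifier could be compared under balanced and skewed label availability.

```python
import random
from collections import Counter


def sample_labeled_subset(labels, n_labeled, class_fractions, seed=0):
    """Pick indices of samples whose labels stay visible to the learner.

    labels          -- list of class labels, one per sample
    n_labeled       -- total number of labeled samples to keep
    class_fractions -- dict mapping class -> share of the labeled subset
                       (hypothetical knob for the per-class label bias)
    """
    if abs(sum(class_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("class_fractions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    chosen = []
    for cls, fraction in class_fractions.items():
        # All sample indices that belong to this class.
        pool = [i for i, y in enumerate(labels) if y == cls]
        k = round(n_labeled * fraction)
        if k > len(pool):
            raise ValueError(f"not enough samples of class {cls!r}")
        # Draw k distinct indices from this class without replacement.
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)


# Toy labels only (assumption): 100 "covid" and 100 "other" samples,
# not the dataset used in the paper.
labels = ["covid"] * 100 + ["other"] * 100

balanced = sample_labeled_subset(labels, 20, {"covid": 0.5, "other": 0.5})
skewed = sample_labeled_subset(labels, 20, {"covid": 0.1, "other": 0.9})

print(Counter(labels[i] for i in balanced))
print(Counter(labels[i] for i in skewed))
```

In a setup like this, the remaining samples would be treated as unlabeled by a semi-supervised model, and the same evaluation metric (AUC in the paper) would be recorded for each labeled-subset distribution.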