Revealing transparency gaps in publicly available COVID-19 datasets used for medical artificial intelligence development—a systematic review

IF 23.8 1区 医学 Q1 MEDICAL INFORMATICS Lancet Digital Health Pub Date : 2024-10-23 DOI:10.1016/S2589-7500(24)00146-8
{"title":"Revealing transparency gaps in publicly available COVID-19 datasets used for medical artificial intelligence development—a systematic review","authors":"","doi":"10.1016/S2589-7500(24)00146-8","DOIUrl":null,"url":null,"abstract":"<div><div>During the COVID-19 pandemic, artificial intelligence (AI) models were created to address health-care resource constraints. Previous research shows that health-care datasets often have limitations, leading to biased AI technologies. This systematic review assessed datasets used for AI development during the pandemic, identifying several deficiencies. Datasets were identified by screening articles from MEDLINE and using Google Dataset Search. 192 datasets were analysed for metadata completeness, composition, data accessibility, and ethical considerations. Findings revealed substantial gaps: only 48% of datasets documented individuals’ country of origin, 43% reported age, and under 25% included sex, gender, race, or ethnicity. Information on data labelling, ethical review, or consent was frequently missing. Many datasets reused data with inadequate traceability. Notably, historical paediatric chest x-rays appeared in some datasets without acknowledgment. These deficiencies highlight the need for better data quality and transparent documentation to lessen the risk that biased AI models are developed in future health emergencies.</div></div>","PeriodicalId":48534,"journal":{"name":"Lancet Digital Health","volume":null,"pages":null},"PeriodicalIF":23.8000,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Lancet Digital Health","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2589750024001468","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0

Abstract

During the COVID-19 pandemic, artificial intelligence (AI) models were created to address health-care resource constraints. Previous research shows that health-care datasets often have limitations, leading to biased AI technologies. This systematic review assessed datasets used for AI development during the pandemic, identifying several deficiencies. Datasets were identified by screening articles from MEDLINE and using Google Dataset Search. 192 datasets were analysed for metadata completeness, composition, data accessibility, and ethical considerations. Findings revealed substantial gaps: only 48% of datasets documented individuals’ country of origin, 43% reported age, and under 25% included sex, gender, race, or ethnicity. Information on data labelling, ethical review, or consent was frequently missing. Many datasets reused data with inadequate traceability. Notably, historical paediatric chest x-rays appeared in some datasets without acknowledgment. These deficiencies highlight the need for better data quality and transparent documentation to lessen the risk that biased AI models are developed in future health emergencies.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
揭示用于医学人工智能开发的 COVID-19 公开数据集的透明度差距--系统综述。
在 COVID-19 大流行期间,人们创建了人工智能(AI)模型来解决医疗资源紧张的问题。以往的研究表明,医疗数据集往往存在局限性,从而导致人工智能技术出现偏差。本系统性综述评估了大流行期间用于人工智能开发的数据集,发现了一些不足之处。数据集是通过筛选MEDLINE上的文章和使用谷歌数据集搜索确定的。对 192 个数据集的元数据完整性、组成、数据可访问性和伦理因素进行了分析。研究结果显示存在很大差距:只有 48% 的数据集记录了个人的原籍国,43% 的数据集报告了年龄,不到 25% 的数据集包含性、性别、种族或民族。数据标签、伦理审查或同意书方面的信息经常缺失。许多数据集重复使用了可追溯性不足的数据。值得注意的是,一些数据集中出现了历史性的儿科胸部 X 光片,但并未注明。这些缺陷凸显了提高数据质量和文档透明度的必要性,以降低在未来的突发卫生事件中开发出有偏见的人工智能模型的风险。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
41.20
自引率
1.60%
发文量
232
审稿时长
13 weeks
期刊介绍: The Lancet Digital Health publishes important, innovative, and practice-changing research on any topic connected with digital technology in clinical medicine, public health, and global health. The journal’s open access content crosses subject boundaries, building bridges between health professionals and researchers.By bringing together the most important advances in this multidisciplinary field,The Lancet Digital Health is the most prominent publishing venue in digital health. We publish a range of content types including Articles,Review, Comment, and Correspondence, contributing to promoting digital technologies in health practice worldwide.
期刊最新文献
Artificial intelligence-enabled electrocardiogram for mortality and cardiovascular risk estimation: a model development and validation study COVID-19 testing and reporting behaviours in England across different sociodemographic groups: a population-based study using testing data and data from community prevalence surveillance surveys Fairly evaluating the performance of normative models – Authors' reply Fairly evaluating the performance of normative models Lifting the veil on health datasets
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1