Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database.

IF 6.1 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Biodata Mining Pub Date : 2024-07-12 DOI:10.1186/s13040-024-00373-1

Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar

{"title":"Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database.","authors":"Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar","doi":"10.1186/s13040-024-00373-1","DOIUrl":null,"url":null,"abstract":"Background: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.Results: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.Conclusions: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"17 1","pages":"22"},"PeriodicalIF":6.1000,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11245804/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-024-00373-1","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Background: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.

Results: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.

Conclusions: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

构建放射学网络：注释大规模多模态医学数据库的无监督方法。

背景：近年来，随着计算机辅助诊断系统的发展，机器学习在医学诊断和治疗中的应用有了显著增长，这些系统通常是基于有注释的医学放射图像。然而，由于注释过程耗时且成本高昂，缺乏大型注释图像数据集仍是一大障碍。本研究旨在通过提出一种基于语义相似性的自动注释大型医学放射图像数据库的方法来克服这一挑战：结果：采用一种自动化、无监督的方法创建了一个大型医学放射图像注释数据集，该数据集来自克罗地亚里耶卡临床医院中心。该管道是通过对三种不同类型的医疗数据进行数据挖掘而建立的：图像、DICOM 元数据和叙述性诊断。然后将最佳特征提取器集成到多模态表示中，再对其进行聚类，从而创建一个自动管道，将包含 1,337,926 张医疗图像的前体数据集标记为 50 个视觉相似图像集群。考虑到解剖区域和模式表示，通过检查其同质性和互信息来评估聚类的质量：结果表明，在对大规模医疗数据进行无监督聚类时，将所有三个数据源的嵌入融合在一起可获得最佳结果，并产生最简洁的聚类。因此，这项工作标志着我们朝着建立一个更大、更精细的医学放射图像注释数据集迈出了第一步。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.