Biodata Mining: Latest Articles

The goldmine of GWAS summary statistics: a systematic review of methods and tools.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-09-05; DOI: 10.1186/s13040-024-00385-x
Panagiota I Kontou, Pantelis G Bagos

Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
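Meta-analysis is one of the summary-statistics uses the abstract names. As a minimal sketch, assuming each study reports an effect size and a standard error for the same variant and allele, a fixed-effect inverse-variance combination could look like the following. The function name and the numbers are hypothetical and are not taken from the review or from any specific tool it covers.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one variant.

    betas: per-study effect sizes (e.g. log odds ratios), same effect allele.
    ses:   per-study standard errors, all strictly positive.
    Returns (pooled beta, pooled SE, z-score, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]      # w_i = 1 / se_i^2
    total_w = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_w
    se = math.sqrt(1.0 / total_w)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # two-sided normal p-value
    return beta, se, z, p


# Hypothetical summary statistics for one variant in three studies.
beta, se, z, p = fixed_effect_meta([0.10, 0.14, 0.08], [0.04, 0.05, 0.03])
print(f"beta={beta:.4f} se={se:.4f} z={z:.2f} p={p:.2e}")
```

Real pipelines also need allele harmonisation and quality control before pooling, which this sketch leaves out.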

Biodata Mining 17(1):31 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11375927/pdf/
Citations: 0
Processing imbalanced medical data at the data level with assisted-reproduction data as an example.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-09-04; DOI: 10.1186/s13040-024-00384-y
Junliang Zhu, Shaowei Pu, Jiaji He, Dongchao Su, Weijie Cai, Xueying Xu, Hongbo Liu

Objective: Data imbalance is a pervasive issue in medical data mining, often leading to biased and unreliable predictive models. This study aims to address the urgent need for effective strategies to mitigate the impact of data imbalance on classification models. We focus on quantifying the effects of different imbalance degrees and sample sizes on model performance, identifying optimal cut-off values, and evaluating the efficacy of various methods to enhance model accuracy in highly imbalanced and small sample size scenarios.

Methods: We collected medical records of patients receiving assisted reproductive treatment in a reproductive medicine center. Random forest was used to screen the key variables for the prediction target. Various datasets with different imbalance degrees and sample sizes were constructed to compare the classification performance of logistic regression models. Metrics such as AUC, G-mean, F1-Score, Accuracy, Recall, and Precision were used for evaluation. Four imbalance treatment methods (SMOTE, ADASYN, OSS, and CNN) were applied to datasets with low positive rates and small sample sizes to assess their effectiveness.
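Most of the metrics listed above follow directly from the confusion matrix. The sketch below uses the standard definitions, with G-mean taken as the geometric mean of recall and specificity; it is not the authors' code, and AUC is omitted because it needs predicted scores rather than labels. The example labels are hypothetical.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
    }


# Hypothetical data with a 10% positive rate and a model that always says "negative".
y_true = [1] + [0] * 9
y_pred = [0] * 10
print(classification_metrics(y_true, y_pred))
```

On this toy input accuracy is 0.9 while recall, F1, and G-mean are all 0, which is why accuracy alone is a poor guide on imbalanced data.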

Results: The logistic model's performance was low when the positive rate was below 10% but stabilized beyond this threshold. Similarly, sample sizes below 1200 yielded poor results, with improvement seen above this threshold. For robustness, the optimal cut-offs for positive rate and sample size were identified as 15% and 1500, respectively. SMOTE and ADASYN oversampling significantly improved classification performance in datasets with low positive rates and small sample sizes.

Conclusions: The study identifies a positive rate of 15% and a sample size of 1500 as optimal cut-offs for stable logistic model performance. For datasets with low positive rates and small sample sizes, SMOTE and ADASYN are recommended to improve balance and model accuracy.
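The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified illustration of that idea in plain Python, assuming numeric features and Euclidean distance. The function name, defaults, and toy data are hypothetical, and this is not the implementation used in the study.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(base, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority class with two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points), "synthetic samples")
```

Oversampling of this kind should be applied to the training split only, so that synthetic points never leak into the evaluation data.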

Biodata Mining 17(1):29 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11373105/pdf/
Citations: 0
QIGTD: identifying critical genes in the evolution of lung adenocarcinoma with tensor decomposition.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-09-04; DOI: 10.1186/s13040-024-00386-w
Bolin Chen, Jinlei Zhang, Ci Shao, Jun Bian, Ruiming Kang, Xuequn Shang

Background: Identifying critical genes is important for understanding the pathogenesis of complex diseases. Traditional studies typically compare changes in biomolecules between normal and disease samples, or detect important vertices in a single static biomolecular network, and therefore often overlook the dynamic changes that occur between different disease stages. However, investigating temporal changes in biomolecular networks and identifying critical genes are essential for understanding the occurrence and development of diseases.

Methods: This study proposes a novel method called Quantifying Importance of Genes with Tensor Decomposition (QIGTD). It first constructs a time series network by integrating both intra- and inter-temporal network information, preserving connections between networks at adjacent stages according to their local similarities. A tensor describes the connections of this time series network, and a third-order tensor decomposition method captures both the topological information of each network snapshot and the time series characteristics of the whole network. QIGTD is also learning-free and efficient, so it can be applied to datasets with a small number of samples.

Results: QIGTD was evaluated on lung adenocarcinoma (LUAD) datasets, with three state-of-the-art methods (T-degree, T-closeness, and T-betweenness) as benchmarks. Numerical experiments show that QIGTD outperforms these methods in both precision and mAP. Notably, of the top 50 genes, 29 have been verified as highly related to LUAD according to the DisGeNET database, and 36 are significantly enriched in LUAD-related Gene Ontology (GO) terms, including nuclear division, mitotic nuclear division, chromosome segregation, organelle fission, and mitotic sister chromatid segregation.

Conclusion: QIGTD effectively captures the temporal changes in gene networks and identifies critical genes. It provides a valuable tool for studying temporal dynamics in biological networks and can aid in understanding the underlying mechanisms of diseases such as LUAD.
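To make the data structure concrete, the sketch below stacks per-stage gene networks into a third-order adjacency tensor (stage x gene x gene) and ranks genes by a simple temporal degree, used here as a stand-in for the T-degree baseline. It does not implement QIGTD's inter-temporal links or its tensor decomposition, the degree definition is an assumption, and all gene names and edges are hypothetical.

```python
def build_tensor(genes, snapshots):
    """Return tensor[t][i][j] = 1 if genes i and j are linked at stage t."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    tensor = []
    for edges in snapshots:
        adjacency = [[0] * n for _ in range(n)]
        for u, v in edges:
            i, j = index[u], index[v]
            adjacency[i][j] = adjacency[j][i] = 1  # undirected network
        tensor.append(adjacency)
    return tensor


def temporal_degree(tensor):
    """Sum each gene's degree over all stages (assumed baseline definition)."""
    n = len(tensor[0])
    return [sum(sum(adjacency[i]) for adjacency in tensor) for i in range(n)]


def precision_at_k(ranked_genes, known_genes, k):
    """Fraction of the top-k ranked genes found in a reference gene set."""
    return sum(1 for gene in ranked_genes[:k] if gene in known_genes) / k


# Hypothetical three-stage network over four genes.
genes = ["g1", "g2", "g3", "g4"]
snapshots = [
    [("g1", "g2")],
    [("g1", "g2"), ("g1", "g3")],
    [("g1", "g3"), ("g1", "g4"), ("g2", "g3")],
]
scores = temporal_degree(build_tensor(genes, snapshots))
ranked = sorted(genes, key=lambda g: -scores[genes.index(g)])
print(ranked, precision_at_k(ranked, {"g1", "g3"}, k=2))
```

A decomposition-based score would replace `temporal_degree` while the tensor construction and the precision check stay the same.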

Biodata Mining 17(1):30 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11376055/pdf/
Citations: 0
Seven quick tips for gene-focused computational pangenomic analysis.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-09-03; DOI: 10.1186/s13040-024-00380-2
Vincenzo Bonnici, Davide Chicco

Pangenomics is a relatively new scientific field that investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally referred to bacterial genomes and was later extended to human genomes as well. Modern bioinformatics offers several tools for analyzing pangenomics data, paving the way for an emerging field that we can call computational pangenomics. The computational power now available to the bioinformatics community has made computational pangenomic analyses easy to perform, but this greater accessibility also increases the chance of making mistakes and producing misleading or inflated results, especially for beginners. To address this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses, with a focus on bacterial pangenomics, describing common mistakes to avoid and experience-based best practices to follow in this field. We believe our recommendations can help readers perform more robust and sound pangenomic analyses and generate more reliable results.
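In a gene-focused view, the "union of all the genomes" can be illustrated with plain set operations over gene-family presence per genome. The sketch below assumes genes have already been clustered into families; the strain names, family names, and the core/accessory/unique split are illustrative conventions rather than definitions taken from the article.

```python
def pangenome_partitions(genomes):
    """Split gene families into pan, core, accessory, and unique sets.

    genomes: dict mapping a genome name to an iterable of gene-family IDs.
    """
    gene_sets = [set(families) for families in genomes.values()]
    if not gene_sets:
        raise ValueError("at least one genome is required")
    pan = set().union(*gene_sets)                  # present in any genome
    core = set.intersection(*gene_sets)            # present in every genome
    counts = {family: sum(family in s for s in gene_sets) for family in pan}
    unique = {family for family, c in counts.items() if c == 1}
    accessory = pan - core - unique                # shared by some, not all
    return pan, core, accessory, unique


# Hypothetical presence lists for three bacterial strains.
genomes = {
    "strain_A": ["famA", "famB", "famC"],
    "strain_B": ["famA", "famB", "famD"],
    "strain_C": ["famA", "famC", "famE"],
}
pan, core, accessory, unique = pangenome_partitions(genomes)
print(sorted(pan), sorted(core), sorted(accessory), sorted(unique))
```

The partition is only as good as the upstream gene clustering: fragmented assemblies or inconsistent annotation will shrink the core set artificially.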

Biodata Mining 17(1):28 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11370085/pdf/
Citations: 0
Deep learning for automatic calcium detection in echocardiography.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-08-28; DOI: 10.1186/s13040-024-00381-1
Luís B Elvas, Sara Gomes, João C Ferreira, Luís Brás Rosário, Tomás Brandão

Cardiovascular diseases are the leading cause of death worldwide, and cardiovascular imaging techniques are the mainstay of noninvasive diagnosis. Aortic stenosis is a lethal cardiac disease that is preceded by several years of aortic valve calcification. Data-driven tools built with Deep Learning (DL) algorithms can process and categorize medical image data, providing fast and reliable diagnoses that improve healthcare effectiveness. A systematic review of DL applications to medical images for pathologic calcium detection concluded that established techniques exist in this field, but they rely primarily on CT scans and therefore on radiation exposure. Echocardiography is an unexplored alternative for detecting calcium, but it still needs technological development. In this article, a fully automated method based on Convolutional Neural Networks (CNNs) was developed to detect aortic calcification in echocardiography images. It consists of two essential processes: (1) an object detector that locates the aortic valve, achieving 95% precision and 100% recall; and (2) a classifier that identifies calcium structures in the valve, achieving 92% precision and 100% recall. This work shows that aortic valve calcification, a lethal and prevalent disease, can potentially be detected automatically from echocardiography.
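The two-stage structure described above (locate the valve, then classify the cropped region) can be sketched as a small pipeline function. In the sketch below the detector and classifier are passed in as callables; the toy stand-ins in the usage lines are hypothetical placeholders for trained CNNs, and the box format is an assumption.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]            # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]      # (top, left, bottom, right), end-exclusive


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def detect_calcification(
    image: Image,
    locate_valve: Callable[[Image], Optional[Box]],
    classify_crop: Callable[[Image], bool],
) -> Optional[bool]:
    """Stage 1 finds the valve; stage 2 classifies only the cropped region."""
    box = locate_valve(image)
    if box is None:
        return None                  # no valve found, so stage 2 is skipped
    return classify_crop(crop(image, box))


# Toy stand-ins for the two models (hypothetical, not trained networks).
frame = [[0.1] * 4 for _ in range(4)]
frame[2][2] = 0.9                    # one bright pixel inside the "valve" box
fixed_box_detector = lambda img: (1, 1, 3, 3)
bright_pixel_classifier = lambda region: max(max(r) for r in region) > 0.8
print(detect_calcification(frame, fixed_box_detector, bright_pixel_classifier))
```

Because the classifier only sees what the detector returns, detector recall bounds what the whole pipeline can find, which makes the per-stage recall figures important.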

Biodata Mining 17(1):27 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11351547/pdf/
Citations: 0
Integrating transcriptomics and proteomics to analyze the immune microenvironment of cytomegalovirus associated ulcerative colitis and identify relevant biomarkers.
IF 4; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-08-27; DOI: 10.1186/s13040-024-00382-0
Yang Chen, Qingqing Zheng, Hui Wang, Peiren Tang, Li Deng, Pu Li, Huan Li, Jianhong Hou, Jie Li, Li Wang, Jun Peng

Background: In recent years, significant morbidity and mortality in patients with severe inflammatory bowel disease (IBD) and cytomegalovirus (CMV) have drawn considerable attention to the status of CMV infection in the intestinal mucosa of IBD patients and its role in disease progression. However, there is currently no high-throughput sequencing data for ulcerative colitis patients with CMV infection (CMV + UC), and the immune microenvironment in CMV + UC patients has yet to be explored.

Method: The xCell algorithm was used to evaluate the immune microenvironment of CMV + UC patients. Then, WGCNA was used to obtain the co-expression modules linking abnormal immune cells to gene-level or protein-level data. Next, three machine learning approaches (Random Forest, SVM-RFE, and Lasso) were used to filter candidate biomarkers. Finally, a Best Subset Selection algorithm was used to construct the diagnostic model.

Results: In this study, we performed transcriptomic and proteomic sequencing on CMV + UC patients to establish a comprehensive immune microenvironment profile and found 11 specific abnormal immune cell types in the CMV + UC group. After using multi-omics integration algorithms, we identified seven co-expression gene modules and five co-expression protein modules. Subsequently, we utilized various machine learning algorithms to identify key biomarkers with diagnostic efficacy and constructed an early diagnostic model. We identified a total of eight biomarkers (PPP1R12B, CIRBP, CSNK2A2, DNAJB11, PIK3R4, RRBP1, STX5, TMEM214) that play crucial roles in the immune microenvironment of CMV + UC and exhibit superior diagnostic performance for CMV + UC.

Conclusion: This eight-biomarker model offers a new paradigm for the diagnosis and treatment of IBD patients post-CMV infection. Further research into this model will be significant for understanding the changes in the host immune microenvironment following CMV infection.
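The abstract does not say how the outputs of the three selectors were combined, so the sketch below assumes a simple vote across selectors, followed by an exhaustive best-subset search with a pluggable scoring function. The feature names, the per-feature gains, and the size penalty are all hypothetical; in practice the score would be something like a cross-validated AUC or an information criterion from a fitted model.

```python
from itertools import combinations


def consensus_features(selections, min_votes=None):
    """Keep features chosen by at least min_votes selectors (default: all)."""
    if min_votes is None:
        min_votes = len(selections)
    votes = {}
    for chosen in selections.values():
        for feature in set(chosen):
            votes[feature] = votes.get(feature, 0) + 1
    return sorted(f for f, v in votes.items() if v >= min_votes)


def best_subset(features, score, max_size):
    """Exhaustively score every subset up to max_size and return the best one."""
    candidates = (subset
                  for size in range(1, max_size + 1)
                  for subset in combinations(features, size))
    return max(candidates, key=score)


# Hypothetical selector outputs (placeholder names, not the study's genes).
selections = {
    "random_forest": ["G1", "G2", "G3", "G5"],
    "svm_rfe": ["G2", "G3", "G4"],
    "lasso": ["G2", "G3", "G5"],
}
candidates = consensus_features(selections, min_votes=2)

# Hypothetical score: summed per-feature gain minus a penalty per feature.
gain = {"G2": 0.5, "G3": 0.3, "G5": 0.1}
score = lambda subset: sum(gain[f] for f in subset) - 0.15 * len(subset)
print(candidates, best_subset(candidates, score, max_size=3))
```

Exhaustive search grows as 2^n in the number of candidates, which is why a filtering step before best-subset selection is useful.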

Biodata Mining 17(1):26 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11348729/pdf/
Citations: 0
Understanding predictions of drug profiles using explainable machine learning models
IF 4.5; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY; Pub Date: 2024-08-01; DOI: 10.1186/s13040-024-00378-w
Caroline König, Alfredo Vellido
The analysis of absorption, distribution, metabolism, and excretion (ADME) molecular properties is of relevance to drug design, as they directly influence the drug’s effectiveness at its target location. This study concerns their prediction, using explainable Machine Learning (ML) models. The aim of the study is to find which molecular features are relevant to the prediction of the different ADME properties and to measure their impact on the predictive model. The relative relevance of individual features for ADME activity is gauged by estimating feature importance in the ML models’ predictions. Feature importance is calculated using feature permutation, and the individual impact of features is measured by SHapley Additive exPlanations (SHAP). The study reveals the relevance of specific molecular descriptors for each ADME property and quantifies their impact on the ADME property prediction. The reported research illustrates how explainable ML models can provide detailed insights into the individual contributions of molecular features to the final prediction of an ADME property, in an effort to support experts in the process of drug candidate selection through a better understanding of the impact of molecular features.
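The feature-permutation idea mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the data, the three descriptors, and `toy_model` are all hypothetical, and the error measure (mean squared error) is an assumption. SHAP is not covered here. The importance of a feature is taken as the average increase in error after its column is shuffled.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions on rows X against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical toy data: three made-up descriptors per molecule.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
# Toy target depends strongly on descriptor 0, weakly on 1, not at all on 2.
y = [3.0 * a + 0.5 * b for a, b, _ in X]

def toy_model(row):
    """Stand-in for a fitted ML model; here it is simply the true function."""
    return 3.0 * row[0] + 0.5 * row[1]

importances = permutation_importance(toy_model, X, y)
```

In this toy setting the ranking of `importances` follows the coefficients of the target, and the descriptor the model ignores receives an importance of zero, because shuffling it cannot change any prediction.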
Citations: 0
Modelling the nicotine pharmacokinetic profile for e-cigarettes using real time monitoring of consumers' physiological measurements and mouth level exposure.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-07-17 DOI: 10.1186/s13040-024-00375-z
Krishna Prasad, Allen Griffiths, Kavya Agrawal, Michael McEwan, Flavio Macci, Marco Ghisoni, Matthew Stopher, Matthew Napleton, Joel Strickland, David Keating, Thomas Whitehead, Gareth Conduit, Stacey Murray, Lauren Edward

Pharmacokinetic (PK) studies can provide essential information on the abuse liability of nicotine and tobacco products, but they are intrusive and must be conducted in a clinical environment. The objective of the study was to explore whether changes in plasma nicotine levels following use of an e-cigarette can be predicted from real-time monitoring, using wearable devices, of physiological parameters and mouth level exposure (MLE) to nicotine before, during, and after e-cigarette vaping. Such an approach would allow an effective pre-screening process, reducing the number of clinical studies, the number of products to be tested, and the number of blood draws required in a clinical PK study. Establishing such a prediction model might facilitate the longitudinal collection of data on product use and nicotine expression among consumers using nicotine products in their normal environments, thereby reducing the need for intrusive clinical studies while generating PK data related to product use in the real world.

An exploratory machine learning model was developed to predict changes in plasma nicotine levels following the use of an e-cigarette from real-time monitoring of physiological parameters and MLE to nicotine before, during, and after e-cigarette vaping. This preliminary study identified key parameters, such as heart rate (HR), heart rate variability (HRV), and physiological stress (PS), that may act as predictors for an individual's plasma nicotine response (PK curve). Relative to baseline measurements (per participant), HR showed a significant increase for nicotine-containing e-liquids and was consistent across sessions (intra-participant). Imputing missing values and training the model on all data resulted in a 57% improvement from the original 'learning' data and achieved a median validation R² of 0.70.

The study is in its exploratory phase, with limitations including a small and non-diverse sample and reliance on data from a single e-cigarette product. These findings necessitate further research for validation and to enhance the model's generalisability and applicability in real-world settings. This study serves as a foundational step towards developing non-intrusive PK models for nicotine product use.
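For readers unfamiliar with the metric reported above, the coefficient of determination and its median over several validation sets can be computed as follows. This is only a sketch of the standard formula, R² = 1 − SS_res / SS_tot. The three folds and all numbers in them are hypothetical, and the abstract does not say over which splits the median was taken.

```python
from statistics import mean, median

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical (observed, predicted) values for three validation folds.
folds = [
    ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),  # perfect prediction
    ([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]),  # always predicts the mean
    ([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5]),  # small errors
]
scores = [r_squared(observed, predicted) for observed, predicted in folds]
median_r2 = median(scores)
```

A model that always predicts the mean of the observations scores 0, and a perfect model scores 1, so a median of 0.70 sits between those two reference points.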

Citations: 0
Construction and application of medication reminder system: intelligent generation of universal medication schedule.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-07-15 DOI: 10.1186/s13040-024-00376-y
Hangxing Huang, Lu Zhang, Yongyu Yang, Ling Huang, Xikui Lu, Jingyang Li, Huimin Yu, Shuqiao Cheng, Jian Xiao

Background: Patients with chronic conditions need multiple medications daily to manage their condition. However, most patients have poor compliance, which affects the effectiveness of treatment. To address these challenges, we established a medication reminder system for the intelligent generation of a universal medication schedule (UMS), to remind patients with chronic diseases to take their medication accurately and to improve the safety of home medication.

Methods: We designed a medication time constraint with one drug (MTCOD) for each drug and a medication time constraint with multi-drug (MTCMD) for each pair of drugs, in order to better regulate the interval and timing of patients' medication. We then established a medication reminder system consisting of a cloud database of drug information, an operator terminal for medical staff, and a patient terminal.

Results: The cloud database holds a total of 153,916 pharmaceutical products, 496,708 drug interaction records, and 153,390 pharmaceutical product-ingredient pairs. There are 153,916 MTCOD records and 8,552,712 MTCMD records. An intelligent UMS medication reminder system was constructed. The system can read patients' prescription information and provide chronic patients with personalized medication guidance, including a medication timeline. At the same time, patients can query medication information and get remote pharmacy guidance in real time.

Conclusions: Overall, the medication reminder system provides intelligent medication reminders, automatic drug interaction identification, and a monitoring system, which helps to monitor the entire treatment process in patients with chronic diseases.
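To make the two kinds of constraint concrete, the sketch below places doses greedily into fixed daily slots. Everything in it is an assumption for illustration: the abstract does not define the contents of MTCOD or MTCMD records, so a per-drug constraint is modelled here as (doses per day, minimum hours between doses) and a pairwise constraint as a minimum gap in hours between two drugs. The slot times and drug names are hypothetical.

```python
# Hypothetical daily time slots in hours (an assumption, not from the source).
SLOTS = [8, 12, 18, 22]

def violates(schedule, drug, slot, pair_gap):
    """True if taking `drug` at `slot` is too close to an already placed drug."""
    for other, times in schedule.items():
        gap = pair_gap.get(frozenset((drug, other)), 0)
        if gap and any(abs(slot - t) < gap for t in times):
            return True
    return False

def build_schedule(single, pair_gap):
    """Greedy placement; single[drug] = (doses_per_day, min_interval_hours)."""
    schedule = {}
    for drug, (doses, min_interval) in single.items():
        times = []
        for slot in SLOTS:
            if len(times) == doses:
                break
            # Per-drug constraint: keep doses of the same drug far enough apart.
            if times and slot - times[-1] < min_interval:
                continue
            # Pairwise constraint: keep away from conflicting drugs.
            if violates(schedule, drug, slot, pair_gap):
                continue
            times.append(slot)
        if len(times) < doses:
            raise ValueError(f"no feasible slots for {drug}")
        schedule[drug] = times
    return schedule

# Made-up prescription: drug_a and drug_b must be taken at least 2 hours apart.
single = {"drug_a": (2, 8), "drug_b": (1, 0), "drug_c": (2, 4)}
pair_gap = {frozenset(("drug_a", "drug_b")): 2}
schedule = build_schedule(single, pair_gap)
```

A greedy pass like this can fail on prescriptions that do have a valid timetable, because it never revisits an earlier placement. A production system would more likely use backtracking or a constraint solver.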

Citations: 0
Building RadiologyNET: an unsupervised approach to annotating a large-scale multimodal medical database.
IF 4 Zone 3 Biology Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2024-07-12 DOI: 10.1186/s13040-024-00373-1
Mateja Napravnik, Franko Hržić, Sebastian Tschauner, Ivan Štajduhar

Background: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.

Results: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.

Conclusions: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
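The two cluster-quality measures named in the Results can be written out directly from their textbook definitions. The sketch below is an illustration under assumptions: the abstract does not state which exact variants (for example, normalized mutual information) were used, and the modality labels and cluster assignments are made up. Homogeneity is computed as 1 − H(C|K)/H(C), which equals I(C;K)/H(C) for reference classes C and clusters K.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def mutual_information(classes, clusters):
    """I(C; K) = sum over (c, k) of p(c,k) * log(p(c,k) / (p(c) * p(k)))."""
    n = len(classes)
    joint = Counter(zip(classes, clusters))
    n_c = Counter(classes)
    n_k = Counter(clusters)
    return sum(
        (n_ck / n) * log(n_ck * n / (n_c[c] * n_k[k]))
        for (c, k), n_ck in joint.items()
    )

def homogeneity(classes, clusters):
    """I(C;K) / H(C); 1.0 means every cluster contains a single class."""
    h_c = entropy(classes)
    return 1.0 if h_c == 0 else mutual_information(classes, clusters) / h_c

# Hypothetical reference labels (imaging modality) for six images.
modality = ["CT", "CT", "MR", "MR", "XR", "XR"]
pure = [0, 0, 1, 1, 2, 2]      # each cluster holds one modality
mixed = [0, 0, 0, 0, 1, 1]     # cluster 0 mixes CT and MR
single = [0, 0, 0, 0, 0, 0]    # everything in one cluster

h_pure = homogeneity(modality, pure)
h_mixed = homogeneity(modality, mixed)
h_single = homogeneity(modality, single)
```

Homogeneity drops as clusters start to mix reference classes and reaches zero when the clustering carries no information about them, which is why it suits a check against anatomical region or modality.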

Citations: 0