Pub Date: 2025-11-01; Epub Date: 2025-10-01; DOI: 10.1016/j.jbi.2025.104920
Şeyma Selcan Mağara, Noah Dietrich, Ali Burak Ünal, Mete Akgün
Objective:
Record linkage is essential for integrating data from multiple sources, with diverse applications in real-world healthcare and research. Probabilistic Privacy-Preserving Record Linkage (PPRL) enables this integration while protecting sensitive information from unauthorized access, especially when datasets lack exact identifiers. As privacy regulations evolve and multi-institutional collaborations expand globally, there is a growing demand for methods that effectively balance security, accuracy, and efficiency. However, ensuring both privacy and scalability in large-scale record linkage remains a key challenge.
Method:
This paper presents a novel and efficient PPRL method based on a secure three-party multi-party computation (MPC) framework. Our approach allows multiple parties to compute linkage results without exposing their private inputs and significantly improves the speed of the linkage process compared to existing PPRL solutions.
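The MPC idea behind the method can be illustrated with additive secret sharing, a standard building block of three-party protocols. The sketch below is not the authors' protocol; it only shows how three parties can hold random-looking shares of two private values and compute their sum without any single party seeing either input (`MOD`, `share`, and `reconstruct` are illustrative names):

```python
import random

MOD = 2**32  # shares live in the ring of integers mod 2^32

def share(value, n_parties=3):
    """Split an integer into n additive shares that sum to value mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

def reconstruct(shares):
    """Recombine shares; requires all parties' shares, so no single party learns anything."""
    return sum(shares) % MOD

# Two private values are secret-shared across three computing parties.
a_shares = share(7)
b_shares = share(35)

# Addition is purely local: each party adds its own two shares.
sum_shares = [(x + y) % MOD for x, y in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 42
```

Real PPRL protocols build similarity scores (e.g., secure comparisons over shared Bloom-filter encodings) from primitives like this; the point here is only that arithmetic on shares never exposes the underlying records.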
Result:
Our method preserves the linkage quality of a state-of-the-art (SOTA) MPC-based PPRL method while achieving up to 14 times faster performance. For example, linking a record against a database of 10,000 records takes just 8.74 s in a realistic network with 700 Mbps bandwidth and 60 ms latency, compared to 92.32 s with the SOTA method. Even on a slower internet connection with 100 Mbps bandwidth and 60 ms latency, the linkage completes in 28 s, whereas the SOTA method requires 287.96 s. These results demonstrate the significant scalability and efficiency improvements of our approach.
Conclusion:
Our novel PPRL method, based on secure 3-party computation, offers an efficient and scalable solution for large-scale record linkage while ensuring privacy protection. The approach demonstrates significant performance improvements, making it a promising tool for secure data integration in privacy-sensitive sectors.
Accelerating probabilistic privacy-preserving medical record linkage: A three-party MPC approach. Journal of Biomedical Informatics 171 (2025), Article 104920.
Objective
Biobanks and biomolecular resources are increasingly central to data-driven biomedical research, encompassing not only metadata but also granular, sample-related data from diverse sources such as healthcare systems, national registries, and research outputs. However, the lack of a standardised, machine-readable format for representing such data limits interoperability, data reuse and integration into clinical and research environments. While MIABIS provides a conceptual model for biobank data, its abstract nature and reliance on heterogeneous implementations create barriers to practical, scalable adoption. This study presents a pragmatic, operational implementation of MIABIS focused on enabling real-world exchange and integration of sample-level data.
Methods
We systematically evaluated established data exchange standards, comparing HL7 FHIR and OMOP CDM with respect to their suitability for structuring sample-related data in a semantically robust and machine-readable form. Based on this analysis, we developed a FHIR-based representation of MIABIS that supports complex biobank structures and enables integration with federated data infrastructures. Supporting tools, including a Python library and an implementation guide, were created to ensure usability across diverse research and clinical contexts.
Results
We created nine interoperable FHIR profiles covering core MIABIS entities, ensuring consistency with FHIR standards. To support adoption, we developed an open-source Python library that abstracts FHIR interactions and provides schema validation for MIABIS-compliant data. The library was integrated into an ETL tool in operation at the Czech Node of BBMRI-ERIC (the European Biobanking and Biomolecular Resources Research Infrastructure) to demonstrate usability with real-world sample-related data. Separately, we validated the representation of MIABIS entities at the organisational level by converting the data structures of the BBMRI-ERIC Directory into FHIR, demonstrating compatibility with federated data infrastructures.
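To give a flavor of what a machine-readable representation of sample data looks like, the sketch below builds a minimal FHIR R4 Specimen resource as plain JSON and runs a toy schema check. This is an illustration only, not the authors' library: `make_specimen` and `validate_specimen` are hypothetical helpers, and real MIABIS profiles impose far more constraints than this check does.

```python
import json

def make_specimen(specimen_id, material_code, collected_date):
    """Build a minimal FHIR R4 Specimen resource as a plain dict.
    Hypothetical helper for illustration; MIABIS profiles constrain
    many more elements (codings, extensions, references)."""
    return {
        "resourceType": "Specimen",
        "id": specimen_id,
        "type": {"coding": [{"code": material_code}]},
        "collection": {"collectedDateTime": collected_date},
    }

def validate_specimen(resource):
    """Toy schema check: correct resource type and required keys present."""
    required = {"resourceType", "id", "type", "collection"}
    return resource.get("resourceType") == "Specimen" and required <= resource.keys()

spec = make_specimen("sample-001", "blood-plasma", "2024-05-01")
assert validate_specimen(spec)
payload = json.dumps(spec)  # serialized form, ready to exchange with a FHIR server
```

The value of a profile-backed library is exactly this pattern at scale: construction and validation are centralized, so every exporting biobank emits structurally identical resources.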
Conclusion
This work delivers a machine-readable, interoperable implementation of MIABIS, enabling the exchange of both organisational and sample-level data across biobanks and health information systems. By integrating MIABIS with HL7 FHIR, we provide a host of reusable tools and mechanisms for further evolution of the data model. Combined, these benefits can help with the integration into clinical and research workflows, supporting data discoverability, reuse, and cross-institutional collaboration in biomedical research.
Definitions to data flow: Operationalizing MIABIS in HL7 FHIR. Radovan Tomášik, Šimon Koňár, Niina Eklund, Cäcilia Engels, Zdenka Dudova, Radoslava Kacová, Roman Hrstka, Petr Holub. Journal of Biomedical Informatics 171 (2025), Article 104919. DOI: 10.1016/j.jbi.2025.104919.
Pub Date: 2025-11-01; Epub Date: 2025-10-04; DOI: 10.1016/j.jbi.2025.104925
Luke Stevens , Nan Kennedy , Rob J. Taylor , Adam Lewis , Frank E. Harrell Jr , Matthew S. Shotwell , Emily S. Serdoz , Gordon R. Bernard , Wesley H. Self , Christopher J. Lindsell , Paul A. Harris , Jonathan D. Casey
Objective
Since 2012, the electronic data capture platform REDCap has included an embedded randomization module allowing a single randomization per study record with the ability to stratify by variables such as study site and participant sex at birth. In recent years, platform, adaptive, decentralized, and pragmatic trials have gained popularity. These trial designs often require approaches to randomization not supported by the original REDCap randomization module, including randomizing patients into multiple domains or at multiple points in time, changing allocation tables to add or drop study groups, or adaptively changing allocation ratios based on data from previously enrolled participants. Our team aimed to develop new randomization functions to address these issues.
Methods
A collaborative process facilitated by the NIH-funded Trial Innovation Network was initiated to modernize the randomization module in REDCap, incorporating feedback from clinical trialists, biostatisticians, technologists, and other experts.
Results
This effort led to the development of an advanced randomization module within the REDCap platform. In addition to supporting platform, adaptive, decentralized, and pragmatic trials, the new module introduces several new features, such as improved support for blinded randomization, additional randomization metadata capture (e.g., user identity and timestamp), additional tools allowing REDCap administrators to support investigators using the randomization module, and the ability for clinicians participating in pragmatic or decentralized trials to perform randomization through a survey without needing log-in access to the study database. As of June 19, 2025, multiple randomizations have been used in 211 projects from 55 institutions, randomizations with real-time trigger logic in 108 projects from 64 institutions, and blinded group allocation in 24 projects from 17 institutions.
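The kind of allocation such a module automates can be sketched with classic permuted-block randomization, stratified by site. This is a generic illustration of the technique, not REDCap's implementation; `block_randomize`, the seeds, and the site names are all illustrative.

```python
import random

def block_randomize(n_blocks, groups=("A", "B"), block_size=4, seed=0):
    """Permuted-block allocation: each block holds an equal number of
    assignments per group, shuffled, so arm sizes never drift far apart."""
    rng = random.Random(seed)  # fixed seed makes the sequence reproducible/auditable
    per_group = block_size // len(groups)
    sequence = []
    for _ in range(n_blocks):
        block = list(groups) * per_group
        rng.shuffle(block)
        sequence.extend(block)
    return sequence

# A separate sequence per stratum (e.g., study site) keeps balance within strata.
allocations = {site: block_randomize(5, seed=i)
               for i, site in enumerate(["site1", "site2"])}
for seq in allocations.values():
    assert seq.count("A") == seq.count("B") == 10
```

Production systems add concealment (the sequence is hidden until assignment), audit metadata, and support for adaptive ratios, which is precisely what the module described above provides on top of the basic scheme.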
Conclusion
The new randomization module aims to streamline the randomization process, improve trial efficiency, and ensure robust data integrity, thereby supporting the conduct of more sophisticated and adaptive clinical trials.
A REDCap advanced randomization module to meet the needs of modern trials. Journal of Biomedical Informatics 171 (2025), Article 104925.
Pub Date: 2025-11-01; Epub Date: 2025-10-23; DOI: 10.1016/j.jbi.2025.104930
Jianfu Li , Yiming Li , Zenan Sun , Evan Yu , Ahmed M. Abdelhameed , Weiguo Cao , Haifang Li , Jianping He , Pengze Li , Jingna Feng , Yue Yu , Xinyue Hu , Manqi Li , Rakesh Kumar , Yifang Dang , Fang Li , Shahyar M Gharacholou , Cui Tao
Objective
Multimodal large language models (LLMs) offer new potential for enhancing cardiovascular decision support, particularly in interpreting echocardiographic data. This study systematically evaluates and benchmarks foundation models from diverse domains on echocardiogram-based tasks to assess their effectiveness, limitations and potential in clinical cardiovascular applications.
Methods
We curated three cardiovascular imaging datasets—EchoNet-Dynamic, TMED2, and an expert-annotated echocardiogram (TTE) dataset—to evaluate performance on four critical tasks: (1) cardiac function evaluation through ejection fraction (EF) prediction, (2) cardiac view classification, (3) aortic stenosis (AS) severity assessment, and (4) cardiovascular disease classification. We evaluated six multimodal LLMs: EchoClip (cardiovascular-specific), BiomedGPT and LLaVA-Med (medical-domain), and MiniCPM-V 2.6, LLaMA-3-Vision-Alpha, and Gemini-1.5 (general-domain). Models were assessed using zero-shot, few-shot, and fine-tuning strategies, where applicable. Performance was measured using mean absolute error (MAE) and root mean squared error (RMSE) for EF prediction, and accuracy, precision, recall, and F1 score for classification tasks.
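The regression metrics used for EF prediction are standard. For reference, a minimal sketch of MAE and RMSE over paired true/predicted EF values (the numbers below are made up for illustration, not the study's data):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

ef_true = [55.0, 60.0, 35.0, 45.0]   # hypothetical ejection fractions (%)
ef_pred = [50.0, 62.0, 40.0, 45.0]
assert round(mae(ef_true, ef_pred), 2) == 3.0
assert round(rmse(ef_true, ef_pred), 3) == 3.674
```

Because RMSE squares the residuals, a model with an occasional large EF miss shows a wider RMSE–MAE gap, which is useful when comparing zero-shot and fine-tuned models.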
Results
Domain-specific models such as EchoClip demonstrated the strongest zero-shot performance in EF prediction, achieving an MAE of 10.34. General-domain models showed limited effectiveness without adaptation, with MiniCPM-V 2.6 reporting an MAE of 251.92. Fine-tuning significantly improved outcomes; for example, MiniCPM-V 2.6's MAE decreased to 31.93, and view classification accuracy increased from 20% to 63.05%. In classification tasks, EchoClip achieved F1 scores of 0.2716 for AS severity and 0.4919 for disease classification but exhibited limited performance in view classification (F1 = 0.1457). Few-shot learning yielded modest gains but was generally less effective than fine-tuning.
Conclusions
This evaluation and benchmarking study demonstrated the importance of domain-specific pretraining and model adaptation in cardiovascular decision support tasks. Cardiovascular-focused models and fine-tuned general-domain models achieved superior performance, especially for complex assessments such as EF estimation. These findings offer critical insights into the current capabilities and future directions for clinically meaningful AI integration in cardiovascular medicine.
Exploring multimodal large language models on transthoracic Echocardiogram (TTE) tasks for cardiovascular decision support. Journal of Biomedical Informatics 171 (2025), Article 104930.
Pub Date: 2025-11-01; Epub Date: 2025-10-10; DOI: 10.1016/j.jbi.2025.104926
Yanlei Kang , Haoyu Zhuang , Yunliang Jiang , Zhong Li
The prediction of drug–target interactions (DTIs) and binding affinities (DTAs) plays a pivotal role in drug discovery and design. However, most existing methods fail to fully exploit the rich multimodal information inherent in molecular structures. In this study, we propose a multimodal feature fusion model, MF-DTA. On the representational level, MF-DTA introduces the molecular fragment graph, generated via BRICS-based decomposition, as a novel modality. This representation enables a more intuitive capture of the structural characteristics and pharmacophore-related information of drug molecules. In terms of model architecture, a deformable convolutional layer is applied to the protein residue–residue contact map (hereafter referred to as the contact map) to flexibly adjust the distribution of sampling points and enhance the representational capability. To effectively integrate the multimodal information from both drug and target branches, a mixture-of-experts (MoE)-based multihead attention mechanism is employed for local fusion, while a dual-decoder architecture facilitates cross-modal interaction between drug and target features. The final output yields a high-quality prediction of binding affinity. Cross-validation experiments conducted on several benchmark datasets demonstrate that MF-DTA consistently outperforms state-of-the-art methods. Specifically, it achieves CI improvements of 0.1%, 0.5%, and 0.3% over the best-performing baseline models on the Davis, KIBA, and BindingDB datasets, respectively, and exceeds traditional models by 1% to 2% on average. The model also ranks among the best performers in terms of the MSE and Rm² metrics. Model visualization further supports its interpretability, confirming that it successfully learns meaningful drug–target interaction patterns. To further assess the practical utility of the proposed model, we apply it to screen potential candidate compounds from a natural product library targeting tubulin.
In summary, MF-DTA offers not only accurate and robust binding affinity prediction capabilities but also strong interpretability, making it a powerful and practical tool for drug design and target identification.
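The mixture-of-experts fusion idea can be sketched in miniature: a softmax gate weights each expert's output, and the weighted outputs are summed. This is a generic MoE gating illustration, not MF-DTA's MoE-based multihead attention; `moe_combine` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_combine(gate_logits, expert_outputs):
    """Combine expert output vectors with softmax gate weights.
    Generic sketch: the paper applies gating to attention features,
    not raw vectors as here."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, expert_outputs))
            for i in range(dim)]

# Equal logits give equal weights, so the result is the experts' average.
out = moe_combine([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
assert out == [2.0, 3.0]
```

The gate lets the model route drug-branch versus target-branch features to whichever expert handles them best, which is the rationale for using MoE in local fusion.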
MF-DTA: Predicting drug–target affinity with multi-modal feature fusion model. Journal of Biomedical Informatics 171 (2025), Article 104926.
Pub Date: 2025-11-01; Epub Date: 2025-10-13; DOI: 10.1016/j.jbi.2025.104924
Xinyao Liu , Junchang Xin , Qi Shen , Zhihong Huang , Zhiqiong Wang
Objective:
Radiology reports provide important references for physicians' treatment decisions by including descriptions and diagnostic results of imaging. Automatic generation of radiology reports reduces the workload of physicians and significantly improves work efficiency. However, existing report generation methods use image-to-text conversion to generate reports directly from medical images and fail to fully simulate the radiologist's diagnostic process of "examine first, describe later". As a result, existing methods often generate only generic normal descriptions and struggle to accurately describe specific lesion features.
Methods:
To address this issue, we mimic the working mode of radiologists by first checking whether the patient suffers from a certain disease, and then using the learned medical knowledge to describe the images to form a report. We propose a soft label-guided transformer (SLGT) for radiology report generation. Firstly, the pseudo-labels of the samples are obtained, and the soft label-guided attention mechanism is utilized to highlight features related to the disease labels in the encoding stage. Secondly, text features from the decoding phase and image features are aligned, and the generated text features are used to guide the potential representations. Finally, a hybrid loss is designed that includes losses for text generation, disease classification, and visual-textual alignment. Optimization of SLGT using the hybrid loss allows the model to learn richer features that are more relevant to disease abnormalities, which improves the performance of the model.
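The hybrid objective is a weighted combination of the three losses named above. A minimal sketch, assuming placeholder weights (the abstract does not give the actual coefficients):

```python
def hybrid_loss(gen_loss, cls_loss, align_loss, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of text-generation, disease-classification, and
    visual-textual alignment losses. The weights here are illustrative
    placeholders, not SLGT's actual coefficients."""
    w_gen, w_cls, w_align = weights
    return w_gen * gen_loss + w_cls * cls_loss + w_align * align_loss

# Example: generation loss dominates; the auxiliary losses regularize it.
assert hybrid_loss(2.0, 1.0, 1.0) == 3.0
```

Training against such a sum lets gradients from the classification and alignment terms push the encoder toward disease-relevant features even though the primary target is report text.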
Results:
The proposed SLGT is evaluated on the widely used IU X-ray, MIMIC-CXR, and COV-CTR datasets. The experiments show that the proposed model SLGT outperforms the previous state-of-the-art models on three datasets.
Conclusion:
This work improves the performance of automatically generating medical reports, making their application in computer-aided diagnosis feasible.
Soft label-guided transformer for radiology report generation. Journal of Biomedical Informatics 171 (2025), Article 104924.
Pub Date : 2025-11-01Epub Date: 2025-10-22DOI: 10.1016/j.jbi.2025.104944
Weiru Fu , Hao Li , Ling Luo, Hongfei Lin
Objective:
Adverse Drug Event (ADE) extraction from social media is a critical yet challenging task due to the semantic similarity between adverse effects and therapeutic indications, as well as the prevalence of overlapping and discontinuous mentions often caused by comorbid conditions. This study aims to develop a robust model for accurate ADE extraction from noisy and irregular social media texts.
Methods:
We propose ADENER, a grid-tagging architecture that models ADE extraction as multi-label word-pair classification. ADENER incorporates two core encoding mechanisms: the convolutional capture layer fuses multi-dimensional textual features, captures long-range word-pair dependencies via dilated convolutions, and enhances interactions through semantic association matrices for social media text irregularities; the syntactic affine layer integrates path-level dependency information to enhance global logic understanding, enabling the model to distinguish between therapeutic symptom entities and ADE entities through syntactic cues. The decoding stage uses four-type relational labels to uniformly decode flat, overlapping, and discontinuous ADE mentions.
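The four-type relational labels can be made concrete with a small decoder sketch. This is an illustrative simplification, not ADENER's implementation: the NNW (Next-Neighboring-Word) and THW (Tail-Head-Word) label names follow the common W2NER-style grid-tagging convention and are assumptions here.

```python
# Illustrative decoder for a word-pair tagging grid (simplified sketch).
# NNW/THW label semantics follow the common W2NER-style convention and are
# an assumption, not ADENER's exact label set.
def decode_mentions(n_words, nnw, thw):
    """nnw: set of (i, j) pairs meaning word j is the next word inside a
    mention after word i (Next-Neighboring-Word).
    thw: set of (tail, head) pairs marking a mention whose first word is
    `head` and last word is `tail` (Tail-Head-Word)."""
    heads = {h for (_, h) in thw}
    mentions = set()

    def walk(path):
        last = path[-1]
        if (last, path[0]) in thw:
            mentions.add(tuple(path))          # chain closed: complete mention
        for j in range(n_words):
            if (last, j) in nnw and j not in path:
                walk(path + [j])               # extend the word chain

    for h in heads:
        walk([h])
    return sorted(mentions)
```

For tokens like `["bad", "muscle", "and", "joint", "pain"]`, the pairs `nnw = {(1, 4), (3, 4)}` and `thw = {(4, 1), (4, 3)}` recover both the discontinuous mention "muscle pain" (indices 1, 4) and the overlapping "joint pain" (indices 3, 4), which is how a word-pair grid handles flat, overlapping, and discontinuous mentions uniformly.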
Results:
We evaluated ADENER on three widely used ADE extraction datasets: CADEC, CADECv2, and SMM4H. The model achieved F1 scores of 74.64%, 77.97%, and 61.73% on these datasets, respectively, outperforming all compared baseline models while maintaining competitive computational efficiency. The results demonstrate the effectiveness of our model in addressing the challenges posed by irregular and noisy social media data.
Conclusion:
ADENER offers a unified and effective solution for ADE extraction from social media, capable of handling flat, overlapping, and discontinuous entity mentions and correctly distinguishing ADE entities from therapeutic symptom entities. By incorporating convolutional capture layers for semantic word-pair interactions and syntactic affine layers for dependency-based logic understanding, our approach significantly improves extraction accuracy, providing a valuable tool for pharmacovigilance research and real-world drug safety monitoring.
ADENER: A syntax-augmented grid-tagging model for Adverse Drug Event extraction in social media. Journal of Biomedical Informatics, Volume 171, Article 104944.
Pub Date : 2025-11-01Epub Date: 2025-10-17DOI: 10.1016/j.jbi.2025.104929
Kassem Anis Bouali, Elena Šikudová
Objective:
Early diagnosis of Alzheimer’s disease depends on accessible cognitive assessments, such as the Rey-Osterrieth Complex Figure (ROCF) test. However, manual scoring of this test is labor-intensive and subjective, which introduces experimental biases. Additionally, deep learning models face challenges due to the limited availability of annotated clinical data, particularly for assessments like the ROCF test. This scarcity of data restricts model generalization and exacerbates domain shifts across different populations.
Methods:
We propose a novel framework comprising a data synthesis pipeline and ROCF-Net, a deep learning model specifically designed for ROCF scoring. The synthesis pipeline is lightweight and capable of generating realistic, diverse, and annotated ROCF drawings. ROCF-Net, on the other hand, is a cross-domain scoring model engineered to address domain discrepancies in stroke texture and line artifacts. It maintains high scoring accuracy through a novel line-specific attention mechanism tailored to the unique characteristics of ROCF drawings.
Results:
Unlike conventional synthetic medical imaging methods, our approach generates ROCF drawings that accurately reflect Alzheimer’s-specific abnormalities with minimal computational cost. Our scoring model achieves state-of-the-art (SOTA) performance across differently sourced datasets, with a Mean Absolute Error (MAE) of 3.53 and a Pearson Correlation Coefficient (PCC) of 0.86. This demonstrates both high predictive accuracy and computational efficiency, outperforming existing ROCF scoring methods that rely on Convolutional Neural Networks (CNNs) while avoiding the overhead of parameter-heavy transformer models. We also show that training on our synthetic data generalizes as well as training on real clinical data, with only a minimal performance difference (MAE differed by 1.43 and PCC by 0.07), indicating no statistically significant gap.
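The two reported metrics, MAE and PCC, follow their standard definitions; a minimal sketch of how they are computed over predicted versus clinician-assigned ROCF scores (the function names are illustrative, not the paper's code):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: average magnitude of the scoring errors.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def pcc(y_true, y_pred):
    # Pearson Correlation Coefficient between predicted and true scores.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```

A lower MAE means smaller absolute scoring error on the 0–36 ROCF scale, while a PCC near 1 means the model preserves the ranking and spread of clinician scores; the two are complementary, since a model can have a low MAE yet poor correlation on a narrow score range.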
Conclusion:
Our work introduces four contributions: (1) a cost-effective pipeline for generating synthetic ROCF data, reducing dependency on clinical datasets; (2) a domain-agnostic model for automated ROCF scoring across diverse drawing styles; (3) a lightweight attention mechanism aligning model decisions with clinical scoring for transparency; and (4) a bias-aware framework using synthetic data to reduce demographic disparities, promoting fair cognitive assessment across populations.
Synthetic-to-real attentive deep learning for Alzheimer’s assessment: A domain-agnostic framework for ROCF scoring. Journal of Biomedical Informatics, Volume 171, Article 104929.
Pub Date : 2025-11-01Epub Date: 2025-10-08DOI: 10.1016/j.jbi.2025.104923
Qiuyang Feng , Xiao Huang
Drug–drug interactions (DDIs) are a major concern in healthcare, as concurrent drug use can cause severe adverse effects. Existing machine learning methods often neglect data imbalance and DDI directionality, limiting clinical reliability. To overcome these issues, we employed the GPT-4o large language model to convert free-text DDI descriptions into structured triplets for directionality analysis and applied SMOTE to alleviate class imbalance. Using four key drug features (molecular fingerprints, enzymes, pathways, and targets), our deep neural network (DNN) achieved 88.9% accuracy and showed an average AUPR gain of 0.68 for minority classes attributable to SMOTE. By applying attention-based feature-importance analysis, we demonstrated that the most influential feature in the DNN model was supported by pharmacological evidence. These results demonstrate the effectiveness of our framework for accurate and robust DDI prediction. The source code and data are available at https://github.com/FrankFengF/Drug-drug-interaction-prediction-
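The core SMOTE idea, synthesizing minority-class samples by interpolating between a sample and one of its nearest minority-class neighbors, can be sketched in a few lines. This is a minimal illustrative implementation in NumPy, assuming standard SMOTE behavior; the study would in practice use a library implementation such as imbalanced-learn's `SMOTE`.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    randomly picked sample toward one of its k nearest minority-class
    neighbors (the SMOTE idea; illustrative sketch, not imbalanced-learn)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, float)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances to all other minority samples.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nbrs = np.argsort(d)[:k]           # k nearest minority neighbors
        j = rng.choice(nbrs)
        lam = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(out)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling stays inside the minority-class feature region rather than duplicating points, which is what drives the AUPR gains on rare DDI classes.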
Multi-feature machine learning for enhanced drug–drug interaction prediction. Journal of Biomedical Informatics, Volume 171, Article 104923.
Pub Date : 2025-11-01Epub Date: 2025-10-11DOI: 10.1016/j.jbi.2025.104931
Tien-Yu Chang , Qinglin Gou , Leyi Zhao , Tiancheng Zhou , Hongyu Chen , Dong Yang , Huiwen Ju , Kaleb E. Smith , Chengkun Sun , Jinqian Pan , Yu Huang , Xing He , Xuhong Zhang , Daguang Xu , Jie Xu , Jiang Bian , Aokun Chen
Objective
Lung cancer is the most prevalent cancer and the leading cause of cancer-related death in the United States. Lung cancer screening with low-dose computed tomography (LDCT) helps identify lung cancer at an early stage and thus improves overall survival. The growing adoption of LDCT screening has increased radiologists’ workload and demands specialized training to accurately interpret LDCT images and report findings. Advances in artificial intelligence (AI), including large language models (LLMs) and vision models, could help reduce this burden and improve accuracy.
Methods
We devised LUMEN (Lung cancer screening with Unified Multimodal Evaluation and Navigation), a multimodal AI framework that mimics the radiologist’s workflow by identifying nodules in LDCT images, generating their characteristics, and drafting corresponding radiology reports in accordance with reporting guidelines. LUMEN integrates computer vision, vision-language models (VLMs), and LLMs. To assess our system, we developed a benchmarking framework to evaluate the generated lung cancer screening reports against the findings and management criteria outlined in the Lung Imaging Reporting and Data System (Lung-RADS). The framework extracts these findings from radiology reports and measures clinical accuracy, focusing on information that is clinically important for lung cancer screening, independently of report format.
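One piece of such a format-independent extraction step can be sketched with a regular expression that pulls a Lung-RADS category out of free-text report wording. This is a hedged illustration only: the pattern, the function name, and the assumption that reports phrase the category as, e.g., "Lung-RADS category 4B" are mine, not the paper's implementation.

```python
import re

def extract_lungrads(report):
    """Pull a Lung-RADS category (e.g. '2', '3', '4A', '4B', '4X') from
    free-text report wording; returns None if no category is stated.
    Illustrative sketch only, assuming common phrasings like
    'Lung-RADS category 4B' or 'Lung-RADS: 2'."""
    m = re.search(
        r"lung[- ]?rads(?:\s*(?:category|score))?\s*[:#]?\s*([0-4][ABX]?)",
        report,
        flags=re.IGNORECASE,
    )
    return m.group(1).upper() if m else None
```

Evaluating on extracted structured elements like this, rather than on surface text overlap, is what lets the benchmark score clinical accuracy independently of how a report is worded or formatted.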
Results
This benchmarking framework complements existing LLM/VLM semantic-accuracy metrics and provides a more comprehensive view of system performance. Our lung cancer screening report generation system outperformed contemporary VLM systems, including M3D CT2Report. Furthermore, compared to standard LLM metrics, the clinical metrics we designed for lung cancer screening more accurately reflect the clinical utility of the generated reports.
Conclusion
LUMEN demonstrates the feasibility of generating clinically accurate lung nodule reports from LDCT images through a nodule-centric VQA approach, highlighting the potential of integrating VLMs and LLMs to support radiologists in lung cancer screening workflows. Our findings also underscore the importance of applying clinically meaningful evaluation metrics in developing medical AI systems.
From image to report: automating lung cancer screening interpretation and reporting with vision-language models. Journal of Biomedical Informatics, Volume 171, Article 104931.