Journal of Biomedical Informatics最新文献_第9页

Overcoming data challenges through enriched validation and targeted sampling to measure whole-person health in electronic health records 通过丰富的验证和有针对性的抽样来克服数据挑战，以测量电子健康记录中的整个人的健康。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-02 DOI: 10.1016/j.jbi.2025.104904

Sarah C. Lotspeich , Sheetal Kedar , Rabeya Tahir , Aidan D. Keleghan , Amelia Miranda , Stephany N. Duda , Michael P. Bancks , Brian J. Wells , Ashish K. Khanna , Joseph Rigdon

Objective:

The allostatic load index (ALI) is a 10-component composite measure of whole-person health, which reflects the multiple interrelated physiological regulatory systems that underlie healthy functioning. Data from electronic health records (EHR) present a huge opportunity to operationalize the ALI in learning health systems; however, these data are prone to missingness and errors. Validation (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients’ data can be validated, and most protocols do not recover missing data.

Methods:

Using a representative sample of 1000 patients from the EHR at an extensive learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of ALI and healthcare utilization. Employing semiparametric maximum likelihood estimation, we robustly incorporate all available patient information into statistical models. Using targeted design strategies, we examine ways to select the most informative patients for validation. Incorporating clinical expertise, we devise a novel validation protocol to promote EHR data quality and completeness.

Results:

Chart reviews uncovered few errors (99% matched source documents) and recovered some missing data through auxiliary information in patients’ charts. On average, validation increased the number of non-missing ALI components per patient from 6 to 7. Through simulations based on preliminary data, residual sampling was identified as the most informative strategy for completing our validation study. Incorporating validation data, statistical models indicated that worse whole-person health (higher ALI) was associated with higher odds of engaging in the healthcare system, adjusting for age.

Conclusion:

Targeted validation with an enriched protocol can ensure the quality and promote the completeness of EHR data. Findings from our validation study were incorporated into analyses as we operationalize the ALI as a scalable whole-person health measure that predicts healthcare utilization in the learning health system.

目的：适应负荷指数（ALI）是一个由10个成分组成的整体人体健康指标，它反映了健康功能背后的多个相互关联的生理调节系统。来自电子健康记录（EHR）的数据为在学习卫生系统中实施ALI提供了巨大的机会；然而，这些数据容易丢失和错误。验证（例如，通过图表审查）可以提供更高质量的数据，但实际上，只有一小部分患者的数据可以被验证，而且大多数方案不能恢复丢失的数据。方法：从一个广泛的学习型卫生系统的电子病历中选取1000名患者作为代表性样本（其中100人可以被验证），我们提出了设计、实施和分析ALI和医疗保健利用的统计有效和稳健研究的方法。采用半参数最大似然估计，我们稳健地将所有可用的患者信息纳入统计模型。使用有针对性的设计策略，我们检查了选择最具信息性的患者进行验证的方法。结合临床专业知识，我们设计了一种新的验证方案，以提高电子病历数据的质量和完整性。结果：图表审核发现的错误很少（99%与源文档匹配），并通过患者图表中的辅助信息恢复了一些缺失的数据。平均而言，验证将每位患者非缺失ALI成分的数量从6个增加到7个。通过基于初步数据的模拟，残差抽样被确定为完成我们验证研究的最具信息性的策略。结合验证数据，统计模型表明，整体健康状况较差（ALI较高）与参与医疗保健系统的几率较高相关，并根据年龄进行调整。结论：采用丰富的方案进行有针对性的验证，可以保证电子病历数据的质量，提高数据的完整性。我们验证研究的结果被纳入分析，因为我们将ALI作为可扩展的全人健康测量来预测学习健康系统中的医疗保健利用。

{"title":"Overcoming data challenges through enriched validation and targeted sampling to measure whole-person health in electronic health records","authors":"Sarah C. Lotspeich , Sheetal Kedar , Rabeya Tahir , Aidan D. Keleghan , Amelia Miranda , Stephany N. Duda , Michael P. Bancks , Brian J. Wells , Ashish K. Khanna , Joseph Rigdon","doi":"10.1016/j.jbi.2025.104904","DOIUrl":"10.1016/j.jbi.2025.104904","url":null,"abstract":"<div><h3>Objective:</h3><div>The allostatic load index (ALI) is a 10-component composite measure of whole-person health, which reflects the multiple interrelated physiological regulatory systems that underlie healthy functioning. Data from electronic health records (EHR) present a huge opportunity to operationalize the ALI in learning health systems; however, these data are prone to missingness and errors. Validation (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients’ data can be validated, and most protocols do not recover missing data.</div></div><div><h3>Methods:</h3><div>Using a representative sample of 1000 patients from the EHR at an extensive learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of ALI and healthcare utilization. Employing semiparametric maximum likelihood estimation, we robustly incorporate all available patient information into statistical models. Using targeted design strategies, we examine ways to select the most informative patients for validation. Incorporating clinical expertise, we devise a novel validation protocol to promote EHR data quality and completeness.</div></div><div><h3>Results:</h3><div>Chart reviews uncovered few errors (99% matched source documents) and recovered some missing data through auxiliary information in patients’ charts. On average, validation increased the number of non-missing ALI components per patient from 6 to 7. Through simulations based on preliminary data, residual sampling was identified as the most informative strategy for completing our validation study. Incorporating validation data, statistical models indicated that worse whole-person health (higher ALI) was associated with higher odds of engaging in the healthcare system, adjusting for age.</div></div><div><h3>Conclusion:</h3><div>Targeted validation with an enriched protocol can ensure the quality and promote the completeness of EHR data. Findings from our validation study were incorporated into analyses as we operationalize the ALI as a scalable whole-person health measure that predicts healthcare utilization in the learning health system.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"170 ","pages":"Article 104904"},"PeriodicalIF":4.5,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145000662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Corrigendum to “A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses”. [J. Biomed. Inform. 162 (2025) 104771] “协调NHS苏格兰实验室数据以实现国家级分析的管道”的勘误。[J。生物医学。通报。162(2025)104771]。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-01 DOI: 10.1016/j.jbi.2025.104891

Chuang Gao , Shahzad Mumtaz , Sophie McCall , Katherine O’Sullivan , Mark McGilchrist , Daniel R. Morales , Christopher Hall , Katie Wilde , Charlie Mayor , Pamela Linksted , Kathy Harrison , Christian Cole , Emily Jefferson

Objective

Medical laboratory data together with prescribing and hospitalisation records are three of the most used electronic health records (EHRs) for data-driven health research. In Scotland, hospitalisation, prescribing and the death register data are available nationally whereas laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users of this regionally curated data, working on laboratory datasets across regional cohorts requires effort and time. As part of this study, the Scottish Safe Haven Network have developed an open-source software pipeline to generate a harmonised laboratory dataset.

Methods

We obtained sample laboratory data from the four regional Safe Havens in Scotland covering people within the SHARE consented cohort. We compared the variables collected by each regional Safe Haven and mapped these to 11 FHIR and 2 Scottish-specific standardised terms (i.e., one to indicate the regional health board and a second to describe the source clinical code description)

Results

We compared the laboratory data and found that 182 test codes covered 98.7 % of test records performed across Scotland. Focusing on the 182 test codes, we developed a set of transformations to convert test results captured in different units to the same unit. We included both Read Codes and SNOMED CT to encode the tests within the pipeline.

Conclusion

We validated our harmonisation pipeline by comparing the results across the different regional datasets. The pipeline can be reused by researchers and/or Safe Havens to generate clean, harmonised laboratory data at a national level with minimal effort.

目的：医学实验室数据与处方和住院记录是数据驱动的健康研究中使用最多的三种电子健康记录（EHRs）。在苏格兰，住院、处方和死亡登记数据在全国范围内可获得，而实验室数据是从地方卫生委员会系统获取、存储和报告的，存在显著的异质性。对于研究人员或该区域整理数据的其他用户来说，处理跨区域队列的实验室数据集需要精力和时间。作为这项研究的一部分，苏格兰安全港网络开发了一个开源软件管道来生成一个统一的实验室数据集。方法：我们从苏格兰的四个区域安全港获得样本实验室数据，涵盖了SHARE同意队列中的人群。我们比较了每个区域安全港收集的变量，并将这些变量映射到11个FHIR和2个苏格兰特定的标准化术语（即，一个表示区域卫生委员会，另一个描述源临床代码描述）结果：我们比较了实验室数据，发现182个测试代码覆盖了整个苏格兰98.7%的测试记录。专注于182个测试代码，我们开发了一组转换，将在不同单元中捕获的测试结果转换为相同的单元。我们包括了Read Codes和SNOMED CT来对管道中的测试进行编码。结论：我们通过比较不同区域数据集的结果验证了我们的协调管道。研究人员和/或安全港可以重复使用该管道，以最小的努力在国家一级生成干净、统一的实验室数据。

{"title":"Corrigendum to “A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses”. [J. Biomed. Inform. 162 (2025) 104771]","authors":"Chuang Gao , Shahzad Mumtaz , Sophie McCall , Katherine O’Sullivan , Mark McGilchrist , Daniel R. Morales , Christopher Hall , Katie Wilde , Charlie Mayor , Pamela Linksted , Kathy Harrison , Christian Cole , Emily Jefferson","doi":"10.1016/j.jbi.2025.104891","DOIUrl":"10.1016/j.jbi.2025.104891","url":null,"abstract":"<div><h3>Objective</h3><div>Medical laboratory data together with prescribing and hospitalisation records are three of the most used electronic health records (EHRs) for data-driven health research. In Scotland, hospitalisation, prescribing and the death register data are available nationally whereas laboratory data is captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users of this regionally curated data, working on laboratory datasets across regional cohorts requires effort and time. As part of this study, the Scottish Safe Haven Network have developed an open-source software pipeline to generate a harmonised laboratory dataset.</div></div><div><h3>Methods</h3><div>We obtained sample laboratory data from the four regional Safe Havens in Scotland covering people within the SHARE consented cohort. We compared the variables collected by each regional Safe Haven and mapped these to 11 FHIR and 2 Scottish-specific standardised terms (i.e., one to indicate the regional health board and a second to describe the source clinical code description)</div></div><div><h3>Results</h3><div>We compared the laboratory data and found that 182 test codes covered 98.7 % of test records performed across Scotland. Focusing on the 182 test codes, we developed a set of transformations to convert test results captured in different units to the same unit. We included both Read Codes and SNOMED CT to encode the tests within the pipeline.</div></div><div><h3>Conclusion</h3><div>We validated our harmonisation pipeline by comparing the results across the different regional datasets. The pipeline can be reused by researchers and/or Safe Havens to generate clean, harmonised laboratory data at a national level with minimal effort.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"169 ","pages":"Article 104891"},"PeriodicalIF":4.5,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144835216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Identifying task groupings for multi-task learning using pointwise V-usable information. 使用点v可用信息识别多任务学习的任务分组。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-01 Epub Date: 2025-07-16 DOI: 10.1016/j.jbi.2025.104881

Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova

Objective: Even in the era of Large Language Models (LLMs) which are claimed to be solutions for many tasks, fine-tuning language models remains a core methodology used in deployment for a variety of reasons - computational efficiency and performance maximization among them. Fine-tuning could be single-task or multi-task joint learning where the tasks support each other thus boosting their performance. The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose such a metric.

Methods: We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama and GPT-4.

Results: The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.

Conclusion: For domain-specific tasks, finetuned models may remain a preferable option, and the PVI-based method of grouping tasks for multi-task learning could be particularly beneficial. This metric could be wrapped in the overall recipe of fine-tuning language models.

目的：即使在大型语言模型（llm）时代，它被认为是许多任务的解决方案，微调语言模型仍然是部署中使用的核心方法，原因有很多，其中包括计算效率和性能最大化。微调可以是单任务或多任务联合学习，其中任务相互支持从而提高它们的表现。多任务学习的成功在很大程度上取决于哪些任务被组合在一起。天真地将所有任务或一组随机任务分组可能会导致负迁移，多任务模型的表现比单任务模型更差。尽管在识别任务分组和测量不同任务之间的相关性方面已经做出了许多努力，但定义一个度量来从许多潜在的任务组合池中识别最佳任务分组仍然是一个具有挑战性的研究课题。我们提出这样一个度量标准。方法：提出了一种基于点向v可用信息（PVI）测量任务难度的任务相关性度量方法。PVI是最近提出的一个度量，用于估计给定模型的数据集包含多少可用信息。我们假设PVI估计没有统计学差异的任务足够相似，可以从联合学习过程中受益。我们进行了全面的实验，以评估该指标在一般、生物医学和临床领域的15个NLP数据集上进行任务分组的可行性。我们将联合学习器的结果与单个学习器、现有的基线方法和最近的大型语言模型（包括Llama和GPT-4）进行了比较。结果：结果表明，通过对具有相似PVI估计的任务进行分组，联合学习器在总参数较少的情况下产生了具有竞争力的结果，并且在各个领域的表现一致。结论：对于特定领域的任务，微调模型可能仍然是一个更好的选择，基于pvi的多任务学习任务分组方法可能特别有益。这个度量可以包含在微调语言模型的整体配方中。

{"title":"Identifying task groupings for multi-task learning using pointwise V-usable information.","authors":"Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova","doi":"10.1016/j.jbi.2025.104881","DOIUrl":"10.1016/j.jbi.2025.104881","url":null,"abstract":"<p><strong>Objective: </strong>Even in the era of Large Language Models (LLMs) which are claimed to be solutions for many tasks, fine-tuning language models remains a core methodology used in deployment for a variety of reasons - computational efficiency and performance maximization among them. Fine-tuning could be single-task or multi-task joint learning where the tasks support each other thus boosting their performance. The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose such a metric.</p><p><strong>Methods: </strong>We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks with not statistically different PVI estimates are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama and GPT-4.</p><p><strong>Results: </strong>The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.</p><p><strong>Conclusion: </strong>For domain-specific tasks, finetuned models may remain a preferable option, and the PVI-based method of grouping tasks for multi-task learning could be particularly beneficial. This metric could be wrapped in the overall recipe of fine-tuning language models.</p>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":" ","pages":"104881"},"PeriodicalIF":4.5,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144667675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Digital twins in increasing diversity in clinical trials: A systematic review. 数字双胞胎在临床试验中增加多样性：系统回顾。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-01 Epub Date: 2025-08-08 DOI: 10.1016/j.jbi.2025.104879

Abigail Tubbs, Enrique Alvarez Vazquez

The integration of digital twin (DT) technology and artificial intelligence (AI) into clinical trials holds transformative potential for addressing persistent inequities in participant representation. This systematic review evaluates the role of these technologies in improving diversity, particularly in racial, ethnic, gender, age, and socioeconomic dimensions, minimizing bias, and allowing personalized medicine in clinical research settings. Evidence from 90 studies reveals that digital twins offer dynamic simulation capabilities for trial design, while AI facilitates predictive analytics and recruitment optimization. However, implementation remains hindered by fragmented regulatory frameworks, biased datasets, and infrastructural disparities. Ethical concerns,including privacy, consent, and algorithmic opacity, further complicate the deployment. Inclusive data practices identified in the literature include the use of demographically representative training data, participatory data collection frameworks, and equity audits to detect and correct systemic bias. Fairness in AI and DT models is primarily operationalized through group fairness metrics such as demographic parity and equalized odds, along with fairness, aware model training and validation. Key gaps include the lack of global standards, underrepresentation in model training, and challenges in real-world adoption. To overcome these barriers, the review proposes actionable directions: developing inclusive data practices, harmonizing regulatory oversight, and embedding fairness into computational model design. By focusing on diversity as a design principle, AI and DT technologies can support a more equitable and generalizable future for clinical research.

将数字孪生（DT）技术和人工智能（AI）整合到临床试验中，对于解决参与者代表性方面持续存在的不平等问题具有变革性潜力。本系统综述评估了这些技术在改善多样性方面的作用，特别是在种族、民族、性别、年龄和社会经济方面，最大限度地减少偏见，并在临床研究环境中允许个性化医疗。来自90项研究的证据表明，数字双胞胎为试验设计提供了动态模拟能力，而人工智能则有助于预测分析和招聘优化。然而，实施仍然受到分散的监管框架、有偏见的数据集和基础设施差异的阻碍。伦理问题，包括隐私、同意和算法不透明，使部署进一步复杂化。文献中确定的包容性数据实践包括使用具有人口代表性的培训数据、参与式数据收集框架和公平审计来发现和纠正系统性偏见。AI和DT模型中的公平性主要通过群体公平性指标（如人口均等和均等几率）以及公平性、有意识的模型训练和验证来实现。主要的差距包括缺乏全球标准，模型训练中的代表性不足，以及在现实世界中采用的挑战。为了克服这些障碍，报告提出了可操作的方向：发展包容性数据实践，协调监管监督，将公平性嵌入计算模型设计。通过将多样性作为一项设计原则，人工智能和DT技术可以为临床研究提供更公平和更普遍的未来。

{"title":"Digital twins in increasing diversity in clinical trials: A systematic review.","authors":"Abigail Tubbs, Enrique Alvarez Vazquez","doi":"10.1016/j.jbi.2025.104879","DOIUrl":"10.1016/j.jbi.2025.104879","url":null,"abstract":"<p><p>The integration of digital twin (DT) technology and artificial intelligence (AI) into clinical trials holds transformative potential for addressing persistent inequities in participant representation. This systematic review evaluates the role of these technologies in improving diversity, particularly in racial, ethnic, gender, age, and socioeconomic dimensions, minimizing bias, and allowing personalized medicine in clinical research settings. Evidence from 90 studies reveals that digital twins offer dynamic simulation capabilities for trial design, while AI facilitates predictive analytics and recruitment optimization. However, implementation remains hindered by fragmented regulatory frameworks, biased datasets, and infrastructural disparities. Ethical concerns,including privacy, consent, and algorithmic opacity, further complicate the deployment. Inclusive data practices identified in the literature include the use of demographically representative training data, participatory data collection frameworks, and equity audits to detect and correct systemic bias. Fairness in AI and DT models is primarily operationalized through group fairness metrics such as demographic parity and equalized odds, along with fairness, aware model training and validation. Key gaps include the lack of global standards, underrepresentation in model training, and challenges in real-world adoption. To overcome these barriers, the review proposes actionable directions: developing inclusive data practices, harmonizing regulatory oversight, and embedding fairness into computational model design. By focusing on diversity as a design principle, AI and DT technologies can support a more equitable and generalizable future for clinical research.</p>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":" ","pages":"104879"},"PeriodicalIF":4.5,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144816690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Corrigendum to “A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses” [J. Biomed. Inform. 2025 Feb;162:104771. https://doi.org/10.1016/j.jbi.2024.104771. Epub 2025 Jan 2. PMID: 39755323] “统一NHS苏格兰实验室数据以实现国家级分析的管道”的勘误表[J]。生物医学。通报。2025年2月；162:104771。https://doi.org/10.1016/j.jbi.2024.104771。Epub 2025年1月2日PMID: 39755323)

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-01 DOI: 10.1016/j.jbi.2025.104890

Chuang Gao , Shahzad Mumtaz , Sophie McCall , Katherine O'Sullivan , Mark McGilchrist , Daniel R. Morales , Christopher Hall , Katie Wilde , Charlie Mayor , Pamela Linksted , Kathy Harrison , Christian Cole , Emily Jefferson

引用次数: 0

Learning from multiple data sources for decision making in health care 从多个数据源中学习，用于卫生保健决策。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-09-01 DOI: 10.1016/j.jbi.2025.104892

Fabio Stella (Guest Editors), Francesco Calimeri, Mauro Dragoni

引用次数: 0

WoundcareVQA: A multilingual visual question answering benchmark dataset for wound care WoundcareVQA：伤口护理的多语言视觉问答基准数据集。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-08-29 DOI: 10.1016/j.jbi.2025.104888

Wen-wai Yim , Asma Ben Abacha , Robert Doerning , Chia-Yu Chen , Jiaying Xu , Anita Subbarao , Zixuan Yu , Fei Xia , M. Kennedy Hall , Meliha Yetisgen

Objective:

Introduce the task of wound care multimodal multilingual visual question answering, provide baseline performances, and identify areas of future study.

Methods:

A dataset of wound care multimodal multilingual visual question answering (VQA) was created using consumer health questions asked online. Practicing US medical doctors were tasked with providing metadata and expert responses labels. Several instruct-enabled, multilingual visual question answering models (GPT-4o, Gemini-1.5-Pro, and Qwen-VL) were tested to benchmark performances. Finally, automatic evaluations were tested against domain expert response ratings.

Results:

A multilingual dataset of 477 wound care cases, 768 responses, 748 images, 3k structured data labels, 1362 translation instances, and 10k judgments was constructed (https://osf.io/xsj5u/). Metadata scores ranged from 0.32–0.78 accuracy depending on classification type; response generation performances 0.06 BLEU, 0.66 BERTScore, 0.45 ROUGE-L in English and 0.12 BLEU, 0.69 BERTScore, and 0.50 ROUGE-L in Chinese.

Conclusion:

We construct and explore the tasks of multimodal, multilingual VQA. We hope the work here can inspire further research in wound care metadata classification, VQA response generation, and open response automatic evaluation.

目的：介绍创伤护理多模态多语言视觉问答的任务，提供基线表现，并确定未来的研究领域。方法：利用消费者在线健康问题，建立伤口护理多模式多语言视觉问答（VQA）数据集。美国执业医生的任务是提供元数据和专家回答标签。测试了几种支持指令的多语言视觉问答模型（gpt - 40、Gemini-1.5-Pro和Qwen-VL）的基准性能。最后，根据领域专家的反应等级对自动评估进行了测试。结果：构建了一个包含477个伤口护理案例、768个回复、748张图像、3k个结构化数据标签、1362个翻译实例和10k个判断的多语言数据集（https://osf.io/xsj5u/）。根据分类类型，元数据得分的准确率范围为0.32-0.78；反应生成性能：英语为0.06 BLEU, 0.66 BERTScore, 0.45 ROUGE-L；汉语为0.12 BLEU, 0.69 BERTScore, 0.50 ROUGE-L。结论：我们构建并探索了多模态、多语言的VQA任务。我们希望本文的工作能够对伤口护理元数据分类、VQA反应生成和开放反应自动评估等方面的进一步研究提供启发。

{"title":"WoundcareVQA: A multilingual visual question answering benchmark dataset for wound care","authors":"Wen-wai Yim , Asma Ben Abacha , Robert Doerning , Chia-Yu Chen , Jiaying Xu , Anita Subbarao , Zixuan Yu , Fei Xia , M. Kennedy Hall , Meliha Yetisgen","doi":"10.1016/j.jbi.2025.104888","DOIUrl":"10.1016/j.jbi.2025.104888","url":null,"abstract":"<div><h3>Objective:</h3><div>Introduce the task of wound care multimodal multilingual visual question answering, provide baseline performances, and identify areas of future study.</div></div><div><h3>Methods:</h3><div>A dataset of wound care multimodal multilingual visual question answering (VQA) was created using consumer health questions asked online. Practicing US medical doctors were tasked with providing metadata and expert responses labels. Several instruct-enabled, multilingual visual question answering models (GPT-4o, Gemini-1.5-Pro, and Qwen-VL) were tested to benchmark performances. Finally, automatic evaluations were tested against domain expert response ratings.</div></div><div><h3>Results:</h3><div>A multilingual dataset of 477 wound care cases, 768 responses, 748 images, 3k structured data labels, 1362 translation instances, and 10k judgments was constructed (<span><span>https://osf.io/xsj5u/</span><svg><path></path></svg></span>). Metadata scores ranged from 0.32–0.78 accuracy depending on classification type; response generation performances 0.06 BLEU, 0.66 BERTScore, 0.45 ROUGE-L in English and 0.12 BLEU, 0.69 BERTScore, and 0.50 ROUGE-L in Chinese.</div></div><div><h3>Conclusion:</h3><div>We construct and explore the tasks of multimodal, multilingual VQA. We hope the work here can inspire further research in wound care metadata classification, VQA response generation, and open response automatic evaluation.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"170 ","pages":"Article 104888"},"PeriodicalIF":4.5,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144955644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MedVidDeID: Protecting privacy in clinical encounter video recordings MedVidDeID：在临床遭遇视频记录中保护隐私。

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-08-29 DOI: 10.1016/j.jbi.2025.104901

Sriharsha Mopidevi , Kuk Jin Jang , Basam Alasaly , Sydney Pugh , Jean Park , Ashley Batugo , Sy Hwang , Eric Eaton , Danielle Lee Mowery , Kevin B. Johnson

Objective:

The increasing use of audio-video (AV) data in healthcare has improved patient care, clinical training, and medical and ethnographic research. However, it has also introduced major challenges in preserving patient-provider privacy due to Protected Health Information (PHI) in such data. Traditional de-identification methods are inadequate for AV data, which can reveal identifiable information such as faces, voices, and environmental details. Our goal was to create a pipeline for de-identifying AV healthcare data that minimized the human effort required to guarantee successful de-identification.

Methods:

We combined open-source tools with novel methods and infrastructure into a six-stage pipeline: (1) transcript extraction using WhisperX, (2) transcript de-identification with an adapted PHIlter, (3) audio de-identification through scrubbing, (4) video de-identification using YOLOv11 for pose detection and blurring, (5) recombining de-identified audio and video, and (6) validation and correction via manual quality control (QC). We developed two de-identification strategies to support different tolerances for lossy video images. We evaluated this pipeline using 10 h of simulated clinical AV recordings, comprising nearly 1.1 million video frames and approximately 72,000 words.

Results:

In Precision Privacy Preservation (PPP) mode, MedVidDeId achieved a success rate of 50%, while in Greedy Privacy Preservation (GPP) mode, it achieved a 97.5% success rate. Compared to manual methods for a 15 min video segment, the pipeline reduced de-identification time by 26.7% in PPP and 64.2% in GPP modes.

Conclusion:

The MedVidDeID pipeline offers a viable, efficient hybrid solution for handling AV healthcare data and privacy preservation. Future work will focus on reducing upstream errors at each stage and minimizing the role of the human in the loop.

目的：在医疗保健中越来越多地使用视听（AV）数据，改善了患者护理、临床培训以及医学和人种学研究。然而，由于此类数据中的受保护健康信息（PHI），它也在保护患者-提供者隐私方面带来了重大挑战。传统的去识别方法不适用于自动驾驶数据，因为自动驾驶数据可以显示人脸、声音和环境细节等可识别信息。我们的目标是创建一个去识别AV医疗保健数据的管道，以最大限度地减少保证成功去识别所需的人力。方法：我们将开源工具与新方法和基础设施结合起来，形成了一个六阶段的流水线：(1)使用WhisperX提取文本，(2)使用改编PHIlter进行文本去识别，(3)通过擦除进行音频去识别，(4)使用YOLOv11进行姿态检测和模糊处理进行视频去识别，(5)重新组合去识别的音频和视频，以及(6)通过人工质量控制（QC）进行验证和纠正。我们开发了两种去识别策略来支持有损视频图像的不同容忍度。我们使用10小时的模拟临床AV记录来评估这个管道，包括近110万视频帧和大约72,000个单词。结果：MedVidDeId在精准隐私保护（PPP）模式下的成功率为50%，在贪婪隐私保护（GPP）模式下的成功率为97.5%。与手动方法相比，对于15分钟的视频片段，管道在PPP模式下减少了26.7%的去识别时间，在GPP模式下减少了64.2%。结论：MedVidDeID管道为处理AV医疗数据和隐私保护提供了一种可行、高效的混合解决方案。未来的工作将侧重于减少每个阶段的上游错误，并最大限度地减少人在循环中的作用。

{"title":"MedVidDeID: Protecting privacy in clinical encounter video recordings","authors":"Sriharsha Mopidevi , Kuk Jin Jang , Basam Alasaly , Sydney Pugh , Jean Park , Ashley Batugo , Sy Hwang , Eric Eaton , Danielle Lee Mowery , Kevin B. Johnson","doi":"10.1016/j.jbi.2025.104901","DOIUrl":"10.1016/j.jbi.2025.104901","url":null,"abstract":"<div><h3>Objective:</h3><div>The increasing use of audio-video (AV) data in healthcare has improved patient care, clinical training, and medical and ethnographic research. However, it has also introduced major challenges in preserving patient-provider privacy due to Protected Health Information (PHI) in such data. Traditional de-identification methods are inadequate for AV data, which can reveal identifiable information such as faces, voices, and environmental details. Our goal was to create a pipeline for de-identifying AV healthcare data that minimized the human effort required to guarantee successful de-identification.</div></div><div><h3>Methods:</h3><div>We combined open-source tools with novel methods and infrastructure into a six-stage pipeline: (1) transcript extraction using WhisperX, (2) transcript de-identification with an adapted PHIlter, (3) audio de-identification through scrubbing, (4) video de-identification using YOLOv11 for pose detection and blurring, (5) recombining de-identified audio and video, and (6) validation and correction via manual quality control (QC). We developed two de-identification strategies to support different tolerances for lossy video images. We evaluated this pipeline using 10 h of simulated clinical AV recordings, comprising nearly 1.1 million video frames and approximately 72,000 words.</div></div><div><h3>Results:</h3><div>In Precision Privacy Preservation (PPP) mode, MedVidDeId achieved a success rate of 50%, while in Greedy Privacy Preservation (GPP) mode, it achieved a 97.5% success rate. Compared to manual methods for a 15 min video segment, the pipeline reduced de-identification time by 26.7% in PPP and 64.2% in GPP modes.</div></div><div><h3>Conclusion:</h3><div>The MedVidDeID pipeline offers a viable, efficient hybrid solution for handling AV healthcare data and privacy preservation. Future work will focus on reducing upstream errors at each stage and minimizing the role of the human in the loop.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"170 ","pages":"Article 104901"},"PeriodicalIF":4.5,"publicationDate":"2025-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144955583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Improving large language models for adverse drug reactions named entity recognition via error correction prompt engineering 通过纠错提示工程改进药物不良反应命名实体识别的大型语言模型

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-08-28 DOI: 10.1016/j.jbi.2025.104893

Yunfei Zhang, Wei Liao

The monitoring and analysis of adverse drug reactions (ADRs ) are important for ensuring patient safety and improving treatment outcomes. Accurate identification of drug names, drug components, and ADR entities during named entity recognition (NER) processes is essential for ensuring drug safety and advancing the integration of drug information. Given that existing medical name entity recognition technologies rely on large amounts of manually annotated data for training, they are often less effective when applied to adverse drug reactions due to significant data variability and the high similarity between drug names. This paper proposes a prompt template for ADR that integrates error correction examples. The prompt template includes: 1. Basic prompts with task descriptions, 2. Annotated entity explanations, 3. Annotation guidelines, 4. Annotated samples for few-shot learning, 5. Error correction examples. Additionally, it integrates complex ADR data from the web and constructs a corpus containing three types of entities (drug name, drug components, and adverse drug reactions) using the Begin, Inside, Outside (BIO) annotation method. Finally, we evaluate the effectiveness of each prompt and compare it with the fine-tuned Large Language Model Meta AI (LLaMA) model and the DeepSeek model. Experimental results show that under this prompt template, the F1 score of GPT-3.5 increased from 0.648 to 0.887, and that of GPT-4 increased from 0.757 to 0.921. It is significantly better than the fine-tuned LLaMA model and DeepSeek model. It demonstrates the superiority of the proposed method, and provides a solid foundation for extracting drug-related entity relationships and building knowledge graphs.

药物不良反应（adr）的监测和分析对于确保患者安全和改善治疗效果非常重要。在命名实体识别（NER）过程中，准确识别药品名称、药物成分和ADR实体对于确保药品安全和推进药品信息整合至关重要。鉴于现有的医学名称实体识别技术依赖于大量人工标注的数据进行训练，由于数据的显著可变性和药品名称之间的高度相似性，它们在应用于药物不良反应时往往效果较差。本文提出了一个集成纠错实例的ADR提示模板。提示符模板包括：1。2、基本提示和任务描述。注释实体解释；注释指南；5.少射学习的标注样本；纠错示例。此外，它集成了来自网络的复杂ADR数据，并使用Begin， Inside, Outside （BIO）注释方法构建了一个包含三种类型实体（药物名称，药物成分和药物不良反应）的语料库。最后，我们评估了每个提示的有效性，并将其与经过微调的大型语言模型元AI （LLaMA）模型和DeepSeek模型进行了比较。实验结果表明，在该提示模板下，GPT-3.5的F1评分从0.648提高到0.887，GPT-4的F1评分从0.757提高到0.921。它明显优于经过微调的LLaMA模型和DeepSeek模型。验证了该方法的优越性，为药物相关实体关系的提取和知识图谱的构建提供了坚实的基础。

{"title":"Improving large language models for adverse drug reactions named entity recognition via error correction prompt engineering","authors":"Yunfei Zhang, Wei Liao","doi":"10.1016/j.jbi.2025.104893","DOIUrl":"10.1016/j.jbi.2025.104893","url":null,"abstract":"<div><div>The monitoring and analysis of adverse drug reactions (ADRs ) are important for ensuring patient safety and improving treatment outcomes. Accurate identification of drug names, drug components, and ADR entities during named entity recognition (NER) processes is essential for ensuring drug safety and advancing the integration of drug information. Given that existing medical name entity recognition technologies rely on large amounts of manually annotated data for training, they are often less effective when applied to adverse drug reactions due to significant data variability and the high similarity between drug names. This paper proposes a prompt template for ADR that integrates error correction examples. The prompt template includes: 1. Basic prompts with task descriptions, 2. Annotated entity explanations, 3. Annotation guidelines, 4. Annotated samples for few-shot learning, 5. Error correction examples. Additionally, it integrates complex ADR data from the web and constructs a corpus containing three types of entities (drug name, drug components, and adverse drug reactions) using the Begin, Inside, Outside (BIO) annotation method. Finally, we evaluate the effectiveness of each prompt and compare it with the fine-tuned Large Language Model Meta AI (LLaMA) model and the DeepSeek model. Experimental results show that under this prompt template, the F1 score of GPT-3.5 increased from 0.648 to 0.887, and that of GPT-4 increased from 0.757 to 0.921. It is significantly better than the fine-tuned LLaMA model and DeepSeek model. It demonstrates the superiority of the proposed method, and provides a solid foundation for extracting drug-related entity relationships and building knowledge graphs.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"170 ","pages":"Article 104893"},"PeriodicalIF":4.5,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Strategies for detecting and mitigating dataset shift in machine learning for health predictions: A systematic review 用于健康预测的机器学习中检测和减轻数据集转移的策略：系统综述

IF 4.5 2区医学 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics

Pub Date : 2025-08-26 DOI: 10.1016/j.jbi.2025.104902

Gabriel Ferreira dos Santos Silva , Fabiano Novaes Barcellos Filho , Roberta Moreira Wichmann , Francisco Costa da Silva Junior , Alexandre Dias Porto Chiavegatto Filho

Objective

This review aims to provide a comprehensive overview of the literature on methods and techniques for identifying and correcting dataset shift in machine learning (ML) applications for health predictions.

Methods

A systematic search was conducted across PubMed, IEEE Xplore, Scopus, and Web of Science, targeting articles published between January 1, 2019, and March 15, 2025. earch strings combined terms related to machine learning, healthcare, and dataset shift. A total of 32 studies were included, and were evaluated based on dataset shift types addressed, detection and correction strategies used, algorithmic choices, and reported impacts on model performance.

Results

The review identified a wide range of dataset shift types, with temporal shift and concept drift being the most commonly addressed. Model-based monitoring and statistical tests were the most frequent detection strategies, while retraining and feature engineering were the predominant correction approaches. Most methods demonstrate moderate interpretability, computational feasibility, and generalizability. However, a lack of standardized performance metrics and external validations limited the comparability of results across studies.

Conclusion

While several promising approaches for managing dataset shift in health-related ML models have been proposed, no single method emerged as broadly generalizable across use cases. The implementation of these techniques in real-world clinical workflows remains limited. Future research should prioritize prospective evaluations, subgroup-specific analyses (e.g., by race, age, or geographic region), and integration into clinical decision-support systems to ensure robust and equitable ML deployment in healthcare settings. A structured summary table and conceptual pipeline diagram are provided to support practical adoption.

目的：本综述旨在对用于健康预测的机器学习（ML）应用中识别和纠正数据集移位的方法和技术的文献进行全面概述。方法系统检索PubMed、IEEE explore、Scopus和Web of Science，检索2019年1月1日至2025年3月15日之间发表的论文。搜索与机器学习、医疗保健和数据集转移相关的组合术语的字符串。共纳入了32项研究，并根据所处理的数据集移位类型、使用的检测和校正策略、算法选择以及对模型性能的影响进行了评估。结果该综述确定了广泛的数据集移位类型，其中时间移位和概念漂移是最常见的。基于模型的监测和统计测试是最常见的检测策略，而再训练和特征工程是主要的校正方法。大多数方法表现出适度的可解释性、计算可行性和通用性。然而，缺乏标准化的性能指标和外部验证限制了研究结果的可比性。虽然已经提出了几种有前途的方法来管理与健康相关的机器学习模型中的数据集转移，但没有一种方法可以在用例中广泛推广。这些技术在实际临床工作流程中的应用仍然有限。未来的研究应优先考虑前瞻性评估，亚组特定分析（例如，按种族，年龄或地理区域），并整合到临床决策支持系统中，以确保在医疗保健环境中稳健和公平的机器学习部署。提供了一个结构化的汇总表和概念管道图，以支持实际采用。

{"title":"Strategies for detecting and mitigating dataset shift in machine learning for health predictions: A systematic review","authors":"Gabriel Ferreira dos Santos Silva , Fabiano Novaes Barcellos Filho , Roberta Moreira Wichmann , Francisco Costa da Silva Junior , Alexandre Dias Porto Chiavegatto Filho","doi":"10.1016/j.jbi.2025.104902","DOIUrl":"10.1016/j.jbi.2025.104902","url":null,"abstract":"<div><h3>Objective</h3><div>This review aims to provide a comprehensive overview of the literature on methods and techniques for identifying and correcting dataset shift in machine learning (ML) applications for health predictions.</div></div><div><h3>Methods</h3><div>A systematic search was conducted across PubMed, IEEE Xplore, Scopus, and Web of Science, targeting articles published between January 1, 2019, and March 15, 2025. earch strings combined terms related to machine learning, healthcare, and dataset shift. A total of 32 studies were included, and were evaluated based on dataset shift types addressed, detection and correction strategies used, algorithmic choices, and reported impacts on model performance.</div></div><div><h3>Results</h3><div>The review identified a wide range of dataset shift types, with temporal shift and concept drift being the most commonly addressed. Model-based monitoring and statistical tests were the most frequent detection strategies, while retraining and feature engineering were the predominant correction approaches. Most methods demonstrate moderate interpretability, computational feasibility, and generalizability. However, a lack of standardized performance metrics and external validations limited the comparability of results across studies.</div></div><div><h3>Conclusion</h3><div>While several promising approaches for managing dataset shift in health-related ML models have been proposed, no single method emerged as broadly generalizable across use cases. The implementation of these techniques in real-world clinical workflows remains limited. Future research should prioritize prospective evaluations, subgroup-specific analyses (e.g., by race, age, or geographic region), and integration into clinical decision-support systems to ensure robust and equitable ML deployment in healthcare settings. A structured summary table and conceptual pipeline diagram are provided to support practical adoption.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"170 ","pages":"Article 104902"},"PeriodicalIF":4.5,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144926250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0