{"title":"Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments.","authors":"Jürgen Dietrich, André Hollstein","doi":"10.1007/s40264-024-01499-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Recent artificial intelligence (AI) advances can generate human-like responses to a wide range of queries, making them a useful tool for healthcare applications. Therefore, the potential use of large language models (LLMs) in controlled environments regarding efficacy, reproducibility, and operability will be of paramount interest.</p><p><strong>Objective: </strong>We investigated if and how GPT 3.5 and GPT 4 models can be directly used as a part of a GxP validated system and compared the performance of externally hosted GPT 3.5 and GPT 4 against LLMs, which can be hosted internally. We explored zero-shot LLM performance for named entity recognition (NER) and relation extraction tasks, investigated which LLM has the best zero-shot performance to be used potentially for generating training data proposals, evaluated the LLM performance of seven entities for medical NER in zero-shot experiments, selected one model for further performance improvement (few-shot and fine-tuning: Zephyr-7b-beta), and investigated how smaller open-source LLMs perform in contrast to GPT models and to a small fine-tuned T5 Base.</p><p><strong>Methods: </strong>We performed reproducibility experiments to evaluate if LLMs can be used in controlled environments and utilized guided generation to use the same prompt across multiple models. Few-shot learning and quantized low rank adapter (QLoRA) fine-tuning were applied to further improve LLM performance.</p><p><strong>Results and conclusion: </strong>We demonstrated that zero-shot GPT 4 performance is comparable with a fine-tuned T5, and Zephyr performed better than zero-shot GPT 3.5, but the recognition of product combinations such as product event combination was significantly better by using a fine-tuned T5. Although Open AI launched recently GPT versions to improve the generation of consistent output, both GPT variants failed to demonstrate reproducible results. The lack of reproducibility together with limitations of external hosted systems to keep validated systems in a state of control may affect the use of closed and proprietary models in regulated environments. However, due to the good NER performance, we recommend using GPT for creating annotation proposals for training data as a basis for fine-tuning.</p>","PeriodicalId":11382,"journal":{"name":"Drug Safety","volume":" ","pages":"287-303"},"PeriodicalIF":4.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829833/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Drug Safety","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s40264-024-01499-1","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/11 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0
Abstract
Introduction: Recent advances in artificial intelligence (AI) have produced models that generate human-like responses to a wide range of queries, making them useful tools for healthcare applications. The potential use of large language models (LLMs) in controlled environments is therefore of paramount interest with regard to efficacy, reproducibility, and operability.
Objective: We investigated whether and how GPT 3.5 and GPT 4 models can be used directly as part of a GxP-validated system and compared the performance of the externally hosted GPT 3.5 and GPT 4 against LLMs that can be hosted internally. We explored zero-shot LLM performance on named entity recognition (NER) and relation extraction tasks, investigated which LLM has the best zero-shot performance for potentially generating training data proposals, evaluated zero-shot LLM performance on seven entities for medical NER, selected one model (Zephyr-7b-beta) for further performance improvement through few-shot learning and fine-tuning, and investigated how smaller open-source LLMs perform in contrast to the GPT models and to a small fine-tuned T5 Base.
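To illustrate the zero-shot NER setup described above, a minimal prompt-construction sketch follows. The entity labels and example sentence are hypothetical placeholders for illustration only; they are not the seven entities or annotation schema used in the study.

```python
import json

# Hypothetical entity labels for illustration; the paper's seven medical NER
# entities are not reproduced here.
ENTITY_LABELS = ["drug", "adverse_event", "dose"]

def build_zero_shot_ner_prompt(text: str) -> str:
    """Build a zero-shot NER prompt that asks the model for structured JSON output."""
    return (
        "Extract all entities of the following types from the text: "
        + ", ".join(ENTITY_LABELS)
        + ".\nReturn a JSON list of objects with the keys 'text' and 'label'.\n"
        + f"Text: {text}\n"
        + "Entities:"
    )

prompt = build_zero_shot_ner_prompt(
    "The patient developed a rash after taking 50 mg of drug X."
)
print(prompt)
# The model's response would then be parsed, e.g.:
# entities = json.loads(response_text)
```

In a zero-shot setting no labeled examples are included in the prompt; a few-shot variant would simply append a handful of annotated sentences before the target text.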
Methods: We performed reproducibility experiments to evaluate whether LLMs can be used in controlled environments and used guided generation so that the same prompt could be applied across multiple models. Few-shot learning and quantized low-rank adapter (QLoRA) fine-tuning were applied to further improve LLM performance.
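For the QLoRA fine-tuning step, a minimal configuration sketch using the Hugging Face transformers and peft libraries is shown below. The hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not the values used in the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceH4/zephyr-7b-beta"  # the open-source model selected in the study

# Load the base model in 4-bit precision (the "quantized" part of QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach low-rank adapters; only these small matrices are trained, while the
# quantized base weights stay frozen. Values below are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of all weights
```

The appeal of this approach in the paper's context is that a 7B-parameter model can be adapted on modest internal hardware, which matters when external hosting is not an option.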
Results and conclusion: We demonstrated that zero-shot GPT 4 performance is comparable with a fine-tuned T5, and Zephyr performed better than zero-shot GPT 3.5, but recognition of product combinations, such as product-event combinations, was significantly better with a fine-tuned T5. Although OpenAI recently released GPT versions intended to improve the consistency of generated output, both GPT variants failed to produce reproducible results. The lack of reproducibility, together with the limitations of externally hosted systems in keeping validated systems in a state of control, may affect the use of closed, proprietary models in regulated environments. However, given its good NER performance, we recommend using GPT to create annotation proposals for training data as a basis for fine-tuning.
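The reproducibility concern can be illustrated with a minimal sketch: the same request is sent repeatedly with temperature 0 and a fixed seed (the determinism-oriented settings OpenAI introduced), and the outputs are compared. The model name, seed, and prompt below are placeholders, not the study's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = "Extract all drug names from: 'The patient received drug X and drug Y.'"

def run_once() -> str:
    """Send the same request with determinism-oriented settings."""
    response = client.chat.completions.create(
        model="gpt-4",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,                # OpenAI's best-effort reproducibility feature
    )
    return response.choices[0].message.content

outputs = [run_once() for _ in range(5)]
# If the outputs differ across identical requests, the run is not reproducible.
print("identical across runs:", len(set(outputs)) == 1)
```

A check of this kind, repeated over a representative prompt set, is what distinguishes best-effort consistency from the bitwise reproducibility expected of a validated system.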
Journal overview:
Drug Safety is the official journal of the International Society of Pharmacovigilance. The journal includes:
Overviews of contentious or emerging issues.
Comprehensive narrative reviews that provide an authoritative source of information on epidemiology, clinical features, prevention and management of adverse effects of individual drugs and drug classes.
In-depth benefit-risk assessment of adverse effect and efficacy data for a drug in a defined therapeutic area.
Systematic reviews (with or without meta-analyses) that collate empirical evidence to answer a specific research question, using explicit, systematic methods as outlined by the PRISMA statement.
Original research articles reporting the results of well-designed studies in disciplines such as pharmacoepidemiology, pharmacovigilance, pharmacology and toxicology, and pharmacogenomics.
Editorials and commentaries on topical issues.
Additional digital features (including animated abstracts, video abstracts, slide decks, audio slides, instructional videos, infographics, podcasts and animations) can be published with articles; these are designed to increase the visibility, readership and educational value of the journal’s content. In addition, articles published in Drug Safety may be accompanied by plain language summaries to assist readers who have some knowledge of, but not in-depth expertise in, the area in understanding important medical advances.