{"title":"Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments.","authors":"Jürgen Dietrich, André Hollstein","doi":"10.1007/s40264-024-01499-1","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Recent artificial intelligence (AI) advances can generate human-like responses to a wide range of queries, making them a useful tool for healthcare applications. Therefore, the potential use of large language models (LLMs) in controlled environments regarding efficacy, reproducibility, and operability will be of paramount interest.</p><p><strong>Objective: </strong>We investigated if and how GPT 3.5 and GPT 4 models can be directly used as a part of a GxP validated system and compared the performance of externally hosted GPT 3.5 and GPT 4 against LLMs, which can be hosted internally. We explored zero-shot LLM performance for named entity recognition (NER) and relation extraction tasks, investigated which LLM has the best zero-shot performance to be used potentially for generating training data proposals, evaluated the LLM performance of seven entities for medical NER in zero-shot experiments, selected one model for further performance improvement (few-shot and fine-tuning: Zephyr-7b-beta), and investigated how smaller open-source LLMs perform in contrast to GPT models and to a small fine-tuned T5 Base.</p><p><strong>Methods: </strong>We performed reproducibility experiments to evaluate if LLMs can be used in controlled environments and utilized guided generation to use the same prompt across multiple models. Few-shot learning and quantized low rank adapter (QLoRA) fine-tuning were applied to further improve LLM performance.</p><p><strong>Results and conclusion: </strong>We demonstrated that zero-shot GPT 4 performance is comparable with a fine-tuned T5, and Zephyr performed better than zero-shot GPT 3.5, but the recognition of product combinations such as product event combination was significantly better by using a fine-tuned T5. Although Open AI launched recently GPT versions to improve the generation of consistent output, both GPT variants failed to demonstrate reproducible results. The lack of reproducibility together with limitations of external hosted systems to keep validated systems in a state of control may affect the use of closed and proprietary models in regulated environments. However, due to the good NER performance, we recommend using GPT for creating annotation proposals for training data as a basis for fine-tuning.</p>","PeriodicalId":11382,"journal":{"name":"Drug Safety","volume":" ","pages":"287-303"},"PeriodicalIF":4.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829833/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Drug Safety","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s40264-024-01499-1","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/11 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0
Abstract
Introduction: Recent advances in artificial intelligence (AI) have produced models that generate human-like responses to a wide range of queries, making them useful tools for healthcare applications. The potential use of large language models (LLMs) in controlled environments is therefore of paramount interest with regard to efficacy, reproducibility, and operability.
Objective: We investigated whether and how GPT 3.5 and GPT 4 models can be used directly as part of a GxP-validated system and compared the performance of the externally hosted GPT 3.5 and GPT 4 against LLMs that can be hosted internally. We explored zero-shot LLM performance on named entity recognition (NER) and relation extraction tasks, investigated which LLM has the best zero-shot performance for potentially generating training data proposals, evaluated zero-shot LLM performance on seven entities for medical NER, selected one model (Zephyr-7b-beta) for further performance improvement through few-shot learning and fine-tuning, and investigated how smaller open-source LLMs perform in contrast to the GPT models and to a small fine-tuned T5 Base.
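To illustrate the zero-shot NER setup described above, a minimal prompt-construction sketch follows. The entity labels and example sentence are hypothetical placeholders for illustration only; they are not the seven entities or annotation schema used in the study.

```python
import json

# Hypothetical entity labels for illustration; the paper's seven medical NER
# entities are not reproduced here.
ENTITY_LABELS = ["drug", "adverse_event", "dose"]

def build_zero_shot_ner_prompt(text: str) -> str:
    """Build a zero-shot NER prompt that asks the model for structured JSON output."""
    return (
        "Extract all entities of the following types from the text: "
        + ", ".join(ENTITY_LABELS)
        + ".\nReturn a JSON list of objects with the keys 'text' and 'label'.\n"
        + f"Text: {text}\n"
        + "Entities:"
    )

prompt = build_zero_shot_ner_prompt(
    "The patient developed a rash after taking 50 mg of drug X."
)
print(prompt)
# The model's response would then be parsed, e.g.:
# entities = json.loads(response_text)
```

In a zero-shot setting no labeled examples are included in the prompt; a few-shot variant would simply append a handful of annotated sentences before the target text.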
Methods: We performed reproducibility experiments to evaluate whether LLMs can be used in controlled environments and used guided generation so that the same prompt could be applied across multiple models. Few-shot learning and quantized low-rank adapter (QLoRA) fine-tuning were applied to further improve LLM performance.
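For the QLoRA fine-tuning step, a minimal configuration sketch using the Hugging Face transformers and peft libraries is shown below. The hyperparameters (rank, alpha, dropout, target modules) are illustrative assumptions, not the values used in the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "HuggingFaceH4/zephyr-7b-beta"  # the open-source model selected in the study

# Load the base model in 4-bit precision (the "quantized" part of QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Attach low-rank adapters; only these small matrices are trained, while the
# quantized base weights stay frozen. Values below are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of all weights
```

The appeal of this approach in the paper's context is that a 7B-parameter model can be adapted on modest internal hardware, which matters when external hosting is not an option.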
Results and conclusion: We demonstrated that zero-shot GPT 4 performance is comparable with a fine-tuned T5, and Zephyr performed better than zero-shot GPT 3.5, but recognition of product combinations, such as product-event combinations, was significantly better with a fine-tuned T5. Although OpenAI recently released GPT versions intended to improve the consistency of generated output, both GPT variants failed to produce reproducible results. The lack of reproducibility, together with the limitations of externally hosted systems in keeping validated systems in a state of control, may affect the use of closed, proprietary models in regulated environments. However, given its good NER performance, we recommend using GPT to create annotation proposals for training data as a basis for fine-tuning.
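The reproducibility concern can be illustrated with a minimal sketch: the same request is sent repeatedly with temperature 0 and a fixed seed (the determinism-oriented settings OpenAI introduced), and the outputs are compared. The model name, seed, and prompt below are placeholders, not the study's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = "Extract all drug names from: 'The patient received drug X and drug Y.'"

def run_once() -> str:
    """Send the same request with determinism-oriented settings."""
    response = client.chat.completions.create(
        model="gpt-4",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,                # OpenAI's best-effort reproducibility feature
    )
    return response.choices[0].message.content

outputs = [run_once() for _ in range(5)]
# If the outputs differ across identical requests, the run is not reproducible.
print("identical across runs:", len(set(outputs)) == 1)
```

A check of this kind, repeated over a representative prompt set, is what distinguishes best-effort consistency from the bitwise reproducibility expected of a validated system.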
Journal overview:
Drug Safety is the official journal of the International Society of Pharmacovigilance. The journal includes:
Overviews of contentious or emerging issues.
Comprehensive narrative reviews that provide an authoritative source of information on epidemiology, clinical features, prevention and management of adverse effects of individual drugs and drug classes.
In-depth benefit-risk assessment of adverse effect and efficacy data for a drug in a defined therapeutic area.
Systematic reviews (with or without meta-analyses) that collate empirical evidence to answer a specific research question, using explicit, systematic methods as outlined by the PRISMA statement.
Original research articles reporting the results of well-designed studies in disciplines such as pharmacoepidemiology, pharmacovigilance, pharmacology and toxicology, and pharmacogenomics.
Editorials and commentaries on topical issues.
Additional digital features (including animated abstracts, video abstracts, slide decks, audio slides, instructional videos, infographics, podcasts and animations) can be published with articles; these are designed to increase the visibility, readership and educational value of the journal’s content. In addition, articles published in Drug Safety may be accompanied by plain language summaries to assist readers who have some knowledge of, but not in-depth expertise in, the area in understanding important medical advances.