Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study

JMIR AI | Pub Date: 2024-05-16 | DOI: 10.2196/52095
Zoltan P. Majdik, S. S. Graham, Jade C Shiva Edward, Sabrina N Rodriguez, M. S. Karnes, Jared T Jensen, Joshua B Barbour, Justin F. Rousseau
{"title":"微调命名实体识别任务大型语言模型的样本量考虑因素:方法论研究","authors":"Zoltan P. Majdik, S. S. Graham, Jade C Shiva Edward, Sabrina N Rodriguez, M. S. Karnes, Jared T Jensen, Joshua B Barbour, Justin F. Rousseau","doi":"10.2196/52095","DOIUrl":null,"url":null,"abstract":"\n \n Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.\n \n \n \n This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.\n \n \n \n A random sample of 200 disclosure statements was prepared for annotation. All “PERSON” and “ORG” entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.\n \n \n \n Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases ( P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.\n \n \n \n Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture’s intended use (text generation vs text processing or classification) may be as, or more, important as training data volume and model parameter size.\n","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"48 20","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study\",\"authors\":\"Zoltan P. Majdik, S. S. Graham, Jade C Shiva Edward, Sabrina N Rodriguez, M. S. Karnes, Jared T Jensen, Joshua B Barbour, Justin F. 
Rousseau\",\"doi\":\"10.2196/52095\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n \\n Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.\\n \\n \\n \\n This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.\\n \\n \\n \\n A random sample of 200 disclosure statements was prepared for annotation. All “PERSON” and “ORG” entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.\\n \\n \\n \\n Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases ( P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38.\\n \\n \\n \\n Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. 
Training data quality and a model architecture’s intended use (text generation vs text processing or classification) may be as, or more, important as training data volume and model parameter size.\\n\",\"PeriodicalId\":73551,\"journal\":{\"name\":\"JMIR AI\",\"volume\":\"48 20\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR AI\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2196/52095\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/52095","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Background: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking.

Objective: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements.

Methods: A random sample of 200 disclosure statements was prepared for annotation. All “PERSON” and “ORG” entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density.

Results: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant, with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size as measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS, with point estimates between 1.36 and 1.38.

Conclusions: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture’s intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
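The abstract does not include the authors' training code. As an illustration of the fine-tuning step described under Methods, the sketch below fine-tunes a small BERT-family model for token-classification NER over the PERSON/ORG label scheme using the Hugging Face transformers and datasets libraries. The two-sentence toy corpus, the distilbert-base-cased checkpoint, and all hyperparameters are placeholder assumptions, not the study's settings.

```python
# Minimal illustrative sketch (not the authors' code): fine-tune a
# BERT-family model for NER on PERSON/ORG entities.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PERSON", "I-PERSON", "B-ORG", "I-ORG"]
model_name = "distilbert-base-cased"  # small stand-in for the study's BERT/GPT variants

# Toy stand-in for the annotated disclosure statements, which are not public here.
raw = Dataset.from_dict({
    "tokens": [["Dr", "Smith", "reports", "fees", "from", "Acme", "Pharma"],
               ["No", "conflicts", "declared", "by", "Jane", "Doe"]],
    "ner_tags": [[0, 1, 0, 0, 0, 3, 4],
                 [0, 0, 0, 0, 1, 2]],
})

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_and_align(batch):
    # Tokenize pre-split words and align word-level tags to subword tokens.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        lab, prev = [], None
        for wid in word_ids:
            # Label only the first subword of each word; mask the rest with -100.
            lab.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(lab)
    return enc

train = raw.map(tokenize_and_align, batched=True, remove_columns=raw.column_names)

model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-sketch", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```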
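The Methods paragraph describes assessing F1-score as a function of sample size (sentences) and entity density (EPS) with two-predictor multiple linear regression. A minimal sketch of that style of analysis follows, using statsmodels on synthetic placeholder data; the coefficients and ranges are invented for illustration and are not the study's results.

```python
# Illustrative two-predictor regression: F1 ~ sentences + EPS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
sentences = rng.uniform(50, 1000, n)   # training-set size per subsample
eps = rng.uniform(0.5, 2.5, n)         # entities per sentence
# Synthetic F1-scores with a mild dependence on both predictors.
f1 = 0.6 + 0.0002 * sentences + 0.05 * eps + rng.normal(0, 0.02, n)

X = sm.add_constant(np.column_stack([sentences, eps]))
fit = sm.OLS(f1, X).fit()
print(fit.summary())  # reports coefficients, P values, and multiple R^2
```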
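The diminishing-returns analysis uses single-predictor threshold regression. One common form, assumed here for illustration, is a broken-stick (piecewise-linear) model whose breakpoint is estimated from the data. The sketch below fits such a model with scipy on synthetic data; the fitted breakpoint plays the role of the paper's point estimates (439 to 527 sentences).

```python
# Illustrative single-predictor threshold (breakpoint) regression.
import numpy as np
from scipy.optimize import curve_fit

def broken_stick(x, x0, y0, slope_lo, slope_hi):
    # Two line segments joined at the threshold x0, meeting at height y0.
    return np.where(x < x0,
                    y0 + slope_lo * (x - x0),
                    y0 + slope_hi * (x - x0))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(50, 1000, 300))            # training sentences
y = broken_stick(x, 480, 0.92, 4e-4, 2e-5) \
    + rng.normal(0, 0.01, 300)                     # synthetic F1-scores

# Initial guesses (p0) help the optimizer locate the breakpoint.
params, _ = curve_fit(broken_stick, x, y, p0=[400, 0.9, 1e-3, 1e-5])
print(f"estimated threshold: {params[0]:.0f} sentences")  # ~480 here
```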