Automated labelling of radiology reports using natural language processing: Comparison of traditional and newer methods

Seo Yi Chng, Paul J. W. Tern, Matthew R. X. Kan, Lionel T. E. Cheng
Health Care Science, Volume 2, Issue 2, pages 120–128
DOI: 10.1002/hcs2.40 · Published 2023-04-24 · Journal Article
Open access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1002/hcs2.40
Citations: 1

Abstract



Automated labelling of radiology reports using natural language processing allows for the labelling of ground truth for large datasets of radiological studies that are required for training of computer vision models. This paper explains the necessary data preprocessing steps, reviews the main methods for automated labelling and compares their performance. There are four main methods of automated labelling, namely: (1) rules-based text-matching algorithms, (2) conventional machine learning models, (3) neural network models and (4) Bidirectional Encoder Representations from Transformers (BERT) models. Rules-based labellers perform a brute force search against manually curated keywords and are able to achieve high F1 scores. However, they require proper handling of negative words. Machine learning models require preprocessing that involves tokenization and vectorization of text into numerical vectors. Multilabel classification approaches are required in labelling radiology reports and conventional models can achieve good performance if they have large enough training sets. Deep learning models make use of connected neural networks, often a long short-term memory network, and are similarly able to achieve good performance if trained on a large data set. BERT is a transformer-based model that utilizes attention. Pretrained BERT models only require fine-tuning with small data sets. In particular, domain-specific BERT models can achieve superior performance compared with the other methods for automated labelling.
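The negation handling that the abstract flags as essential for rules-based labellers can be sketched with a simple window check before each matched keyword. The keyword list, negation cues, and three-token window below are illustrative assumptions for this sketch, not the paper's curated lists:

```python
import re

# Hypothetical keyword list and single-word negation cues (assumptions
# made for this example, not taken from the paper).
KEYWORDS = {"pneumothorax", "effusion", "consolidation"}
NEGATION_CUES = {"no", "without", "negative", "denies"}

def label_report(report: str) -> dict:
    """Return a 0/1 label per keyword, flipping to 0 when a negation
    cue appears within the three tokens preceding the keyword."""
    tokens = re.findall(r"[a-z]+", report.lower())
    labels = {}
    for kw in KEYWORDS:
        labels[kw] = 0
        for i, tok in enumerate(tokens):
            if tok == kw:
                window = tokens[max(0, i - 3):i]  # look back 3 tokens
                labels[kw] = 0 if NEGATION_CUES & set(window) else 1
    return labels
```

A production rules-based labeller would also need scope terminators (e.g. "but", sentence boundaries) and multi-word cues such as "negative for", which this token-window sketch deliberately omits.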
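The preprocessing the abstract describes for machine learning models (tokenization, then vectorization of text into numerical vectors) can be illustrated with a minimal bag-of-words encoder; the two toy reports and the whitespace tokenizer are invented for this example:

```python
from collections import Counter

def build_vocab(corpus):
    """Map each distinct lower-cased token to a fixed column index."""
    tokens = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(doc, vocab):
    """Encode one report as a token-count vector over the shared vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts.get(tok, 0) for tok in vocab]

reports = ["no pleural effusion", "large pleural effusion"]
vocab = build_vocab(reports)              # {'effusion': 0, 'large': 1, 'no': 2, 'pleural': 3}
vectors = [vectorize(r, vocab) for r in reports]
```

These count vectors (or TF-IDF-weighted variants) are what a conventional multilabel classifier would consume, typically with one binary output per finding.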
