Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria
Zenan Sun, Cui Tao
IEEE International Conference on Healthcare Informatics, vol. 2023, pp. 558-564. Published 2023-06-01 (Epub 2023-12-11).
DOI: 10.1109/ichi57859.2023.00100
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10815931/pdf/
Abstract
Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide, and finding effective treatments is crucial. Clinical trials play an essential role in developing and testing new treatments for AD, but identifying eligible participants can be challenging, time-consuming, and costly. In recent years, natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), have helped automate the extraction of relevant information from eligibility criteria (EC), facilitating semi-automatic patient recruitment and making clinical trial data more FAIR (findable, accessible, interoperable, reusable). Nevertheless, most current biomedical NER models annotate only a restricted set of entity types that may not apply to clinical trial data. In addition, there are no established techniques for accurately normalizing entities that are negated with a negative prefix. In this paper, we introduce a pipeline for information extraction from AD clinical trial EC that comprises preprocessing of the EC data, clinical NER, and biomedical NEN to the Unified Medical Language System (UMLS). Our NER model identifies named entities in seven predefined categories, while our NEN model combines exact-match and partial-match search strategies with customized rules to accurately normalize entities with negative prefixes. To evaluate the pipeline, we measured precision, recall, and F1 score for the NER component and manually reviewed the top five mapping results produced by the NEN component. The evaluation showed that the pipeline can normalize named entities in clinical trial EC with high accuracy. The NER component achieved an overall F1 of 0.816 across the seven entity types, and the NEN component, using customized rules together with exact and partial match strategies, achieved an accuracy of 0.940 for normalized entities.
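The normalization strategy described in the abstract (exact match first, partial match as a fallback, plus a rule for negative prefixes) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy vocabulary, the placeholder identifiers (which are not real UMLS CUIs), the prefix list, and the token-overlap score are all assumptions chosen for illustration; the real pipeline maps to UMLS and uses rules that the abstract does not specify.

```python
import re

# Hypothetical toy vocabulary: concept name -> placeholder identifier.
# These IDs are made up for illustration and are NOT real UMLS CUIs.
VOCAB = {
    "alzheimer's disease": "TOY-001",
    "dementia": "TOY-002",
    "diabetes mellitus": "TOY-003",
    "pregnant": "TOY-004",
    "vascular dementia": "TOY-005",
}

# Assumed negative prefixes; the paper's actual rule set is not given here.
NEGATIVE_PREFIXES = ("non-", "non ", "not ", "no ")


def tokenize(text):
    """Split lowercase text into a set of simple word tokens."""
    return set(re.findall(r"[a-z0-9']+", text))


def strip_negative_prefix(mention):
    """Return (core mention, negated flag) after removing one leading negative prefix."""
    text = mention.strip().lower()
    for prefix in NEGATIVE_PREFIXES:
        if text.startswith(prefix):
            return text[len(prefix):].strip(), True
    return text, False


def normalize(mention, top_k=5):
    """Try an exact match; otherwise rank vocabulary entries by token overlap."""
    core, negated = strip_negative_prefix(mention)
    if core in VOCAB:
        return {"negated": negated, "candidates": [(core, VOCAB[core])]}
    tokens = tokenize(core)
    scored = []
    for name, concept_id in VOCAB.items():
        name_tokens = tokenize(name)
        overlap = len(tokens & name_tokens)
        if overlap:
            # Jaccard similarity serves as a simple partial-match score.
            scored.append((overlap / len(tokens | name_tokens), name, concept_id))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {"negated": negated, "candidates": [(n, c) for _, n, c in scored[:top_k]]}
```

In this sketch, `normalize("non-pregnant")` strips the prefix, finds `pregnant` by exact match, and records the negation as a flag, while a mention such as `mild dementia` falls through to the partial-match ranking. Returning up to `top_k` candidates loosely mirrors the manual review of the top five mappings mentioned above.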