Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing.

Methods of Information in Medicine (IF 1.8; Zone 4, Medicine; JCR Q3, Computer Science, Information Systems). Pub Date: 2020-12-01. Epub Date: 2020-10-14. DOI: 10.1055/s-0040-1716403

Antje Wulff, Marcel Mast, Marcus Hassler, Sara Montag, Michael Marschollek, Thomas Jack

Volume 59 S 02, pages e64-e78. Publication type: Journal Article.

Citations: 9

Abstract

Background: Merging disparate and heterogeneous datasets from clinical routine into a standardized and semantically enriched format, so that the data can be used for multiple purposes, also means incorporating unstructured data such as medical free texts. Although the extraction of structured data from texts by means of natural language processing (NLP) has been researched extensively, at least for the English language, it is not enough to obtain structured output in an arbitrary format. NLP techniques need to be combined with clinical information standards such as openEHR so that data that are still unstructured can be reused and exchanged sensibly.

Objectives: The aim of the study is to automatically extract crucial information from medical free texts and to transform these unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for processing pediatric medical histories.

Methods: We constructed a pipeline that allows medical free texts such as pediatric medical histories to be reused in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers that serves as the expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of the Hannover Medical School.
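
The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the text markers, the regular expression, the concept names, and the mapping rules are all hypothetical, and the two archetype identifiers are chosen only for illustration and are not necessarily among the archetypes reused in the study.

```python
import re
from dataclasses import dataclass

# Step 2 (assumed): a tiny German dictionary of text markers -> concept name.
# These entries are invented for illustration.
TEXT_MARKERS = {
    "frühgeburt": "preterm_birth",
    "frühgeborenes": "preterm_birth",
    "asthma": "asthma",
}

# Assumed regular expression for a body weight such as "1,4 kg" or "3.2 kg".
WEIGHT_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*kg", re.IGNORECASE)

# Steps 1 and 3 (assumed): mapping rules from concept to archetype and element.
MAPPING_RULES = {
    "preterm_birth": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "Problem/Diagnosis name"),
    "asthma": ("openEHR-EHR-EVALUATION.problem_diagnosis.v1", "Problem/Diagnosis name"),
    "body_weight": ("openEHR-EHR-OBSERVATION.body_weight.v2", "Weight"),
}


@dataclass(frozen=True)
class Finding:
    concept: str  # normalized concept name
    value: str    # matched marker or extracted value


def extract(text: str) -> list[Finding]:
    """Find dictionary markers and regex matches in a free-text history."""
    findings = []
    lowered = text.lower()
    for marker, concept in TEXT_MARKERS.items():
        # Whole-word match so that a marker inside a longer word is ignored.
        if re.search(rf"\b{re.escape(marker)}\b", lowered):
            findings.append(Finding(concept, marker))
    for match in WEIGHT_RE.finditer(text):
        # Normalize the German decimal comma to a decimal point.
        findings.append(Finding("body_weight", match.group(1).replace(",", ".")))
    return findings


def map_to_archetypes(findings: list[Finding]) -> list[dict]:
    """Apply the mapping rules; findings without a rule are skipped."""
    entries = []
    for finding in findings:
        rule = MAPPING_RULES.get(finding.concept)
        if rule is None:
            continue
        archetype_id, element = rule
        entries.append({
            "archetype_id": archetype_id,
            "element": element,
            "value": finding.value,
        })
    return entries


history = "Frühgeburt in der 30. SSW, Geburtsgewicht 1,4 kg. Bekanntes Asthma."
entries = map_to_archetypes(extract(history))
```

Keeping the dictionary, the regular expressions, and the mapping rules as separate tables mirrors the separation described in the abstract: the knowledge base can grow without touching the mapping to the information models.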

Results: We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus of 776 entries for the automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
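
For readers unfamiliar with the two metrics, the following sketch shows how precision and recall can be computed by comparing extracted snippets against manual annotations. The (document, concept) pairs are invented for illustration and are not data from the study; the function name and the set-based comparison are assumptions, not the authors' evaluation procedure.

```python
def precision_recall(gold: set, predicted: set) -> tuple[float, float]:
    """Compare predicted items against manually annotated (gold) items."""
    true_positives = len(gold & predicted)   # extracted and annotated
    false_positives = len(predicted - gold)  # extracted but not annotated
    false_negatives = len(gold - predicted)  # annotated but missed
    # Precision: share of extracted items that are correct.
    denominator_p = true_positives + false_positives
    precision = true_positives / denominator_p if denominator_p else 0.0
    # Recall: share of annotated items that were found.
    denominator_r = true_positives + false_negatives
    recall = true_positives / denominator_r if denominator_r else 0.0
    return precision, recall


# Invented example annotations, not taken from the study.
gold = {("doc1", "asthma"), ("doc1", "preterm_birth"), ("doc2", "asthma")}
predicted = {("doc1", "asthma"), ("doc2", "asthma"), ("doc2", "fever")}
precision, recall = precision_recall(gold, predicted)
```

In this invented example two of the three extracted items are correct and two of the three annotated items are found, so both values are 2/3.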

Conclusion: The use of NLP and openEHR archetypes was demonstrated to be a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases concerning German medical free texts. In the long term, this will harness unstructured clinical data for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that is capable of automatically assessing a patient's risk status based on the patient's medical history at the time of admission.

Source journal

Methods of Information in Medicine (Medicine; Computer Science: Information Systems)

CiteScore: 3.70

Self-citation rate: 11.80%

Review time: 6-12 weeks

About the journal: Good medicine and good healthcare demand good information. Since the journal's founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal issue.