Designing an openEHR-Based Pipeline for Extracting and Standardizing Unstructured Clinical Data Using Natural Language Processing.

Methods of Information in Medicine (IF 1.3; Zone 4, Medicine; JCR Q3, Computer Science, Information Systems). Pub Date: 2020-12-01. Epub Date: 2020-10-14. DOI: 10.1055/s-0040-1716403
Antje Wulff, Marcel Mast, Marcus Hassler, Sara Montag, Michael Marschollek, Thomas Jack
Methods of Information in Medicine, vol. 59 S 02, pp. e64-e78. Publication type: Journal Article.
Citations: 9

Abstract

Background: Merging disparate, heterogeneous datasets from clinical routine into a standardized, semantically enriched format that enables multiple uses of the data also means incorporating unstructured data such as medical free texts. Extracting structured data from text with natural language processing (NLP) has been researched extensively, at least for the English language, but a structured output in an arbitrary format is not enough. NLP techniques need to be combined with clinical information standards such as openEHR so that data that is still unstructured can be reused and exchanged sensibly.

Objectives: The aim of the study is to automatically extract crucial information from medical free texts and to transform this unstructured clinical data into a standardized and structured representation by designing and implementing an exemplary pipeline for the processing of pediatric medical histories.

Methods: We constructed a pipeline that allows medical free texts such as pediatric medical histories to be reused in a structured and standardized way by (1) selecting and modeling appropriate openEHR archetypes as standard clinical information models, (2) defining a German dictionary with crucial text markers that serves as the expert knowledge base for an NLP pipeline, and (3) creating mapping rules between the NLP output and the archetypes. The approach was evaluated in a first pilot study using 50 manually annotated medical histories from the pediatric intensive care unit of Hannover Medical School.
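
The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the marker entries, the regular expression, the archetype identifiers, and the element paths are all hypothetical placeholders chosen for illustration, and the fuzzy matching via `difflib` merely stands in for whatever spelling correction the real pipeline uses.

```python
import re
from difflib import get_close_matches

# Hypothetical German text markers mapped to a concept label (not from the paper).
TEXT_MARKERS = {
    "fieber": "fever",
    "husten": "cough",
    "erbrechen": "vomiting",
}

# Hypothetical regular expression for a body temperature mention such as "39,2 °C".
TEMPERATURE_RE = re.compile(r"(\d{2}(?:[.,]\d)?)\s*°\s*C", re.IGNORECASE)

# Hypothetical mapping rules: concept -> (placeholder archetype id, placeholder path).
MAPPING_RULES = {
    "fever": ("example-symptom-archetype", "/fever/present"),
    "cough": ("example-symptom-archetype", "/cough/present"),
    "vomiting": ("example-symptom-archetype", "/vomiting/present"),
    "body_temperature": ("example-temperature-archetype", "/temperature/magnitude"),
}


def normalize(token, vocabulary):
    """Return the closest dictionary entry for a possibly misspelled token, or None."""
    token = token.lower()
    if token in vocabulary:
        return token
    matches = get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


def extract(text):
    """Collect (concept, value) pairs from dictionary markers and the regex."""
    findings = []
    vocabulary = list(TEXT_MARKERS)
    for token in re.findall(r"[A-Za-zÄÖÜäöüß]+", text):
        marker = normalize(token, vocabulary)
        if marker is not None:
            findings.append((TEXT_MARKERS[marker], True))
    match = TEMPERATURE_RE.search(text)
    if match:
        # German texts often use a decimal comma; convert it before parsing.
        findings.append(("body_temperature", float(match.group(1).replace(",", "."))))
    return findings


def to_archetype_paths(findings):
    """Apply the mapping rules to obtain a flat path -> value record."""
    record = {}
    for concept, value in findings:
        rule = MAPPING_RULES.get(concept)
        if rule is not None:
            archetype, path = rule
            record[archetype + path] = value
    return record


# "Fiber" is a deliberate misspelling of "Fieber" to exercise the correction step.
sample = "Seit zwei Tagen Fiber und Husten, Temperatur 39,2 °C."
findings = extract(sample)
record = to_archetype_paths(findings)
```

Keeping the dictionary, the regular expressions, and the mapping rules as separate data structures mirrors the separation the abstract describes, so each can be extended without touching the others.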

Results: We successfully reused 24 existing international archetypes to represent the most crucial elements of unstructured pediatric medical histories in a standardized form. The self-developed NLP pipeline was constructed by defining 3,055 text marker entries, 132 text events, 66 regular expressions, and a text corpus of 776 entries for automatic correction of spelling mistakes. A total of 123 mapping rules were implemented to transform the extracted snippets into an openEHR-based representation so that they can be stored together with other structured data in an existing openEHR-based data repository. In the first evaluation, the NLP pipeline yielded 97% precision and 94% recall.
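
For readers unfamiliar with the two evaluation metrics, the following Python sketch shows how precision and recall can be computed by comparing extracted items against manual annotations. The annotation tuples are invented toy data, not the study's data, and the abstract does not state at which granularity the authors counted matches; the set-based comparison here is an assumption.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two sets of (document, concept) pairs."""
    true_positives = len(predicted & gold)
    # Precision: share of extracted items that the annotators also marked.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of annotated items that the pipeline found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only (hypothetical documents and concepts).
gold = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "vomiting"), ("doc2", "fever")}
predicted = {("doc1", "fever"), ("doc1", "cough"), ("doc2", "fever"), ("doc2", "cough")}

precision, recall = precision_recall(predicted, gold)
```

In this toy case three of the four predicted pairs are correct and three of the four annotated pairs are found, so both values are 0.75.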

Conclusion: The use of NLP and openEHR archetypes was demonstrated to be a viable approach for extracting and representing important information from pediatric medical histories in a structured and semantically enriched format. We designed a promising approach with the potential to be generalized, and implemented a prototype that is extensible and reusable for other use cases involving German medical free texts. In the long term, this will make unstructured clinical data usable for further research purposes such as the design of clinical decision support systems. Together with structured data already integrated in openEHR-based representations, we aim to develop an interoperable openEHR-based application that can automatically assess a patient's risk status based on the patient's medical history at the time of admission.
