Human-in-the-loop Language-agnostic Extraction of Medication Data from Highly Unstructured Electronic Health Records

2020 International Conference on Data Mining Workshops (ICDMW). Pub Date: 2020-11-01. DOI: 10.1109/ICDMW51313.2020.00091

Frank Ruis, Shreyasi Pathak, Jeroen Geerdink, J. H. Hegeman, C. Seifert, M. V. Keulen


Citations: 1

Abstract

Electronic health records contain important information written in free-form text. They are often highly unstructured and ungrammatical and contain misspellings and abbreviations, making it difficult to apply traditional natural language processing techniques. Annotated data is hard to come by due to restricted access, and supervised models often don't generalize well to other datasets. We propose a language-agnostic human-in-the-loop approach for extracting medication names from a large set of highly unstructured electronic health records, where we reach almost 97% recall on our test set after the second iteration while maintaining 100% precision. Starting with a bootstrap lexicon, we perform a context-based dictionary expansion curated by a human reviewer. The method can handle ambiguous lexicon entries and efficiently find fuzzy matches without producing false positives. The human review step ensures high precision, which is especially important in healthcare, and is not subject to disagreements with annotations from an external source. The code is available online at https://github.com/FrankRuis/medical_concept_extraction.
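The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a context-based dictionary expansion with a human review step could look like; it is not the authors' implementation, which lives in the repository linked above. All names here (`neighbour_features`, `context_candidates`, `expand_lexicon`, `fuzzy_match`, `RECORDS`) are hypothetical, the records are invented toy data, the context is assumed to be just the immediate left and right token, and `difflib.SequenceMatcher` stands in for whatever fuzzy matching the paper actually uses.

```python
from collections import Counter
from difflib import SequenceMatcher

# Invented toy records, for illustration only.
RECORDS = [
    "patient uses paracetamol 500mg daily",
    "patient uses ibuprofen twice daily",
    "patient uses crutches since surgery",
    "started metoprolol twice daily",
]


def neighbour_features(tokens, i):
    """Return the left and right neighbour of tokens[i] as tagged features."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {("L", left), ("R", right)}


def context_candidates(records, lexicon):
    """Rank unknown tokens by how many neighbour features they share
    with tokens that are already in the lexicon."""
    tokenized = [r.lower().split() for r in records]
    known = set()
    for tokens in tokenized:
        for i, tok in enumerate(tokens):
            if tok in lexicon:
                known |= neighbour_features(tokens, i)
    scores = Counter()
    for tokens in tokenized:
        for i, tok in enumerate(tokens):
            if tok not in lexicon:
                shared = len(neighbour_features(tokens, i) & known)
                if shared:
                    scores[tok] += shared
    return [tok for tok, _ in scores.most_common()]


def expand_lexicon(records, bootstrap, review, iterations=2):
    """Grow the lexicon; only candidates approved by `review` are added."""
    lexicon, rejected = set(bootstrap), set()
    for _ in range(iterations):
        for cand in context_candidates(records, lexicon):
            if cand in rejected:
                continue  # never ask the reviewer twice
            if review(cand):
                lexicon.add(cand)
            else:
                rejected.add(cand)
    return lexicon


def fuzzy_match(token, lexicon, threshold=0.85):
    """Return the closest lexicon entry if it is similar enough, else None."""
    best = max(lexicon, default=None,
               key=lambda e: SequenceMatcher(None, token, e).ratio())
    if best is not None and SequenceMatcher(None, token, best).ratio() >= threshold:
        return best
    return None


if __name__ == "__main__":
    # A set lookup stands in for the human reviewer in this demo.
    approved = {"ibuprofen", "metoprolol"}
    lexicon = expand_lexicon(RECORDS, {"paracetamol"}, review=lambda c: c in approved)
    print(sorted(lexicon))
    print(fuzzy_match("paracetemol", lexicon))
```

In this toy run the first pass proposes both a medication and a non-medication ("crutches") because they share a left neighbour with the seed term; the reviewer callback rejects the latter, which is how a review step can keep precision high. The second pass reaches a term that only shares context with an entry accepted in the first pass, which illustrates why recall can improve across iterations.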


Source venue

2020 International Conference on Data Mining Workshops (ICDMW)

Self-citation rate

0.00%

Latest papers from this venue

Synthetic Data by Principal Component Analysis

Deep Contextualized Word Embedding for Text-based Online User Profiling to Detect Social Bots on Twitter

Integration of Fuzzy and Deep Learning in Three-Way Decisions

Mining Heterogeneous Data for Formulation Design

Restructuring of Hoeffding Trees for Trapezoidal Data Streams