SPOT the Drug! An Unsupervised Pattern Matching Method to Extract Drug Names from Very Large Clinical Corpora

2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology. Pub Date: 2012-09-27. DOI: 10.1109/HISB.2012.16

A. Coden, D. Gruhl, Neal Lewis, M. Tanenblatt, J. Terdiman


Citations: 43

Abstract

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. “Understanding” these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is critical not only for understanding patient health (e.g., for drug-drug or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but keeping them up-to-date in the face of a rapidly advancing field is even more challenging; for instance, it is critical to identify grapefruit as a “drug” for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating algorithms that identify pharmacological substances both accurately and quickly. In this paper we propose a new algorithm, SPOT, which identifies drug names in a large corpus that can be used as new dictionary entries, where a “drug” is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. We present precision and recall values for SPOT, measured against a manually annotated reference corpus. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date, and can also suggest words and phrases that may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text.
SPOT is a generalized bootstrapping algorithm that is seeded with a known dictionary and automatically extracts the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.
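The bootstrapping idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract names the three context features but does not define them, so the definitions used here are assumptions made for illustration only (support as the number of distinct seed terms seen in a context, confidence as the share of a context's fillers that are already in the dictionary, prevalence as how often the context occurs at all). The function name `extract_candidates`, the one-word-left and one-word-right context window, the threshold defaults, and the toy sentences are likewise hypothetical.

```python
from collections import defaultdict


def extract_candidates(sentences, seeds, min_support=2,
                       min_confidence=0.5, min_prevalence=2):
    """One bootstrapping pass: learn contexts from seed terms, then apply them.

    All three feature definitions below are assumptions for illustration;
    the abstract does not specify how SPOT computes them.
    """
    seeds = {s.lower() for s in seeds}

    # Map each (left word, right word) context to every word seen inside it.
    fillers = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            fillers[(tokens[i - 1], tokens[i + 1])].append(tokens[i])

    candidates = set()
    for context, words in fillers.items():
        seed_hits = [w for w in words if w in seeds]
        support = len(set(seed_hits))             # distinct seed terms in this context
        confidence = len(seed_hits) / len(words)  # share of fillers already in the dictionary
        prevalence = len(words)                   # total occurrences of the context
        if (support >= min_support
                and confidence >= min_confidence
                and prevalence >= min_prevalence):
            # Unknown words in a trusted context become dictionary candidates.
            candidates.update(w for w in words if w not in seeds)
    return candidates


# Toy corpus and seed dictionary (made up for this example).
demo_sentences = [
    "patient takes lipitor daily",
    "patient takes aspirin daily",
    "patient takes zocor daily",
]
demo_seeds = {"lipitor", "aspirin"}

print(sorted(extract_candidates(demo_sentences, demo_seeds)))  # ['zocor']
```

In this toy run the context ("takes", "daily") is seen with two seed terms, so the unknown word in the same slot is proposed as a candidate. Raising any threshold trades recall for precision, which is the kind of tradeoff the abstract refers to. The sketch handles only single-word terms and whitespace tokenization, and it performs a single pass rather than feeding accepted candidates back into the seed set.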


