A. Coden, D. Gruhl, Neal Lewis, M. Tanenblatt, J. Terdiman
{"title":"SPOT the Drug! An Unsupervised Pattern Matching Method to Extract Drug Names from Very Large Clinical Corpora","authors":"A. Coden, D. Gruhl, Neal Lewis, M. Tanenblatt, J. Terdiman","doi":"10.1109/HISB.2012.16","DOIUrl":null,"url":null,"abstract":"Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. “Understanding” these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is not only critical for understanding patient health (e.g., for drug-drug interactions or drug-enzyme interaction), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping these dictionaries up-to-date in the face of a rapidly advancing field-it is critical to identify grapefruit as a “drug” for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating not only accurate but also fast algorithms to identify pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a “drug” is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. Measured against a manually annotated reference corpus, we present precision and recall values for SPOT. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date and to also suggest words and phrases which may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text. SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary and automatically extracting the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.","PeriodicalId":375089,"journal":{"name":"2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HISB.2012.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 43
Abstract
Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. “Understanding” these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is not only critical for understanding patient health (e.g., for drug-drug interactions or drug-enzyme interaction), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping these dictionaries up-to-date in the face of a rapidly advancing field-it is critical to identify grapefruit as a “drug” for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating not only accurate but also fast algorithms to identify pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a “drug” is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. Measured against a manually annotated reference corpus, we present precision and recall values for SPOT. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date and to also suggest words and phrases which may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text. SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary and automatically extracting the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.