The Drug-Gene Interaction Database (DGIdb) aggregates drug-gene interaction (DGI) data from more than 40 sources into a single platform, with the primary goal of making the druggable genome accessible to clinicians and researchers. As a public, computationally accessible database, DGIdb enables therapeutic insights through broad aggregation of DGI data.
During aggregation, DGIdb preserves interaction types, directionality, and other attributes that enable filtering or provide biochemical insight. However, source data are often incomplete and may not contain the original physiological context of the interaction. Without this context, the therapeutic relevance of an interaction may be compromised or lost. In this report, we address these missing data by extracting therapeutic context from free-text sources. We apply existing large language models (LLMs) that have been fine-tuned on additional medical corpora to tag and extract indications, cancer types, and relevant pharmacogenomic information from free-text FDA-approved labels. We then use our in-house normalization services to link the extracted entities back to formally grouped concepts.
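The tag-then-normalize flow described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the fine-tuned LLM tagger is replaced by a toy lexicon lookup, the normalization service by a small dictionary, and every name and concept identifier below (`TOY_LEXICON`, `TOY_CONCEPTS`, `toy.gene:1`, and so on) is a hypothetical placeholder. The example sentence is illustrative and is not quoted from a label.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    text: str   # mention as it appears in the label
    label: str  # "chemical", "disease", or "gene"


# Hypothetical stand-in for a fine-tuned LLM tagger: a tiny lexicon.
TOY_LEXICON = {
    "vemurafenib": "chemical",
    "melanoma": "disease",
    "braf": "gene",
}

# Hypothetical stand-in for a normalization service.
# The concept IDs are placeholders, not real identifiers.
TOY_CONCEPTS = {
    ("chemical", "vemurafenib"): "toy.chem:1",
    ("gene", "braf"): "toy.gene:1",
    # "melanoma" is deliberately absent to show a failed normalization.
}


def tag_entities(label_text: str) -> List[Entity]:
    """Tag mentions by lexicon lookup (a real system would use a model)."""
    entities = []
    for token in re.findall(r"[A-Za-z0-9]+", label_text):
        label = TOY_LEXICON.get(token.lower())
        if label is not None:
            entities.append(Entity(text=token, label=label))
    return entities


def normalize(entity: Entity) -> Optional[str]:
    """Return a concept ID for the mention, or None if it cannot be linked."""
    return TOY_CONCEPTS.get((entity.label, entity.text.lower()))


example = "Vemurafenib is used for unresectable melanoma with a BRAF V600E mutation."
tagged = tag_entities(example)
linked = [(e.text, e.label, normalize(e)) for e in tagged]
```

Mentions that return `None` from `normalize` correspond to extracted entities that could not be linked to a harmonized concept.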
In a preliminary test set of 355 FDA labels, we were able to normalize 59.4%, 49.8%, and 49.1% of extracted chemical, disease, and genetic entities, respectively, back to harmonized concepts. Extracting these data allows us to supplement our existing interactions with context that may inform the therapeutic relevance of a particular interaction. Inclusion of these data will be particularly valuable for variant interpretation pipelines, where mutational status can lead to the identification of a lifesaving therapeutic and a positive patient outcome.
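Per-type normalization rates such as those reported above can be computed as the share of extracted mentions that resolve to a concept. The sketch below assumes each extraction is recorded as an `(entity_type, concept_id_or_None)` pair; the function name, record layout, and sample records are hypothetical and are not the study data.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def normalization_rates(
    records: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, float]:
    """Percentage of mentions per entity type that have a concept ID."""
    totals: Counter = Counter()
    normalized: Counter = Counter()
    for entity_type, concept_id in records:
        totals[entity_type] += 1
        if concept_id is not None:
            normalized[entity_type] += 1
    return {t: 100.0 * normalized[t] / totals[t] for t in totals}


# Made-up records for illustration only.
sample_records = [
    ("chemical", "toy.chem:1"),
    ("chemical", None),
    ("disease", None),
    ("disease", "toy.disease:1"),
    ("disease", None),
    ("disease", "toy.disease:2"),
    ("gene", "toy.gene:1"),
    ("gene", "toy.gene:2"),
]
rates = normalization_rates(sample_records)
```

Here the rate is counted per mention; counting per unique entity string instead would give different percentages, so the unit of counting should be stated alongside any reported figure.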