Title: Learning Effective Distributed Representation of Complex Biomedical Concepts
Authors: Khai Nguyen, R. Ichise
DOI: 10.1109/BIBE.2018.00073 (https://doi.org/10.1109/BIBE.2018.00073)
Venue: 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE)
Publication date: 2018-10-01
Platform: Semanticscholar
Citations: 1
Abstract
Word embeddings are the state-of-the-art representation for capturing the semantic information of terms. They benefit a wide range of natural language processing and related applications, not only in general fields of artificial intelligence but also in bioinformatics. Although recent efforts to use word embeddings to represent medical concepts have produced remarkable analyses, many essential problems remain unsolved. Examples include the representation of complex concepts (i.e., those formed by multiple tokens), the use of a large corpus to maximize the number of trainable concepts, and downstream analyses on biomedical-related datasets. Our study focused on training effective representations for biomedical concepts, including complex ones. We used an efficient technique to index all possible concepts of the UMLS (Unified Medical Language System) thesaurus in a huge corpus of 15.4 billion tokens. In this way, we obtained vector representations for more than 650,000 concepts, the largest such resource reported to date. Furthermore, evaluations of the trained vectors on a retrieval task show superior performance compared to recent studies.
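The abstract's key preprocessing idea, indexing multi-token UMLS concepts in a corpus so each complex concept can receive a single embedding, is commonly done by rewriting concept mentions as single identifier tokens before embedding training. The sketch below illustrates one such approach (greedy longest-match replacement); it is not the authors' exact method, and the mini concept dictionary and CUIs shown are purely hypothetical examples.

```python
def index_concepts(tokens, concept_index, max_len=5):
    """Greedy longest-match replacement: rewrite multi-token concept
    phrases as single concept IDs (e.g., UMLS CUIs) so that a standard
    word-embedding trainer treats each complex concept as one token."""
    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so "myocardial infarction"
        # wins over any shorter sub-phrase starting at the same token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in concept_index:
                match = (concept_index[phrase], n)
                break
        if match:
            out.append(match[0])
            i += match[1]
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical mini-dictionary mapping surface phrases to CUIs.
concepts = {"myocardial infarction": "C0027051", "aspirin": "C0004057"}
tokens = "patient had myocardial infarction treated with aspirin".split()
print(index_concepts(tokens, concepts))
# -> ['patient', 'had', 'C0027051', 'treated', 'with', 'C0004057']
```

After this rewriting pass, any off-the-shelf word-embedding trainer run on the indexed corpus yields one vector per concept identifier, which is how multi-token concepts can be represented without changing the embedding algorithm itself.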