Title: Learning Effective Distributed Representation of Complex Biomedical Concepts
Authors: Khai Nguyen, R. Ichise
DOI: 10.1109/BIBE.2018.00073 (https://doi.org/10.1109/BIBE.2018.00073)
Venue: 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE)
Publication date: 2018-10-01
Platform: Semanticscholar
Citations: 1
Abstract
Word embeddings are the state-of-the-art representation for capturing the semantic information of terms. They benefit a wide range of natural language processing and related applications, not only in general fields of artificial intelligence but also in bioinformatics. Although recent efforts to use word embeddings to represent medical concepts have produced remarkable analyses, many essential problems remain unsolved. Examples include the representation of complex concepts (i.e., those formed by multiple tokens), the use of a large corpus to maximize the number of trainable concepts, and downstream analyses on biomedical-related datasets. Our study focused on training effective representations for biomedical concepts, including complex ones. We used an efficient technique to index all possible concepts of the UMLS (Unified Medical Language System) thesaurus in a huge corpus of 15.4 billion tokens. In this way, we obtained vector representations for more than 650,000 concepts, the largest such resource reported to date. Furthermore, evaluations of the trained vectors on a retrieval task show superior performance compared to recent studies.
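The abstract's key preprocessing idea, indexing multi-token UMLS concepts in a corpus so each complex concept can receive a single embedding, is commonly done by rewriting concept mentions as single identifier tokens before embedding training. The sketch below illustrates one such approach (greedy longest-match replacement); it is not the authors' exact method, and the mini concept dictionary and CUIs shown are purely hypothetical examples.

```python
def index_concepts(tokens, concept_index, max_len=5):
    """Greedy longest-match replacement: rewrite multi-token concept
    phrases as single concept IDs (e.g., UMLS CUIs) so that a standard
    word-embedding trainer treats each complex concept as one token."""
    out = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so "myocardial infarction"
        # wins over any shorter sub-phrase starting at the same token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in concept_index:
                match = (concept_index[phrase], n)
                break
        if match:
            out.append(match[0])
            i += match[1]
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical mini-dictionary mapping surface phrases to CUIs.
concepts = {"myocardial infarction": "C0027051", "aspirin": "C0004057"}
tokens = "patient had myocardial infarction treated with aspirin".split()
print(index_concepts(tokens, concepts))
# -> ['patient', 'had', 'C0027051', 'treated', 'with', 'C0004057']
```

After this rewriting pass, any off-the-shelf word-embedding trainer run on the indexed corpus yields one vector per concept identifier, which is how multi-token concepts can be represented without changing the embedding algorithm itself.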