Automatic Labeling for Gene-Disease Associations through Distant Supervision
Fei Teng, Meng Bai, Tian-Jie Li
2019 IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), published 2019-11-01
DOI: 10.1109/ISKE47853.2019.9170268 (https://doi.org/10.1109/ISKE47853.2019.9170268)
Citations: 1
Abstract
Associating genes with diseases is a fundamental challenge in human health, with applications in understanding disease properties and developing precision medicine. Over the past decades, the number of biomedical articles has grown explosively, and these articles contain a great number of gene-disease associations (GDAs). Association extraction requires a highly accurate annotated corpus, but manual labeling is time-consuming and labor-intensive. This paper proposes a distant-supervision-based method for automatically labeling a corpus for GDA extraction. Compared with the manually annotated gold corpus, the automatically labeled corpus is much larger and of better quality. It improves the performance of state-of-the-art extraction models, which reach an AUC of 0.96 and an F1 of 90%. To the best of our knowledge, this is the first study of automatic GDA labeling in the field of precision medicine. Using the new corpora, we extracted GDAs from 115,261 PubMed abstracts about 29 lung cancers and discovered 296 new genes/proteins related to lung cancers. These findings indicate new directions for drug design.
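
The abstract does not describe the labeling procedure in detail, so the following is only a minimal sketch of the general distant-supervision idea, not the authors' method: a sentence that mentions both a gene and a disease is labeled positive when that pair already appears in a knowledge base, and negative otherwise. The knowledge base `KNOWN_GDAS`, the vocabularies `GENES` and `DISEASES`, and the helpers `find_mentions` and `label_sentence` are all hypothetical names with toy contents chosen for illustration.

```python
import re
from itertools import product

# Hypothetical toy knowledge base of known gene-disease pairs (illustrative only).
KNOWN_GDAS = {
    ("EGFR", "lung adenocarcinoma"),
    ("KRAS", "lung adenocarcinoma"),
}

# Hypothetical toy vocabularies; a real system would use an entity recognizer.
GENES = {"EGFR", "KRAS", "TP53"}
DISEASES = {"lung adenocarcinoma", "small cell lung cancer"}


def find_mentions(sentence, vocabulary):
    """Return vocabulary terms found in the sentence as whole words, ignoring case."""
    found = []
    for term in sorted(vocabulary):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def label_sentence(sentence):
    """Label every co-occurring (gene, disease) pair: 1 if in the knowledge base, else 0."""
    genes = find_mentions(sentence, GENES)
    diseases = find_mentions(sentence, DISEASES)
    return [
        (gene, disease, int((gene, disease) in KNOWN_GDAS))
        for gene, disease in product(genes, diseases)
    ]


if __name__ == "__main__":
    for text in [
        "EGFR mutations are frequent in lung adenocarcinoma.",
        "TP53 loss was observed in small cell lung cancer.",
    ]:
        print(text, label_sentence(text))
```

Labels produced this way are noisy, because co-occurrence in a sentence does not guarantee that the sentence actually asserts the association; practical distant-supervision pipelines usually add a noise-reduction step on top of this kind of matching.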