HEMnet: Integration of Electronic Medical Records with Molecular Interaction Networks and Domain Knowledge for Survival Analysis
Edward W. Huang, Sheng Wang, Bingxue Li, Ran Zhang, Baoyan Liu, Runshun Zhang, Jie Liu, Xuezhong Zhou, Hongsheng Lin, ChengXiang Zhai
Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Published: 2017-08-20
DOI: 10.1145/3107411.3107422
Citation count: 3
Abstract
The continual growth of electronic medical record (EMR) databases has paved the way for many data mining applications, including the discovery of novel disease-drug associations and the prediction of patient survival rates. However, these tasks are hindered because EMRs are usually segmented or incomplete. EMR analysis is further limited by the overabundance of medical term synonyms and morphologies, which causes existing techniques to mismatch records containing semantically similar but lexically distinct terms. Current solutions fill in missing values with techniques that tend to introduce noise rather than reduce it. In this paper, we propose to simultaneously infer missing data and solve semantic mismatching in EMRs by first integrating EMR data with molecular interaction networks and domain knowledge to build the HEMnet, a heterogeneous medical information network. We then project this network onto a low-dimensional space, and group entities in the network according to their relative distances. Lastly, we use this entity distance information to enrich the original EMRs. We evaluate the effectiveness of this method according to its ability to separate patients with dissimilar survival functions. We show that our method can obtain significant (p-value < 0.01) results for each cancer subtype in a lung cancer dataset, while the baselines cannot.
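The enrichment step described in the abstract can be illustrated with a minimal sketch. It assumes the network entities have already been projected into a low-dimensional space; the abstract does not specify the embedding method, so the 2-D vectors, the cosine similarity measure, the `enrich` helper, and the 0.95 threshold below are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math

# Hypothetical 2-D embeddings of medical terms (assumed, not from the paper).
# Lexically distinct but semantically similar terms sit close together.
EMBEDDINGS = {
    "cough": (0.90, 0.10),
    "coughing": (0.88, 0.15),
    "dyspnea": (0.70, 0.40),
    "nausea": (0.10, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(record, embeddings, threshold=0.95):
    """Return a copy of `record` with nearby unobserved entities filled in.

    Each entity missing from the record receives its highest similarity to
    any observed entity, but only if that similarity reaches `threshold`.
    """
    enriched = dict(record)
    observed = [e for e, v in record.items() if v > 0 and e in embeddings]
    for entity, vec in embeddings.items():
        if record.get(entity, 0) > 0:
            continue  # already observed in the original record
        best = max((cosine(vec, embeddings[o]) for o in observed), default=0.0)
        if best >= threshold:
            enriched[entity] = best
    return enriched


# Two patients whose records use different surface forms of the same symptom.
patient_a = {"cough": 1.0}
patient_b = {"coughing": 1.0}

enriched_a = enrich(patient_a, EMBEDDINGS)
enriched_b = enrich(patient_b, EMBEDDINGS)

# Terms shared by the two records before and after enrichment.
shared_before = set(patient_a) & set(patient_b)
shared_after = set(enriched_a) & set(enriched_b)
```

Before enrichment the two toy records share no terms, so a purely lexical comparison would treat the patients as unrelated. After enrichment both records contain "cough" and "coughing", while the more distant "dyspnea" and "nausea" stay out because they fall below the assumed threshold.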