The Diseases Clustering for Multi-source Medical Sets
Liangchi Li, Shuaijing Xu, Shenling Wang, Xianlin Ma
2016 International Conference on Identification, Information and Knowledge in the Internet of Things (IIKI)
Published: 2016-10-01
DOI: 10.1109/IIKI.2016.37 (https://doi.org/10.1109/IIKI.2016.37)
Citations: 1
Abstract
Medical databases have been built out to some degree, but data redundancy across medical sets significantly hampers searching across different sets. In this paper, we take three major domestic medical sets as the basis of the study. Natural language processing techniques are applied to segment the disease descriptions. We then use TF-IDF to weight the feature words in each disease description and build a disease feature vector. The similarity between disease feature vectors is measured by cosine similarity. Finally, we compare the k-means and k-center clustering algorithms on the alignment of disease texts. The experimental results show that the k-center clustering algorithm performs better than k-means, and that the clustering results are reasonable to some extent.
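The pipeline described in the abstract (segmented descriptions, TF-IDF weights, cosine similarity, clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The token lists are invented toy data that stand in for already segmented disease descriptions, the function names are hypothetical, and "k-center" is read here as a k-medoids-style procedure in which each cluster center is an actual document, which is an assumption about the paper's method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) from tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign(dist, medoids):
    """Label each document with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(vectors, initial_medoids, max_iter=20):
    """Alternate assignment and medoid update under cosine distance."""
    dist = [[1.0 - cosine(u, v) for v in vectors] for u in vectors]
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the previous medoid if a cluster ends up empty.
                new_medoids.append(old)
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(dist, medoids), medoids

# Toy, already segmented descriptions (hypothetical data, not from the paper).
docs = [
    ["fever", "cough", "sore", "throat"],
    ["cough", "fever", "runny", "nose"],
    ["chest", "pain", "shortness", "breath"],
    ["chest", "pain", "palpitations"],
]
vectors = tfidf_vectors(docs)
labels, medoids = k_medoids(vectors, initial_medoids=[0, 2])
print(labels)
```

Because the medoid is always one of the documents, this variant needs only pairwise distances, whereas k-means averages vectors into a centroid that may not correspond to any real disease description. Whether that is the reason for the performance difference reported in the abstract is not stated there.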