Qirui Zhang, Yonggang Xue, Huaying Zhou, Jinghua Tan
Title: Research on Medical Document Categorization
Venue: 2008 International Seminar on Future BioMedical Information Engineering
Published: 2008-12-18
DOI: 10.1109/FBIE.2008.83
Citations: 2
Abstract
Medical document categorization is the process of automatically assigning one or more predefined category labels to medical documents. Document indexing plays an important role in classification. This paper proposes an improved term-weighting method called tfidfie (term frequency, inverse document frequency, and inverted entropy). Compared with the tfidf (term frequency and inverse document frequency) function, tfidfie adds an information entropy factor, H, which represents the distribution of the training-set documents in which the term occurs. We then discuss the effect of the training set on medical document categorization: an imbalanced training set decreases classifier performance. Considering the characteristics of medical documents, we construct medical classifiers using the Naive Bayes and Rocchio methods, respectively. The experimental results show that tfidfie improves classification performance and that Naive Bayes outperforms Rocchio.
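The abstract does not give the tfidfie formula, so the following is only a minimal sketch of how an entropy factor might be combined with tf-idf. It assumes that H is the entropy of a term's document counts across categories, and that the factor takes the form 1 - H / log(number of categories). Both choices, along with the function name `tfidfie_weights` and the toy data, are illustrative assumptions and not the authors' definition.

```python
import math
from collections import Counter


def tfidfie_weights(docs, labels):
    """Return one {term: weight} dict per document.

    docs:   list of token lists
    labels: list of category labels, one per document
    The entropy factor used here is an ASSUMED form, not the paper's formula.
    """
    n_docs = len(docs)
    n_categories = len(set(labels))

    # df[t]: number of documents containing t
    # per_cat[t][c]: number of those documents that belong to category c
    df = Counter()
    per_cat = {}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            per_cat.setdefault(term, Counter())[label] += 1

    # Largest possible entropy over the categories (guard the one-category case).
    max_entropy = math.log(n_categories) if n_categories > 1 else 1.0

    def entropy_factor(term):
        counts = per_cat[term]
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        # Assumption: 1.0 when the term is confined to one category,
        # 0.0 when its documents are spread evenly over all categories.
        return 1.0 - h / max_entropy

    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / df[term]) * entropy_factor(term)
            for term, count in tf.items()
        })
    return weights


# Toy training set with hypothetical categories.
docs = [
    ["tumor", "patient"],
    ["tumor", "biopsy"],
    ["heart", "patient"],
    ["heart", "valve"],
]
labels = ["oncology", "oncology", "cardiology", "cardiology"]

for doc_weights in tfidfie_weights(docs, labels):
    print(doc_weights)
```

Under this assumed form, a term such as "patient" that appears equally often in both categories is weighted down to zero, while a term confined to a single category keeps its full tf-idf weight. Plain tf-idf would give "tumor" and "patient" the same weight here, since both occur in two of the four documents.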