Lexicon-based semi-CRF for Chinese clinical text word segmentation
Guoqing Xia, Yao Shen, Qian-Xiang Lin
2017 International Conference on Progress in Informatics and Computing (PIC), published 2017-12-01
DOI: 10.1109/PIC.2017.8359512 (https://doi.org/10.1109/PIC.2017.8359512)
Word segmentation is usually the foundation of text analysis, and its accuracy is vital to subsequent natural language processing (NLP) tasks. Although word segmentation for general text has been studied intensively and quite a few algorithms have been proposed, these algorithms do not work well in specialized fields such as clinical text analysis. In addition, most state-of-the-art methods have difficulty identifying out-of-vocabulary (OOV) words. For these two reasons, we propose a semi-supervised CRF (semi-CRF) algorithm for Chinese clinical text word segmentation. The semi-CRF is implemented by modifying the learning objective so that it adapts to partially labeled data. Training data are obtained by applying a bidirectional lexicon matching scheme. A modified Viterbi algorithm that uses the lexicon matching scheme is also proposed for segmenting raw sentences. Experiments show that our model achieves a precision of 93.88% on test data and outperforms two popular open-source Chinese word segmentation tools, HanLP and THULAC. Because it relies on a lexicon, the model can be adapted to word segmentation of text from other domains.
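The abstract does not spell out how bidirectional lexicon matching yields partially labeled training data, so the following is only a minimal sketch under stated assumptions. It assumes forward and backward maximum matching against the lexicon, keeps a word only where both directions agree on the same span and the span is a lexicon entry, and leaves every other character unlabeled (`None`). The B/M/E/S tag scheme, the agreement rule, and all function names (`forward_max_match`, `backward_max_match`, `partial_labels`) are illustrative choices, not taken from the paper.

```python
from typing import List, Optional, Set, Tuple


def forward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy left-to-right longest match; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text: str, lexicon: Set[str], max_len: int) -> List[str]:
    """Greedy right-to-left longest match; unmatched characters become single tokens."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Convert a token list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def partial_labels(text: str, lexicon: Set[str]) -> List[Optional[str]]:
    """Assumed rule: label only lexicon words on which both directions agree."""
    max_len = max((len(word) for word in lexicon), default=1)
    agreed = (token_spans(forward_max_match(text, lexicon, max_len))
              & token_spans(backward_max_match(text, lexicon, max_len)))
    labels: List[Optional[str]] = [None] * len(text)
    for start, end in agreed:
        if text[start:end] not in lexicon:
            continue  # single-character fallbacks are not trusted as words
        if end - start == 1:
            labels[start] = "S"
        else:
            labels[start] = "B"
            for k in range(start + 1, end - 1):
                labels[k] = "M"
            labels[end - 1] = "E"
    return labels


if __name__ == "__main__":
    toy_lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical toy lexicon
    print(partial_labels("研究生命起源", toy_lexicon))
    # -> [None, None, None, None, 'B', 'E']
```

In this toy case the two directions disagree on the first four characters (研究生/命 versus 研究/生命), so only 起源 receives labels. A CRF trained on partially labeled data would then treat the `None` positions as unobserved rather than as errors.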