{"title":"Research on High Performance Word Segmentation Technology Based on Distributed Character Tree and Unknown Word Recognition","authors":"Su Hang, Zhou Hanqing","doi":"10.1109/CCET55412.2022.9906347","journal":{"name":"2022 IEEE 5th International Conference on Computer and Communication Engineering Technology (CCET)","volume":"16 1","pages":"0"},"publicationDate":"2022-08-19","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"0","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/CCET55412.2022.9906347"}
Research on High Performance Word Segmentation Technology Based on Distributed Character Tree and Unknown Word Recognition
This paper summarizes the disadvantages of rule-based word segmentation and statistical word segmentation, and proposes a high-performance Chinese word segmentation algorithm based on a distributed character tree and unknown word recognition. The algorithm removes the dictionary dependence of rule-based segmentation and mitigates the high time complexity of statistical segmentation. Its main innovations are: in the preprocessing stage, defining the distributed character tree and creating the feature dictionary; in the word segmentation stage, defining the concept of word-formation skewness and proposing a judgment formula for unknown words. The experimental results show that the new method improves both accuracy and recall and is applicable in practice.
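
For orientation, the dictionary-dependence problem mentioned above can be seen in a plain character tree (trie) with forward maximum matching. The sketch below is only a generic baseline under stated assumptions: it is not the paper's distributed character tree, and it does not implement word-formation skewness or the unknown-word judgment formula, none of which are specified in the abstract. The names `TrieNode`, `build_trie`, and `segment`, and the toy dictionary, are hypothetical.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a character tree; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if the path from the root spells a dictionary word


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every dictionary word into a character tree and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text: str, root: TrieNode) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character
    tokens, which is where a pure dictionary method loses unknown words.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        node = root
        longest = 0  # length of the longest dictionary word starting at i
        j = i
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest = j - i
        step = longest if longest > 0 else 1
        tokens.append(text[i:i + step])
        i += step
    return tokens


# Toy dictionary (illustrative only, not taken from the paper).
trie = build_trie(["研究", "研究生", "生命", "起源"])
print(segment("研究生命起源", trie))  # ['研究生', '命', '起源']
print(segment("分词", trie))          # ['分', '词']
```

The first call shows greedy matching choosing 研究生 over 研究 + 生命, and the second shows an out-of-dictionary word falling apart into single characters. These are the kinds of failures that an unknown-word recognition step is meant to address.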