CFSE: a Chinese short text classification method based on character frequency sub-word enhancement
Authors: Xingguang Wang, Shunxiang Zhang, Zichen Ma, Yunduo Liu, Youqiang Zhang
Journal: Connection Science
Published: 2023-10-06
DOI: https://doi.org/10.1080/09540091.2023.2263663
As a foundational task in natural language processing, text classification is widely used in information retrieval, public opinion analysis, and related tasks. Chinese short texts have sparse features, which lowers classification accuracy. To address this problem, this paper proposes a Chinese short text classification method based on Character Frequency Sub-word Enhancement (CFSE). First, the initial Chinese-character sequence is mapped to the corresponding Character Frequency Sub-word (CFS) sequence based on global character frequency information. Second, relationship features among the data are extracted by processing the CFS sequence with BiLSTM-Att, and semantic features of the initial Chinese-character sequence are obtained through ERNIE. Finally, the two kinds of features are fused and fed into the text classifier to obtain the classification results. Experimental results show that the proposed method can improve the classification accuracy of Chinese short texts.
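The first step of the abstract, mapping characters to tokens derived from global character frequency, can be illustrated with a small sketch. The abstract does not say how a CFS is actually constructed, so everything here is an assumption made for illustration: the rank-based bucketing, the `F0`..`F3` token names, the `F_UNK` token, and the function names `build_cfs_table` and `to_cfs_sequence` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_cfs_table(corpus, num_buckets=4):
    """Build a character -> frequency-bucket token table from a corpus.

    ASSUMPTION: characters are ranked by global frequency and split into
    equal-sized rank buckets (F0 = most frequent). The paper may define
    its Character Frequency Sub-words differently.
    """
    # Global character counts over every text in the corpus.
    counts = Counter(ch for text in corpus for ch in text)
    # Characters ordered from most to least frequent.
    ranked = [ch for ch, _ in counts.most_common()]
    # Integer arithmetic keeps the bucket index in [0, num_buckets - 1].
    return {
        ch: f"F{rank * num_buckets // len(ranked)}"
        for rank, ch in enumerate(ranked)
    }


def to_cfs_sequence(text, table):
    """Map each character to its bucket token; unseen characters get F_UNK."""
    return [table.get(ch, "F_UNK") for ch in text]


corpus = ["今天天气好", "天气预报"]
table = build_cfs_table(corpus)
sequence = to_cfs_sequence("天猫", table)
```

In this sketch the output sequence has the same length as the input text, so it could in principle be fed to a sequence model alongside the original characters. How the paper's BiLSTM-Att and ERNIE branches consume and fuse these inputs is not described in the abstract and is not modelled here.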
About the journal:
Connection Science is an interdisciplinary journal dedicated to exploring the convergence of the analytic and synthetic sciences, including neuroscience, computational modelling, artificial intelligence, machine learning, deep learning, Database, Big Data, quantum computing, Blockchain, Zero-Knowledge, Internet of Things, Cybersecurity, and parallel and distributed computing.
A strong focus is on articles arising from connectionist, probabilistic, dynamical, or evolutionary approaches in aspects of computer science, applications, and systems-level computational subjects that seek to understand models in science and engineering.