{"title":"文本分类任务中的词嵌入和认知语言模型","authors":"A. Surkova, S. Skorynin, Igor Chernobaev","doi":"10.1145/3373722.3373778","DOIUrl":null,"url":null,"abstract":"The paper considers two linguistic models, analyzed the possibility of their use for the text data classification as well as their associations in the integrated texts presentation. A cognitive approach for the text classification issues is presented. An algorithm to identify the words basic level using WordNet is considered. A model for text classification based on the pre-trained word embeddings is presented. The model consists of three layers: embedding layer Long-Short Term Memory (LSTM) layer, and softmax layer. The model was trained and evaluated on the 20 Newsgroups dataset. The classification quality was assessed by F- measure, precision and recall. The obtained results analysis is carried out. Both described models show good results, low scores for some texts are explained. The advantages and limitations of the linguistic models are shown. In future works the authors are going to combine proposed models and modify them. Thus, for model based on word embedding there are pretty vast opportunities for extension: from experimenting with different word embeddings and various distance metrics to more complicated architecture of layers and even promising state of the art artificial neural network models, activation functions and their modifications. In addition, there is research area of proper ensemble strategy selection.","PeriodicalId":243162,"journal":{"name":"Proceedings of the XI International Scientific Conference Communicative Strategies of the Information Society","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Word embedding and cognitive linguistic models in text classification tasks\",\"authors\":\"A. Surkova, S. Skorynin, Igor Chernobaev\",\"doi\":\"10.1145/3373722.3373778\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The paper considers two linguistic models, analyzed the possibility of their use for the text data classification as well as their associations in the integrated texts presentation. A cognitive approach for the text classification issues is presented. An algorithm to identify the words basic level using WordNet is considered. A model for text classification based on the pre-trained word embeddings is presented. The model consists of three layers: embedding layer Long-Short Term Memory (LSTM) layer, and softmax layer. The model was trained and evaluated on the 20 Newsgroups dataset. The classification quality was assessed by F- measure, precision and recall. The obtained results analysis is carried out. Both described models show good results, low scores for some texts are explained. The advantages and limitations of the linguistic models are shown. In future works the authors are going to combine proposed models and modify them. Thus, for model based on word embedding there are pretty vast opportunities for extension: from experimenting with different word embeddings and various distance metrics to more complicated architecture of layers and even promising state of the art artificial neural network models, activation functions and their modifications. In addition, there is research area of proper ensemble strategy selection.\",\"PeriodicalId\":243162,\"journal\":{\"name\":\"Proceedings of the XI International Scientific Conference Communicative Strategies of the Information Society\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the XI International Scientific Conference Communicative Strategies of the Information Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3373722.3373778\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the XI International Scientific Conference Communicative Strategies of the Information Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3373722.3373778","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Word embedding and cognitive linguistic models in text classification tasks
The paper considers two linguistic models, analyzed the possibility of their use for the text data classification as well as their associations in the integrated texts presentation. A cognitive approach for the text classification issues is presented. An algorithm to identify the words basic level using WordNet is considered. A model for text classification based on the pre-trained word embeddings is presented. The model consists of three layers: embedding layer Long-Short Term Memory (LSTM) layer, and softmax layer. The model was trained and evaluated on the 20 Newsgroups dataset. The classification quality was assessed by F- measure, precision and recall. The obtained results analysis is carried out. Both described models show good results, low scores for some texts are explained. The advantages and limitations of the linguistic models are shown. In future works the authors are going to combine proposed models and modify them. Thus, for model based on word embedding there are pretty vast opportunities for extension: from experimenting with different word embeddings and various distance metrics to more complicated architecture of layers and even promising state of the art artificial neural network models, activation functions and their modifications. In addition, there is research area of proper ensemble strategy selection.