{"title":"The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model","authors":"Linda S. Khumalo, Georg I. Schltinz, Q. Williams","doi":"10.1109/ROBOMECH.2019.8704827","DOIUrl":null,"url":null,"abstract":"This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Due to a problem with vanishing gradients during LSTM RNNLM training, they cannot capture long-range context information. Long-range context information often encapsulates details about the text application domain. Word embedding vectors capture similarity and semantic context information in their structure. Additional context can be captured by socio-situational setting information, topics, named entities and parts-of-speech tags. This work uses a character LSTM RNNLM as a control in experiments to determine the effect different types of input text representation have on the perplexity of an LSTM RNNLM in terms of a percentage increase or decrease in the value. Adding socio- situational information to a character LSTM RNNLM results in a 0.1 % reduction in perplexity in comparison to that of the control model. When the character embeddings are swapped with word2vec embeddings a reduction in perplexity of 2.77 % is obtained. Adding context information such as socio-situational information to the word embedded model should also result in a perplexity reduction. However, this is not the case, as the addition of socio-situational information to a word embedded model results in a 5.79 % perplexity increase in comparison to the word2vec only model. This trend of an increase in perplexity is observed in further experiments where other types of context information are added to a word embedded model. The largest increase in perplexity is obtained when word embeddings and topics are applied giving a perplexity increase of 7.55 %. This increase in perplexity is due to the addition of more data (context information) to the input text. More data means more words (unique or otherwise) that are being concatenated together as a representation of the input. This results in a larger and sparser input that not only takes longer to train but has less useful information captured on average resulting in models with a higher perplexity. A better method of text representation that will reduce the size of the input while still capturing the necessary semantic information implicit in word embeddings will be adding the vectors together instead of concatenating them.","PeriodicalId":344332,"journal":{"name":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ROBOMECH.2019.8704827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Because of vanishing gradients during training, LSTM RNNLMs cannot capture long-range context information. Long-range context information often encapsulates details about the text application domain. Word embedding vectors capture similarity and semantic context information in their structure. Additional context can be captured by socio-situational setting information, topics, named entities and parts-of-speech tags. This work uses a character LSTM RNNLM as a control and measures the effect of different types of input text representation as a percentage increase or decrease in the model's perplexity. Adding socio-situational information to a character LSTM RNNLM results in a 0.1% reduction in perplexity compared to the control model. When the character embeddings are swapped for word2vec embeddings, perplexity falls by 2.77%. Adding context information such as socio-situational information to the word-embedded model should also reduce perplexity. However, this is not the case: adding socio-situational information to a word-embedded model increases perplexity by 5.79% relative to the word2vec-only model. This trend of increasing perplexity is observed in further experiments in which other types of context information are added to a word-embedded model. The largest increase, 7.55%, occurs when word embeddings and topics are combined. This increase in perplexity is due to the addition of more data (context information) to the input text. More data means more words (unique or otherwise) being concatenated together as the input representation. The result is a larger and sparser input that not only takes longer to train on but also captures less useful information on average, yielding models with higher perplexity. A better method of text representation, one that reduces the size of the input while still capturing the semantic information implicit in word embeddings, would be to add the vectors together instead of concatenating them.
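The distinction between concatenating and adding context vectors can be seen from the input dimensionality alone. The sketch below is illustrative only: the 300-dimensional word2vec vector, 50-dimensional context vector and learned projection matrix are assumed sizes, not the configuration used in the paper.

```python
import numpy as np

# Hypothetical dimensions: a 300-d word2vec embedding plus a 50-d context
# (e.g. topic or socio-situational) vector. Concatenation grows the LSTM
# input with every added context source; element-wise addition keeps the
# input size fixed, here after projecting the context vector to 300-d.
word_vec = np.random.rand(300)        # word2vec embedding (illustrative)
context_vec = np.random.rand(50)      # context-information vector (illustrative)
projection = np.random.rand(50, 300)  # stand-in for a learned projection

concatenated = np.concatenate([word_vec, context_vec])  # shape (350,)
summed = word_vec + context_vec @ projection            # shape (300,)

print(concatenated.shape, summed.shape)  # (350,) (300,)
```

Every further context source (named entities, parts-of-speech tags, topics) widens the concatenated input again, whereas summation keeps the model's input layer at a constant width.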
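For reference, the perplexity figures quoted above follow the standard definition: the exponential of the average negative log-likelihood the model assigns to the test text. A minimal sketch of that calculation and of the percentage comparison against a control model is given below; it is a generic formulation, not the paper's evaluation code.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities assigned by a
    language model; lower perplexity means the model predicts the text better."""
    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    return math.exp(avg_nll)

def relative_change(ppl_model, ppl_control):
    """Percentage increase (positive) or decrease (negative) in perplexity
    relative to a control model, as reported in the abstract."""
    return 100.0 * (ppl_model - ppl_control) / ppl_control
```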