{"title":"The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model","authors":"Linda S. Khumalo, Georg I. Schltinz, Q. Williams","doi":"10.1109/ROBOMECH.2019.8704827","DOIUrl":null,"url":null,"abstract":"This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Due to a problem with vanishing gradients during LSTM RNNLM training, they cannot capture long-range context information. Long-range context information often encapsulates details about the text application domain. Word embedding vectors capture similarity and semantic context information in their structure. Additional context can be captured by socio-situational setting information, topics, named entities and parts-of-speech tags. This work uses a character LSTM RNNLM as a control in experiments to determine the effect different types of input text representation have on the perplexity of an LSTM RNNLM in terms of a percentage increase or decrease in the value. Adding socio- situational information to a character LSTM RNNLM results in a 0.1 % reduction in perplexity in comparison to that of the control model. When the character embeddings are swapped with word2vec embeddings a reduction in perplexity of 2.77 % is obtained. Adding context information such as socio-situational information to the word embedded model should also result in a perplexity reduction. However, this is not the case, as the addition of socio-situational information to a word embedded model results in a 5.79 % perplexity increase in comparison to the word2vec only model. This trend of an increase in perplexity is observed in further experiments where other types of context information are added to a word embedded model. The largest increase in perplexity is obtained when word embeddings and topics are applied giving a perplexity increase of 7.55 %. This increase in perplexity is due to the addition of more data (context information) to the input text. More data means more words (unique or otherwise) that are being concatenated together as a representation of the input. This results in a larger and sparser input that not only takes longer to train but has less useful information captured on average resulting in models with a higher perplexity. A better method of text representation that will reduce the size of the input while still capturing the necessary semantic information implicit in word embeddings will be adding the vectors together instead of concatenating them.","PeriodicalId":344332,"journal":{"name":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ROBOMECH.2019.8704827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Because of vanishing gradients during training, LSTM RNNLMs cannot capture long-range context information. Long-range context information often encapsulates details about the text application domain. Word embedding vectors capture similarity and semantic context information in their structure. Additional context can be captured by socio-situational setting information, topics, named entities and parts-of-speech tags. This work uses a character LSTM RNNLM as a control and measures the effect of different types of input text representation as a percentage increase or decrease in the model's perplexity. Adding socio-situational information to a character LSTM RNNLM results in a 0.1% reduction in perplexity compared to the control model. When the character embeddings are swapped for word2vec embeddings, perplexity falls by 2.77%. Adding context information such as socio-situational information to the word-embedded model should also reduce perplexity. However, this is not the case: adding socio-situational information to a word-embedded model increases perplexity by 5.79% relative to the word2vec-only model. This trend of increasing perplexity is observed in further experiments in which other types of context information are added to a word-embedded model. The largest increase, 7.55%, occurs when word embeddings and topics are combined. This increase in perplexity is due to the addition of more data (context information) to the input text. More data means more words (unique or otherwise) being concatenated together as the input representation. The result is a larger and sparser input that not only takes longer to train on but also captures less useful information on average, yielding models with higher perplexity. A better method of text representation, one that reduces the size of the input while still capturing the semantic information implicit in word embeddings, would be to add the vectors together instead of concatenating them.
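The distinction between concatenating and adding context vectors can be seen from the input dimensionality alone. The sketch below is illustrative only: the 300-dimensional word2vec vector, 50-dimensional context vector and learned projection matrix are assumed sizes, not the configuration used in the paper.

```python
import numpy as np

# Hypothetical dimensions: a 300-d word2vec embedding plus a 50-d context
# (e.g. topic or socio-situational) vector. Concatenation grows the LSTM
# input with every added context source; element-wise addition keeps the
# input size fixed, here after projecting the context vector to 300-d.
word_vec = np.random.rand(300)        # word2vec embedding (illustrative)
context_vec = np.random.rand(50)      # context-information vector (illustrative)
projection = np.random.rand(50, 300)  # stand-in for a learned projection

concatenated = np.concatenate([word_vec, context_vec])  # shape (350,)
summed = word_vec + context_vec @ projection            # shape (300,)

print(concatenated.shape, summed.shape)  # (350,) (300,)
```

Every further context source (named entities, parts-of-speech tags, topics) widens the concatenated input again, whereas summation keeps the model's input layer at a constant width.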
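For reference, the perplexity figures quoted above follow the standard definition: the exponential of the average negative log-likelihood the model assigns to the test text. A minimal sketch of that calculation and of the percentage comparison against a control model is given below; it is a generic formulation, not the paper's evaluation code.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities assigned by a
    language model; lower perplexity means the model predicts the text better."""
    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    return math.exp(avg_nll)

def relative_change(ppl_model, ppl_control):
    """Percentage increase (positive) or decrease (negative) in perplexity
    relative to a control model, as reported in the abstract."""
    return 100.0 * (ppl_model - ppl_control) / ppl_control
```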