The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model

Linda S. Khumalo, Georg I. Schltinz, Q. Williams
{"title":"The effect of word embeddings and domain specific long-range contextual information on a Recurrent Neural Network Language Model","authors":"Linda S. Khumalo, Georg I. Schltinz, Q. Williams","doi":"10.1109/ROBOMECH.2019.8704827","DOIUrl":null,"url":null,"abstract":"This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Due to a problem with vanishing gradients during LSTM RNNLM training, they cannot capture long-range context information. Long-range context information often encapsulates details about the text application domain. Word embedding vectors capture similarity and semantic context information in their structure. Additional context can be captured by socio-situational setting information, topics, named entities and parts-of-speech tags. This work uses a character LSTM RNNLM as a control in experiments to determine the effect different types of input text representation have on the perplexity of an LSTM RNNLM in terms of a percentage increase or decrease in the value. Adding socio- situational information to a character LSTM RNNLM results in a 0.1 % reduction in perplexity in comparison to that of the control model. When the character embeddings are swapped with word2vec embeddings a reduction in perplexity of 2.77 % is obtained. Adding context information such as socio-situational information to the word embedded model should also result in a perplexity reduction. However, this is not the case, as the addition of socio-situational information to a word embedded model results in a 5.79 % perplexity increase in comparison to the word2vec only model. This trend of an increase in perplexity is observed in further experiments where other types of context information are added to a word embedded model. The largest increase in perplexity is obtained when word embeddings and topics are applied giving a perplexity increase of 7.55 %. This increase in perplexity is due to the addition of more data (context information) to the input text. More data means more words (unique or otherwise) that are being concatenated together as a representation of the input. This results in a larger and sparser input that not only takes longer to train but has less useful information captured on average resulting in models with a higher perplexity. A better method of text representation that will reduce the size of the input while still capturing the necessary semantic information implicit in word embeddings will be adding the vectors together instead of concatenating them.","PeriodicalId":344332,"journal":{"name":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ROBOMECH.2019.8704827","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This work explores the effect of input text representation on the performance of a Long Short-Term Memory Recurrent Neural Network Language Model (LSTM RNNLM). Because of vanishing gradients during training, LSTM RNNLMs cannot capture long-range context information, which often encapsulates details about the application domain of the text. Word embedding vectors capture similarity and semantic context information in their structure, and additional context can be captured through socio-situational setting information, topics, named entities and part-of-speech tags. This work uses a character LSTM RNNLM as a control in experiments that determine the effect different types of input text representation have on the perplexity of an LSTM RNNLM, measured as a percentage increase or decrease relative to the control. Adding socio-situational information to the character LSTM RNNLM reduces perplexity by 0.1% compared with the control model. Replacing the character embeddings with word2vec embeddings reduces perplexity by 2.77%. Adding context information such as socio-situational information to the word-embedded model should likewise reduce perplexity; however, it instead increases perplexity by 5.79% compared with the word2vec-only model. The same upward trend appears in further experiments in which other types of context information are added to a word-embedded model, with the largest increase, 7.55%, obtained when word embeddings are combined with topics. The increase arises from adding more data (context information) to the input text: more data means more words, unique or otherwise, concatenated together as the input representation. The result is a larger, sparser input that not only takes longer to train on but also captures less useful information on average, yielding models with higher perplexity. A better text representation, one that would reduce the size of the input while still capturing the semantic information implicit in word embeddings, would be to add the vectors together instead of concatenating them.
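As a rough illustration of the input-representation choice discussed above, the following minimal sketch (Python with NumPy; the embedding size, vector names and per-token probabilities are illustrative assumptions, not values or code from the paper) contrasts concatenating a word2vec vector with a context vector against adding them, and shows how perplexity follows from per-token probabilities.

```python
import numpy as np

# Illustrative sketch only, not the authors' implementation.
EMB_DIM = 100                                   # assumed word2vec embedding size
rng = np.random.default_rng(0)

word_vec = rng.normal(size=EMB_DIM)             # word2vec embedding for one token
topic_vec = rng.normal(size=EMB_DIM)            # e.g. topic / socio-situational context vector

# 1) Concatenation: the input grows with every added context source,
#    giving the larger, sparser representation described in the abstract.
concat_input = np.concatenate([word_vec, topic_vec])

# 2) Addition: the proposed alternative keeps the input at the original
#    embedding size while still mixing in the context signal.
sum_input = word_vec + topic_vec

print(concat_input.shape, sum_input.shape)      # (200,) (100,)

# Perplexity, the metric compared throughout the paper, is the exponential
# of the mean negative log-likelihood the model assigns to the text.
token_probs = np.array([0.21, 0.05, 0.33, 0.12])  # hypothetical per-token probabilities
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(round(perplexity, 2))
```

Under this sketch, summing keeps the LSTM input at the original embedding size no matter how many context sources are mixed in, which is the reduction in input size that the abstract argues for.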