Norm of Mean Contextualized Embeddings Determines their Variance

Hiroaki Yamagiwa, Hidetoshi Shimodaira
{"title":"均值上下文嵌入的规范决定其方差","authors":"Hiroaki Yamagiwa, Hidetoshi Shimodaira","doi":"arxiv-2409.11253","DOIUrl":null,"url":null,"abstract":"Contextualized embeddings vary by context, even for the same token, and form\na distribution in the embedding space. To analyze this distribution, we focus\non the norm of the mean embedding and the variance of the embeddings. In this\nstudy, we first demonstrate that these values follow the well-known formula for\nvariance in statistics and provide an efficient sequential computation method.\nThen, by observing embeddings from intermediate layers of several Transformer\nmodels, we found a strong trade-off relationship between the norm and the\nvariance: as the mean embedding becomes closer to the origin, the variance\nincreases. This trade-off is likely influenced by the layer normalization\nmechanism used in Transformer models. Furthermore, when the sets of token\nembeddings are treated as clusters, we show that the variance of the entire\nembedding set can theoretically be decomposed into the within-cluster variance\nand the between-cluster variance. We found experimentally that as the layers of\nTransformer models deepen, the embeddings move farther from the origin, the\nbetween-cluster variance relatively decreases, and the within-cluster variance\nrelatively increases. These results are consistent with existing studies on the\nanisotropy of the embedding spaces across layers.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Norm of Mean Contextualized Embeddings Determines their Variance\",\"authors\":\"Hiroaki Yamagiwa, Hidetoshi Shimodaira\",\"doi\":\"arxiv-2409.11253\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Contextualized embeddings vary by context, even for the same token, and form\\na distribution in the embedding space. To analyze this distribution, we focus\\non the norm of the mean embedding and the variance of the embeddings. In this\\nstudy, we first demonstrate that these values follow the well-known formula for\\nvariance in statistics and provide an efficient sequential computation method.\\nThen, by observing embeddings from intermediate layers of several Transformer\\nmodels, we found a strong trade-off relationship between the norm and the\\nvariance: as the mean embedding becomes closer to the origin, the variance\\nincreases. This trade-off is likely influenced by the layer normalization\\nmechanism used in Transformer models. Furthermore, when the sets of token\\nembeddings are treated as clusters, we show that the variance of the entire\\nembedding set can theoretically be decomposed into the within-cluster variance\\nand the between-cluster variance. We found experimentally that as the layers of\\nTransformer models deepen, the embeddings move farther from the origin, the\\nbetween-cluster variance relatively decreases, and the within-cluster variance\\nrelatively increases. 
These results are consistent with existing studies on the\\nanisotropy of the embedding spaces across layers.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11253\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
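The "well-known formula for variance" referred to in the abstract is the standard identity (1/n) Σ_i ||x_i − x̄||² = (1/n) Σ_i ||x_i||² − ||x̄||², which directly links the norm of the mean embedding to the variance of the embeddings. The sketch below is an illustrative reading of that identity, not code from the paper (the function name and data are placeholders); it shows how both quantities can be accumulated in a single sequential pass over a stream of embeddings.

    import numpy as np

    def sequential_norm_and_variance(embeddings):
        """Return (norm of mean, variance) from an iterable of d-dimensional vectors."""
        n = 0
        running_sum = None       # accumulates sum_i x_i
        running_sq_norm = 0.0    # accumulates sum_i ||x_i||^2
        for x in embeddings:
            x = np.asarray(x, dtype=np.float64)
            running_sum = x.copy() if running_sum is None else running_sum + x
            running_sq_norm += float(x @ x)
            n += 1
        mean = running_sum / n
        # (1/n) sum_i ||x_i - mean||^2  =  (1/n) sum_i ||x_i||^2  -  ||mean||^2
        variance = running_sq_norm / n - float(mean @ mean)
        return float(np.linalg.norm(mean)), variance

    # Sanity check against the direct definition on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))   # stand-in for 1000 contextualized embeddings
    norm_of_mean, var_sequential = sequential_norm_and_variance(X)
    var_direct = float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))
    assert np.isclose(var_sequential, var_direct)

Under this identity, if the average squared norm (1/n) Σ_i ||x_i||² is roughly constant across contexts (as layer normalization tends to enforce), a mean embedding closer to the origin must be accompanied by a larger variance, which is the trade-off the abstract describes.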
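The within-cluster / between-cluster decomposition mentioned in the abstract is the usual law-of-total-variance split: treating each token's set of contextualized embeddings as one cluster, total variance = within-cluster variance + between-cluster variance. The following is a minimal numerical sketch of that decomposition on synthetic clusters (an illustration under these assumptions, not the authors' implementation).

    import numpy as np

    def variance_decomposition(clusters):
        """clusters: list of (n_c, d) arrays, one per token type.
        Returns (total, within, between); total == within + between."""
        all_x = np.concatenate(clusters, axis=0)
        N = all_x.shape[0]
        grand_mean = all_x.mean(axis=0)
        # total:   (1/N) sum_i ||x_i - m||^2
        total = float(np.mean(np.sum((all_x - grand_mean) ** 2, axis=1)))
        # within:  (1/N) sum_c sum_{i in c} ||x_i - m_c||^2
        within = sum(float(np.sum((c - c.mean(axis=0)) ** 2)) for c in clusters) / N
        # between: (1/N) sum_c n_c ||m_c - m||^2
        between = sum(len(c) * float(np.sum((c.mean(axis=0) - grand_mean) ** 2))
                      for c in clusters) / N
        return total, within, between

    rng = np.random.default_rng(0)
    # Ten synthetic "token" clusters with different centers and sizes.
    clusters = [rng.normal(loc=rng.normal(size=16), scale=1.0,
                           size=(int(rng.integers(50, 200)), 16))
                for _ in range(10)]
    total, within, between = variance_decomposition(clusters)
    assert np.isclose(total, within + between)
    print(f"total={total:.3f} within={within:.3f} between={between:.3f}")

The paper's layer-wise observation can be read in these terms: in deeper layers the within-cluster share of the total variance grows while the between-cluster share shrinks.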