Norm of Mean Contextualized Embeddings Determines their Variance

Hiroaki Yamagiwa, Hidetoshi Shimodaira
{"title":"均值上下文嵌入的规范决定其方差","authors":"Hiroaki Yamagiwa, Hidetoshi Shimodaira","doi":"arxiv-2409.11253","DOIUrl":null,"url":null,"abstract":"Contextualized embeddings vary by context, even for the same token, and form\na distribution in the embedding space. To analyze this distribution, we focus\non the norm of the mean embedding and the variance of the embeddings. In this\nstudy, we first demonstrate that these values follow the well-known formula for\nvariance in statistics and provide an efficient sequential computation method.\nThen, by observing embeddings from intermediate layers of several Transformer\nmodels, we found a strong trade-off relationship between the norm and the\nvariance: as the mean embedding becomes closer to the origin, the variance\nincreases. This trade-off is likely influenced by the layer normalization\nmechanism used in Transformer models. Furthermore, when the sets of token\nembeddings are treated as clusters, we show that the variance of the entire\nembedding set can theoretically be decomposed into the within-cluster variance\nand the between-cluster variance. We found experimentally that as the layers of\nTransformer models deepen, the embeddings move farther from the origin, the\nbetween-cluster variance relatively decreases, and the within-cluster variance\nrelatively increases. These results are consistent with existing studies on the\nanisotropy of the embedding spaces across layers.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Norm of Mean Contextualized Embeddings Determines their Variance\",\"authors\":\"Hiroaki Yamagiwa, Hidetoshi Shimodaira\",\"doi\":\"arxiv-2409.11253\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Contextualized embeddings vary by context, even for the same token, and form\\na distribution in the embedding space. To analyze this distribution, we focus\\non the norm of the mean embedding and the variance of the embeddings. In this\\nstudy, we first demonstrate that these values follow the well-known formula for\\nvariance in statistics and provide an efficient sequential computation method.\\nThen, by observing embeddings from intermediate layers of several Transformer\\nmodels, we found a strong trade-off relationship between the norm and the\\nvariance: as the mean embedding becomes closer to the origin, the variance\\nincreases. This trade-off is likely influenced by the layer normalization\\nmechanism used in Transformer models. Furthermore, when the sets of token\\nembeddings are treated as clusters, we show that the variance of the entire\\nembedding set can theoretically be decomposed into the within-cluster variance\\nand the between-cluster variance. We found experimentally that as the layers of\\nTransformer models deepen, the embeddings move farther from the origin, the\\nbetween-cluster variance relatively decreases, and the within-cluster variance\\nrelatively increases. 
These results are consistent with existing studies on the\\nanisotropy of the embedding spaces across layers.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11253\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
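The "well-known formula for variance" referred to in the abstract is the standard identity (1/n) Σ_i ||x_i − x̄||² = (1/n) Σ_i ||x_i||² − ||x̄||², which directly links the norm of the mean embedding to the variance of the embeddings. The sketch below is an illustrative reading of that identity, not code from the paper (the function name and data are placeholders); it shows how both quantities can be accumulated in a single sequential pass over a stream of embeddings.

    import numpy as np

    def sequential_norm_and_variance(embeddings):
        """Return (norm of mean, variance) from an iterable of d-dimensional vectors."""
        n = 0
        running_sum = None       # accumulates sum_i x_i
        running_sq_norm = 0.0    # accumulates sum_i ||x_i||^2
        for x in embeddings:
            x = np.asarray(x, dtype=np.float64)
            running_sum = x.copy() if running_sum is None else running_sum + x
            running_sq_norm += float(x @ x)
            n += 1
        mean = running_sum / n
        # (1/n) sum_i ||x_i - mean||^2  =  (1/n) sum_i ||x_i||^2  -  ||mean||^2
        variance = running_sq_norm / n - float(mean @ mean)
        return float(np.linalg.norm(mean)), variance

    # Sanity check against the direct definition on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))   # stand-in for 1000 contextualized embeddings
    norm_of_mean, var_sequential = sequential_norm_and_variance(X)
    var_direct = float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))
    assert np.isclose(var_sequential, var_direct)

Under this identity, if the average squared norm (1/n) Σ_i ||x_i||² is roughly constant across contexts (as layer normalization tends to enforce), a mean embedding closer to the origin must be accompanied by a larger variance, which is the trade-off the abstract describes.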
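The within-cluster / between-cluster decomposition mentioned in the abstract is the usual law-of-total-variance split: treating each token's set of contextualized embeddings as one cluster, total variance = within-cluster variance + between-cluster variance. The following is a minimal numerical sketch of that decomposition on synthetic clusters (an illustration under these assumptions, not the authors' implementation).

    import numpy as np

    def variance_decomposition(clusters):
        """clusters: list of (n_c, d) arrays, one per token type.
        Returns (total, within, between); total == within + between."""
        all_x = np.concatenate(clusters, axis=0)
        N = all_x.shape[0]
        grand_mean = all_x.mean(axis=0)
        # total:   (1/N) sum_i ||x_i - m||^2
        total = float(np.mean(np.sum((all_x - grand_mean) ** 2, axis=1)))
        # within:  (1/N) sum_c sum_{i in c} ||x_i - m_c||^2
        within = sum(float(np.sum((c - c.mean(axis=0)) ** 2)) for c in clusters) / N
        # between: (1/N) sum_c n_c ||m_c - m||^2
        between = sum(len(c) * float(np.sum((c.mean(axis=0) - grand_mean) ** 2))
                      for c in clusters) / N
        return total, within, between

    rng = np.random.default_rng(0)
    # Ten synthetic "token" clusters with different centers and sizes.
    clusters = [rng.normal(loc=rng.normal(size=16), scale=1.0,
                           size=(int(rng.integers(50, 200)), 16))
                for _ in range(10)]
    total, within, between = variance_decomposition(clusters)
    assert np.isclose(total, within + between)
    print(f"total={total:.3f} within={within:.3f} between={between:.3f}")

The paper's layer-wise observation can be read in these terms: in deeper layers the within-cluster share of the total variance grows while the between-cluster share shrinks.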