{"title":"TexShape:语言模型的信息论句子嵌入","authors":"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath","doi":"arxiv-2402.05132","DOIUrl":null,"url":null,"abstract":"With the exponential growth in data volume and the emergence of\ndata-intensive applications, particularly in the field of machine learning,\nconcerns related to resource utilization, privacy, and fairness have become\nparamount. This paper focuses on the textual domain of data and addresses\nchallenges regarding encoding sentences to their optimized representations\nthrough the lens of information-theory. In particular, we use empirical\nestimates of mutual information, using the Donsker-Varadhan definition of\nKullback-Leibler divergence. Our approach leverages this estimation to train an\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\ndata compression or for filtering out sensitive information, enhancing privacy\nand fairness. In this study, we employ a benchmark language model for initial\ntext representation, complemented by neural networks for information-theoretic\ncompression and mutual information estimations. Our experiments demonstrate\nsignificant advancements in preserving maximal targeted information and minimal\nsensitive information over adverse compression ratios, in terms of predictive\naccuracy of downstream models that are trained using the compressed data.","PeriodicalId":501433,"journal":{"name":"arXiv - CS - Information Theory","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TexShape: Information Theoretic Sentence Embedding for Language Models\",\"authors\":\"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath\",\"doi\":\"arxiv-2402.05132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the exponential growth in data volume and the emergence of\\ndata-intensive applications, particularly in the field of machine learning,\\nconcerns related to resource utilization, privacy, and fairness have become\\nparamount. This paper focuses on the textual domain of data and addresses\\nchallenges regarding encoding sentences to their optimized representations\\nthrough the lens of information-theory. In particular, we use empirical\\nestimates of mutual information, using the Donsker-Varadhan definition of\\nKullback-Leibler divergence. Our approach leverages this estimation to train an\\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\\ndata compression or for filtering out sensitive information, enhancing privacy\\nand fairness. In this study, we employ a benchmark language model for initial\\ntext representation, complemented by neural networks for information-theoretic\\ncompression and mutual information estimations. 
Our experiments demonstrate\\nsignificant advancements in preserving maximal targeted information and minimal\\nsensitive information over adverse compression ratios, in terms of predictive\\naccuracy of downstream models that are trained using the compressed data.\",\"PeriodicalId\":501433,\"journal\":{\"name\":\"arXiv - CS - Information Theory\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Information Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2402.05132\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2402.05132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TexShape: Information Theoretic Sentence Embedding for Language Models
With the exponential growth in data volume and the emergence of data-intensive applications, particularly in machine learning, concerns about resource utilization, privacy, and fairness have become paramount. This paper focuses on textual data and addresses the challenge of encoding sentences into optimized representations through the lens of information theory. In particular, we compute empirical estimates of mutual information using the Donsker-Varadhan representation of the Kullback-Leibler divergence.
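To make the estimation step concrete, the following is a minimal sketch of a MINE-style neural estimator that maximizes the Donsker-Varadhan lower bound on mutual information. The critic architecture, dimensions, learning rate, and stand-in data are illustrative assumptions written in PyTorch, not the paper's exact configuration.

# Minimal Donsker-Varadhan (MINE-style) mutual information estimator.
# All sizes and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class MINEstimator(nn.Module):
    """Critic T(x, y) whose Donsker-Varadhan bound lower-bounds I(X; Y)."""
    def __init__(self, dim_x, dim_y, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def dv_lower_bound(critic, x, y):
    """Donsker-Varadhan bound: E_p(x,y)[T] - log E_p(x)p(y)[exp(T)]."""
    joint = critic(x, y).mean()                    # samples from the joint p(x, y)
    y_shuffled = y[torch.randperm(y.size(0))]      # break pairing -> product of marginals
    marginal = torch.logsumexp(critic(x, y_shuffled), dim=0) \
        - torch.log(torch.tensor(float(x.size(0))))
    return joint - marginal

# Training loop: gradient ascent on the bound w.r.t. the critic parameters.
x = torch.randn(512, 768)      # stand-in for sentence embeddings
y = torch.randn(512, 10)       # stand-in for a target-attribute encoding
critic = MINEstimator(768, 10)
opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
for step in range(200):
    opt.zero_grad()
    loss = -dv_lower_bound(critic, x, y)
    loss.backward()
    opt.step()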
Our approach leverages this estimate to train an information-theoretic sentence embedding, called TexShape, for (task-based) data compression or for filtering out sensitive information, thereby enhancing privacy and fairness. In this study, we employ a benchmark language model for the initial text representation, complemented by neural networks for information-theoretic compression and mutual information estimation.
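As an illustration of how such estimators could drive the embedding itself, the sketch below trains a small compression encoder on top of frozen language-model embeddings to maximize estimated mutual information with a targeted label while penalizing estimated mutual information with a sensitive attribute. The encoder architecture, trade-off weight, alternating update schedule, and stand-in data are assumptions for illustration, not the authors' exact recipe; it reuses MINEstimator and dv_lower_bound from the sketch above.

# Hedged sketch of an information-theoretic compression encoder trained with
# MI estimates; all architectural and training choices are assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(            # compresses a 768-d LM embedding to 64 dims
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 64),
)
mi_target = MINEstimator(64, 10)      # critic for I(z; targeted label)
mi_sensitive = MINEstimator(64, 2)    # critic for I(z; sensitive attribute)

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-4)
opt_crit = torch.optim.Adam(
    list(mi_target.parameters()) + list(mi_sensitive.parameters()), lr=1e-4)

lam = 1.0                              # privacy/utility trade-off weight (assumed)
lm_emb = torch.randn(512, 768)         # stand-in for frozen LM sentence embeddings
y_task = torch.randn(512, 10)          # stand-in targeted-task label encoding
y_priv = torch.randn(512, 2)           # stand-in sensitive-attribute encoding

for step in range(200):
    # 1) tighten both MI estimates for the current encoder
    z = encoder(lm_emb).detach()
    crit_loss = -(dv_lower_bound(mi_target, z, y_task)
                  + dv_lower_bound(mi_sensitive, z, y_priv))
    opt_crit.zero_grad(); crit_loss.backward(); opt_crit.step()

    # 2) update the encoder: keep task information, discard sensitive information
    z = encoder(lm_emb)
    enc_loss = -dv_lower_bound(mi_target, z, y_task) \
               + lam * dv_lower_bound(mi_sensitive, z, y_priv)
    opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()

Downstream utility of such a compressed embedding would then be gauged by training task classifiers on it, in line with the evaluation described next.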
Our experiments demonstrate significant gains in preserving maximal targeted information while retaining minimal sensitive information at unfavorable compression ratios, as measured by the predictive accuracy of downstream models trained on the compressed data.