TexShape: Information Theoretic Sentence Embedding for Language Models

H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath
{"title":"TexShape:语言模型的信息论句子嵌入","authors":"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath","doi":"arxiv-2402.05132","DOIUrl":null,"url":null,"abstract":"With the exponential growth in data volume and the emergence of\ndata-intensive applications, particularly in the field of machine learning,\nconcerns related to resource utilization, privacy, and fairness have become\nparamount. This paper focuses on the textual domain of data and addresses\nchallenges regarding encoding sentences to their optimized representations\nthrough the lens of information-theory. In particular, we use empirical\nestimates of mutual information, using the Donsker-Varadhan definition of\nKullback-Leibler divergence. Our approach leverages this estimation to train an\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\ndata compression or for filtering out sensitive information, enhancing privacy\nand fairness. In this study, we employ a benchmark language model for initial\ntext representation, complemented by neural networks for information-theoretic\ncompression and mutual information estimations. Our experiments demonstrate\nsignificant advancements in preserving maximal targeted information and minimal\nsensitive information over adverse compression ratios, in terms of predictive\naccuracy of downstream models that are trained using the compressed data.","PeriodicalId":501433,"journal":{"name":"arXiv - CS - Information Theory","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TexShape: Information Theoretic Sentence Embedding for Language Models\",\"authors\":\"H. Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath\",\"doi\":\"arxiv-2402.05132\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the exponential growth in data volume and the emergence of\\ndata-intensive applications, particularly in the field of machine learning,\\nconcerns related to resource utilization, privacy, and fairness have become\\nparamount. This paper focuses on the textual domain of data and addresses\\nchallenges regarding encoding sentences to their optimized representations\\nthrough the lens of information-theory. In particular, we use empirical\\nestimates of mutual information, using the Donsker-Varadhan definition of\\nKullback-Leibler divergence. Our approach leverages this estimation to train an\\ninformation-theoretic sentence embedding, called TexShape, for (task-based)\\ndata compression or for filtering out sensitive information, enhancing privacy\\nand fairness. In this study, we employ a benchmark language model for initial\\ntext representation, complemented by neural networks for information-theoretic\\ncompression and mutual information estimations. 
Our experiments demonstrate\\nsignificant advancements in preserving maximal targeted information and minimal\\nsensitive information over adverse compression ratios, in terms of predictive\\naccuracy of downstream models that are trained using the compressed data.\",\"PeriodicalId\":501433,\"journal\":{\"name\":\"arXiv - CS - Information Theory\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Information Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2402.05132\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2402.05132","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

With the exponential growth in data volume and the emergence of data-intensive applications, particularly in the field of machine learning, concerns related to resource utilization, privacy, and fairness have become paramount. This paper focuses on the textual domain of data and addresses challenges regarding encoding sentences to their optimized representations through the lens of information theory. In particular, we use empirical estimates of mutual information, based on the Donsker-Varadhan representation of the Kullback-Leibler divergence. Our approach leverages this estimation to train an information-theoretic sentence embedding, called TexShape, for (task-based) data compression or for filtering out sensitive information, enhancing privacy and fairness. In this study, we employ a benchmark language model for initial text representation, complemented by neural networks for information-theoretic compression and mutual information estimation. Our experiments demonstrate significant advancements in preserving maximal targeted information and minimal sensitive information under adverse compression ratios, in terms of the predictive accuracy of downstream models trained on the compressed data.
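The Donsker-Varadhan (DV) representation underpins the mutual information (MI) estimates the abstract refers to: for any critic function T, I(X; Z) >= E_{p(x,z)}[T(x,z)] - log E_{p(x)p(z)}[exp(T(x,z))], with equality at the supremum over T. In MINE-style estimators, T is a small neural network trained to maximize this bound. The sketch below is a minimal, illustrative PyTorch implementation of such an estimator, not the authors' released code; the critic architecture and hidden width are assumptions.

```python
import math
import torch
import torch.nn as nn

class DVEstimator(nn.Module):
    """Critic network T(x, z) for the Donsker-Varadhan lower bound
    I(X; Z) >= E_{p(x,z)}[T] - log E_{p(x)p(z)}[exp(T)]."""

    def __init__(self, x_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        # Score a batch of (x, z) pairs; output shape (batch,).
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_mi_lower_bound(critic, x, z):
    """Monte Carlo estimate of the DV bound from one minibatch.

    Joint term: score the (x_i, z_i) pairs as drawn.
    Marginal term: shuffle z across the batch to break the pairing,
    then take a numerically stable log-mean-exp of the scores.
    """
    joint = critic(x, z).mean()
    z_marginal = z[torch.randperm(z.size(0))]
    log_mean_exp = (torch.logsumexp(critic(x, z_marginal), dim=0)
                    - math.log(x.size(0)))
    return joint - log_mean_exp
```

Maximizing this bound over the critic's parameters tightens the MI estimate; because the bound is differentiable, the same estimator can also serve as a training signal for an encoder, as sketched next.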
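Given such an estimator, a TexShape-style encoder can be trained on top of frozen language-model embeddings to maximize MI with a useful (task) label while minimizing MI with a sensitive label. The following sketch alternates critic and encoder updates under that objective, reusing DVEstimator and dv_mi_lower_bound from the previous block. The synthetic data, the 768-to-64 compression, the binary labels, and the trade-off weight BETA are hypothetical stand-ins for illustration, not the paper's configuration.

```python
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

class Encoder(nn.Module):
    """Compresses a frozen sentence embedding (e.g., a 768-d BERT
    [CLS] vector) to a low-dimensional representation."""

    def __init__(self, in_dim=768, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, s):
        return self.net(s)

# Synthetic stand-ins: in practice s comes from a frozen benchmark LM,
# y_u is the downstream task label, y_s the sensitive attribute.
NUM_USEFUL, NUM_SENSITIVE, BETA = 2, 2, 1.0
s_all = torch.randn(1024, 768)
y_u_all = torch.randint(0, NUM_USEFUL, (1024,))
y_s_all = torch.randint(0, NUM_SENSITIVE, (1024,))
loader = DataLoader(TensorDataset(s_all, y_u_all, y_s_all), batch_size=128)

encoder = Encoder()
mi_useful = DVEstimator(x_dim=64, z_dim=NUM_USEFUL)
mi_sensitive = DVEstimator(x_dim=64, z_dim=NUM_SENSITIVE)
enc_opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
crit_opt = torch.optim.Adam(
    list(mi_useful.parameters()) + list(mi_sensitive.parameters()), lr=1e-4)

for s, y_u, y_s in loader:
    u = F.one_hot(y_u, NUM_USEFUL).float()
    v = F.one_hot(y_s, NUM_SENSITIVE).float()

    # 1) Critic step: tighten both DV bounds for the current encoder.
    t = encoder(s).detach()
    crit_loss = -(dv_mi_lower_bound(mi_useful, t, u)
                  + dv_mi_lower_bound(mi_sensitive, t, v))
    crit_opt.zero_grad()
    crit_loss.backward()
    crit_opt.step()

    # 2) Encoder step: keep task information, squeeze out sensitive info.
    t = encoder(s)
    enc_loss = (-dv_mi_lower_bound(mi_useful, t, u)
                + BETA * dv_mi_lower_bound(mi_sensitive, t, v))
    enc_opt.zero_grad()
    enc_loss.backward()
    enc_opt.step()
```

Alternating the two steps is the standard adversarial recipe for MINE-style objectives: the critics chase tight MI estimates while the encoder reshapes the embedding against them. After training, only the encoder is needed to produce compressed, privacy-filtered embeddings for downstream models.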