Egalitarian Language Representation in Language Models: It All Begins with Tokenizers

Menan Velayuthan, Kengatharaiyer Sarveswaran
{"title":"Egalitarian Language Representation in Language Models: It All Begins with Tokenizers","authors":"Menan Velayuthan, Kengatharaiyer Sarveswaran","doi":"arxiv-2409.11501","DOIUrl":null,"url":null,"abstract":"Tokenizers act as a bridge between human language and the latent space of\nlanguage models, influencing how language is represented in these models. Due\nto the immense popularity of English-Centric Large Language Models (LLMs),\nefforts are being made to adapt them for other languages. However, we\ndemonstrate that, from a tokenization standpoint, not all tokenizers offer fair\nrepresentation for complex script languages such as Tamil, Sinhala, and Hindi,\nprimarily due to the choice of pre-tokenization methods. We go further to show\nthat pre-tokenization plays a more critical role than the tokenization\nalgorithm itself in achieving an egalitarian representation of these complex\nscript languages. To address this, we introduce an improvement to the Byte Pair\nEncoding (BPE) algorithm by incorporating graphemes, which we term Grapheme\nPair Encoding (GPE). Our experiments show that grapheme-based character\nextraction outperforms byte-level tokenizers for complex scripts. We validate\nthis approach through experiments on Tamil, Sinhala, and Hindi.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11501","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
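
To make the abstract's distinction concrete, here is a minimal sketch (not the authors' released code) of grapheme-based character extraction and a first BPE-style merge step over graphemes, which is one plausible reading of how GPE changes the base alphabet. It uses Python's third-party regex module, whose \X pattern matches Unicode extended grapheme clusters (UAX #29); the helper names and the three-word toy corpus are illustrative assumptions.

import regex  # third-party (pip install regex); supports \X, unlike the stdlib re
from collections import Counter

def graphemes(text: str) -> list[str]:
    """Split text into Unicode extended grapheme clusters (UAX #29)."""
    return regex.findall(r"\X", text)

word = "தமிழ்"  # "Tamil" in Tamil script

# A byte-level tokenizer starts from 15 UTF-8 bytes, a codepoint-level one
# from 5 code points; grapheme clustering yields the 3 user-perceived characters.
print(graphemes(word))            # ['த', 'மி', 'ழ்']
print(list(word))                 # ['த', 'ம', 'ி', 'ழ', '்']
print(len(word.encode("utf-8")))  # 15

def most_frequent_pair(corpus: list[list[str]]) -> tuple[str, str]:
    """Find the most frequent adjacent symbol pair, as in one BPE merge step."""
    pairs = Counter()
    for seq in corpus:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

# Toy corpus: grapheme sequences stand in for the byte sequences of byte-level BPE.
corpus = [graphemes(w) for w in ["தமிழ்", "தமிழன்", "தமிழர்"]]
print(most_frequent_pair(corpus))  # ('த', 'மி') would be the first merge

Under this sketch, the only change relative to byte-level BPE is the base alphabet: merges operate over grapheme clusters, so no token can ever split a combining vowel sign or virama away from its base consonant.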