What do tokens know about their characters and how do they know it?

North American Chapter of the Association for Computational Linguistics. Pub Date: 2022-06-06. DOI: 10.48550/arXiv.2206.02608

Ayush Kaushal, Kyle Mahowald


Citations: 10

Abstract

Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for “cat” encodes that it contains the character “a”). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
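
The probing setup described in the abstract (a classifier that predicts from a token's embedding whether the token contains a given character) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The tokens are random strings over a toy alphabet, `toy_embedding` is a hypothetical stand-in for a real model's embedding lookup, and the character signal is planted by hand so that the probe has something to find. It therefore says nothing about what GPT-J, BERT, RoBERTa, or GloVe actually encode. The logistic-regression probe, the hyperparameters, and the 80/20 split are likewise assumptions.

```python
import math
import random

random.seed(0)

ALPHABET = "abcdefgh"  # toy alphabet (assumption, keeps classes roughly balanced)
TARGET = "a"           # character whose presence the probe predicts
DIM = 8                # toy embedding size (assumption)


def toy_embedding(token):
    """Hypothetical stand-in for a model's embedding lookup."""
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    # Planted signal (assumption): shift one coordinate by character presence.
    vec[0] += 2.0 if TARGET in token else -2.0
    return vec


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(X, y, steps=200, lr=0.3):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w, b, n = [0.0] * DIM, 0.0, len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * DIM, 0.0
        for x, label in zip(X, y):
            err = sigmoid(score(w, b, x)) - label
            for j in range(DIM):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Unique toy tokens, sorted so the run is reproducible, then shuffled and split
# so that no token appears in both the training and the test set.
tokens = sorted({"".join(random.choice(ALPHABET) for _ in range(4)) for _ in range(600)})
random.shuffle(tokens)
split = int(0.8 * len(tokens))
train_tokens, test_tokens = tokens[:split], tokens[split:]

X_train = [toy_embedding(t) for t in train_tokens]
y_train = [1 if TARGET in t else 0 for t in train_tokens]
X_test = [toy_embedding(t) for t in test_tokens]
y_test = [1 if TARGET in t else 0 for t in test_tokens]

w, b = train_probe(X_train, y_train)
predictions = [1 if score(w, b, x) > 0 else 0 for x in X_test]
accuracy = sum(p == label for p, label in zip(predictions, y_test)) / len(y_test)

# Majority-class baseline: accuracy of always guessing the more common label.
positive_rate = sum(y_test) / len(y_test)
majority = max(positive_rate, 1.0 - positive_rate)

print(f"probe accuracy on held-out tokens: {accuracy:.3f}")
print(f"majority-class baseline:           {majority:.3f}")
```

To probe a real model, `toy_embedding` would be replaced by a lookup into that model's input embedding matrix, with one probe trained per character. Comparing held-out accuracy against the majority-class baseline indicates whether the embeddings carry character information beyond label frequency.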


