Distributed Text Representations Using Transformers for Noisy Written Language

LatinX in AI at North American Chapter of the Association for Computational Linguistics Conference 2022 | Pub Date: 2022-07-10 | DOI: 10.52591/lxai202207102

A. Rodriguez, Pablo Rivas, G. Bejarano



Abstract




This work proposes a methodology for deriving latent representations of highly noisy text. Traditional Natural Language Processing methods treat words as the core components of a text. In contrast, we propose a character-based approach that is robust to the high syntactic noise of our target texts. We propose pre-training a Transformer model (BERT) on several general-purpose language tasks and using the pre-trained model to obtain a representation of an input text. Weights are transferred from one task in the pipeline to the next. Instead of tokenizing the text on a word or sub-word basis, we treat the text’s characters as tokens. The ultimate goal is for the resulting representations to prove useful for downstream tasks on the data, such as detecting criminal activity on marketplace platforms.
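The character-as-token idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the special-token names follow BERT's convention, and the helper names `build_char_vocab` and `encode_chars`, the `max_len` default, and the sample strings are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Special tokens follow BERT's naming convention; using them here is an assumption.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_char_vocab(texts: List[str]) -> Dict[str, int]:
    """Map every distinct character in `texts` to an integer id."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for ch in sorted(set("".join(texts))):
        vocab[ch] = len(vocab)
    return vocab


def encode_chars(text: str, vocab: Dict[str, int], max_len: int = 16) -> List[int]:
    """Encode `text` one character per token, then truncate or pad to `max_len`."""
    unk = vocab["[UNK]"]
    # Reserve two positions for the [CLS] and [SEP] markers.
    body = [vocab.get(ch, unk) for ch in text[: max_len - 2]]
    ids = [vocab["[CLS]"]] + body + [vocab["[SEP]"]]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


# Hypothetical noisy inputs, used only to build a small vocabulary.
noisy_texts = ["gr8 deal!!", "c4ll me 2nite"]
vocab = build_char_vocab(noisy_texts)
print(encode_chars("gr8 c4ll", vocab))
print(encode_chars("gr8 déal", vocab))  # 'é' is unseen, so it maps to [UNK]
```

Because the vocabulary is a small set of characters, a misspelling or obfuscated word changes only a few token ids instead of collapsing into a single unknown-word token, which is the kind of robustness the abstract is after.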
