{"title":"BERT在词汇外词的标记化和子词表示中的领域适应挑战","authors":"Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao","doi":"10.18653/v1/2020.insights-1.1","DOIUrl":null,"url":null,"abstract":"BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, it still has several research challenges which are not tackled well for domain specific corpus found in industries. In this paper, we have highlighted these problems through detailed experiments involving analysis of the attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have lead to interesting findings that showed: 1) Largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit that can lead to suboptimal tokenization choices, 2) Semantic meaning of a vocabulary word deteriorates when found as a substring in an Out-Of-Vocabulary (OOV) word, and 3) Minor misspellings in words are inadequately handled. We believe that if these challenges are tackled, it will significantly help the domain adaptation aspect of BERT.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words\",\"authors\":\"Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao\",\"doi\":\"10.18653/v1/2020.insights-1.1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, it still has several research challenges which are not tackled well for domain specific corpus found in industries. In this paper, we have highlighted these problems through detailed experiments involving analysis of the attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have lead to interesting findings that showed: 1) Largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit that can lead to suboptimal tokenization choices, 2) Semantic meaning of a vocabulary word deteriorates when found as a substring in an Out-Of-Vocabulary (OOV) word, and 3) Minor misspellings in words are inadequately handled. 
We believe that if these challenges are tackled, it will significantly help the domain adaptation aspect of BERT.\",\"PeriodicalId\":441528,\"journal\":{\"name\":\"First Workshop on Insights from Negative Results in NLP\",\"volume\":\"2016 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"First Workshop on Insights from Negative Results in NLP\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.insights-1.1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.insights-1.1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 22
Abstract
The BERT model (Devlin et al., 2019) has achieved significant progress on several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, several research challenges remain that are not well addressed for the domain-specific corpora found in industry. In this paper, we highlight these problems through detailed experiments involving analysis of attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments led to interesting findings that showed: 1) the largest in-vocabulary substring from the left is always chosen at every sub-word unit, which can lead to suboptimal tokenization choices, 2) the semantic meaning of a vocabulary word deteriorates when it appears as a substring of an Out-Of-Vocabulary (OOV) word, and 3) minor misspellings in words are inadequately handled. We believe that if these challenges are tackled, it will significantly help the domain adaptation of BERT.
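To make findings 1 and 3 concrete, below is a minimal sketch, not taken from the paper, that exercises the greedy longest-prefix-match behaviour of BERT's WordPiece tokenizer via the Hugging Face transformers library (an assumed tooling choice; the paper does not specify its implementation). The example words and the tokenizations shown in the comments are illustrative assumptions and may vary with vocabulary versions.

```python
# Minimal sketch (assumed setup, not from the paper): WordPiece's greedy
# longest-prefix-match tokenization, shown with the Hugging Face
# `transformers` library and the bert-base-uncased vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# An in-vocabulary word stays a single token.
print(tokenizer.tokenize("processing"))   # e.g. ['processing']

# Finding 1: an OOV word is split greedily from the left, taking the longest
# vocabulary prefix at each step; this can cut across natural morpheme
# boundaries and yield suboptimal sub-word units.
print(tokenizer.tokenize("gearbox"))      # e.g. ['gear', '##box']

# Finding 3: a minor misspelling changes the decomposition entirely, so the
# resulting sub-word units bear little relation to the correct word.
print(tokenizer.tokenize("language"))     # ['language']
print(tokenizer.tokenize("langauge"))     # e.g. ['lang', '##au', '##ge']
```

Finding 2, the semantic deterioration of a vocabulary word when it occurs as a substring of an OOV word, is not visible from the token strings alone; it requires inspecting the contextual representations BERT produces, which is what the authors' analysis of attention scores and dynamic word embeddings addresses.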