{"title":"BERT在词汇外词的标记化和子词表示中的领域适应挑战","authors":"Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao","doi":"10.18653/v1/2020.insights-1.1","DOIUrl":null,"url":null,"abstract":"BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, it still has several research challenges which are not tackled well for domain specific corpus found in industries. In this paper, we have highlighted these problems through detailed experiments involving analysis of the attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have lead to interesting findings that showed: 1) Largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit that can lead to suboptimal tokenization choices, 2) Semantic meaning of a vocabulary word deteriorates when found as a substring in an Out-Of-Vocabulary (OOV) word, and 3) Minor misspellings in words are inadequately handled. We believe that if these challenges are tackled, it will significantly help the domain adaptation aspect of BERT.","PeriodicalId":441528,"journal":{"name":"First Workshop on Insights from Negative Results in NLP","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words\",\"authors\":\"Anmol Nayak, Hariprasad Timmapathini, Karthikeyan Ponnalagu, Vijendran Gopalan Venkoparao\",\"doi\":\"10.18653/v1/2020.insights-1.1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"BERT model (Devlin et al., 2019) has achieved significant progress in several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, it still has several research challenges which are not tackled well for domain specific corpus found in industries. In this paper, we have highlighted these problems through detailed experiments involving analysis of the attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have lead to interesting findings that showed: 1) Largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit that can lead to suboptimal tokenization choices, 2) Semantic meaning of a vocabulary word deteriorates when found as a substring in an Out-Of-Vocabulary (OOV) word, and 3) Minor misspellings in words are inadequately handled. 
We believe that if these challenges are tackled, it will significantly help the domain adaptation aspect of BERT.\",\"PeriodicalId\":441528,\"journal\":{\"name\":\"First Workshop on Insights from Negative Results in NLP\",\"volume\":\"2016 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"First Workshop on Insights from Negative Results in NLP\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.insights-1.1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"First Workshop on Insights from Negative Results in NLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.insights-1.1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 22
Abstract
The BERT model (Devlin et al., 2019) has achieved significant progress on several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, several research challenges remain that are not well addressed for the domain-specific corpora found in industry. In this paper, we highlight these problems through detailed experiments involving analysis of attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments led to interesting findings that showed: 1) the largest in-vocabulary substring from the left is always chosen at every sub-word unit, which can lead to suboptimal tokenization choices, 2) the semantic meaning of a vocabulary word deteriorates when it appears as a substring of an Out-Of-Vocabulary (OOV) word, and 3) minor misspellings in words are inadequately handled. We believe that if these challenges are tackled, it will significantly help the domain adaptation of BERT.
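To make findings 1 and 3 concrete, below is a minimal sketch, not taken from the paper, that exercises the greedy longest-prefix-match behaviour of BERT's WordPiece tokenizer via the Hugging Face transformers library (an assumed tooling choice; the paper does not specify its implementation). The example words and the tokenizations shown in the comments are illustrative assumptions and may vary with vocabulary versions.

```python
# Minimal sketch (assumed setup, not from the paper): WordPiece's greedy
# longest-prefix-match tokenization, shown with the Hugging Face
# `transformers` library and the bert-base-uncased vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# An in-vocabulary word stays a single token.
print(tokenizer.tokenize("processing"))   # e.g. ['processing']

# Finding 1: an OOV word is split greedily from the left, taking the longest
# vocabulary prefix at each step; this can cut across natural morpheme
# boundaries and yield suboptimal sub-word units.
print(tokenizer.tokenize("gearbox"))      # e.g. ['gear', '##box']

# Finding 3: a minor misspelling changes the decomposition entirely, so the
# resulting sub-word units bear little relation to the correct word.
print(tokenizer.tokenize("language"))     # ['language']
print(tokenizer.tokenize("langauge"))     # e.g. ['lang', '##au', '##ge']
```

Finding 2, the semantic deterioration of a vocabulary word when it occurs as a substring of an OOV word, is not visible from the token strings alone; it requires inspecting the contextual representations BERT produces, which is what the authors' analysis of attention scores and dynamic word embeddings addresses.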