Word Predictability is Based on Context - and/or Frequency

Artificial intelligence and applications (Commerce, Calif.) | Pub Date: 2022-10-29 | DOI: 10.5121/csit.2022.121818

R. Delmonte, Nicolò Busetto

Citations: 1

Word Predictability is Based on Context - and/or Frequency

In this paper we present an experiment carried out with BERT on a small number of Italian sentences taken from two domains: newspapers and poetry. The two domains represent two levels of increasing difficulty in predicting the masked word, which is what we intended to test. The experiment is organized around the hypothesis that prediction becomes increasingly difficult across the three levels of linguistic complexity that we monitor: lexical, syntactic and semantic. To test this hypothesis we alternate canonical and non-canonical versions of the same sentence and process both with the same DL model. The results show that DL models are highly sensitive to the presence of non-canonical structures and to local effects of non-literal meaning composition. However, DL models are also very sensitive to word frequency: they preferentially predict function words over content words, and collocates over infrequent word phrases. To measure differences in performance we created a linguistically based “predictability parameter”, which is highly correlated with a cosine-based classification but produces better distinctions between classes.
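The abstract refers to two measurable ingredients: whether a model recovers the masked word, and a cosine-based comparison. The following is a minimal sketch of both, assuming hypothetical toy data in place of real BERT output. The names `target_rank`, `cosine_similarity`, `candidates`, and `toy_vectors` are illustrative only, and the code does not reproduce the authors' "predictability parameter", whose definition is not given here.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def target_rank(candidates: Sequence[str], target: str) -> Optional[int]:
    """Return the 1-based rank of `target` in a ranked candidate list.

    Returns None when the target word is not among the candidates.
    """
    for rank, word in enumerate(candidates, start=1):
        if word.lower() == target.lower():
            return rank
    return None


# Hypothetical ranked predictions for one masked position (best first).
# These are made-up values, not output from any real model.
candidates = ["il", "un", "mare", "cielo", "sole"]

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "mare": [1.0, 0.0, 1.0],
    "cielo": [1.0, 0.0, 0.0],
    "il": [0.0, 1.0, 0.0],
}

rank = target_rank(candidates, "mare")
similarity = cosine_similarity(toy_vectors["mare"], toy_vectors["cielo"])
print(f"rank of target: {rank}, cosine(mare, cielo): {similarity:.3f}")
```

In this toy list the two top-ranked candidates are function words, which loosely mirrors the frequency effect the abstract describes. With a real masked language model, the candidate list and vectors would come from the model rather than from hand-written values.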
