Word Predictability is Based on Context - and/or Frequency

R. Delmonte, Nicolò Busetto
DOI: 10.5121/csit.2022.121818 (https://doi.org/10.5121/csit.2022.121818)
Journal: Artificial intelligence and applications (Commerce, Calif.)
Publication date: 2022-10-29
Publication type: Journal Article
Citation count: 1

Abstract

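In this paper we present an experiment carried out with BERT on a small number of Italian sentences taken from two domains: newspapers and poetry. The two domains represent two levels of increasing difficulty in predicting the masked word. The experiment is organized around the hypothesis that prediction becomes increasingly difficult across the three levels of linguistic complexity we monitor: lexical, syntactic, and semantic. To test this hypothesis, we alternate canonical and non-canonical versions of the same sentence before processing them with the same deep learning (DL) model. The results show that DL models are highly sensitive to the presence of non-canonical structures and to local non-literal compositional meaning effects. However, DL models are also very sensitive to word frequency: they preferentially predict function words over content words, and collocates over infrequent word phrases. To measure differences in performance we created a linguistically based "predictability parameter", which is highly correlated with a cosine-based classification but produces better distinctions between classes.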
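The abstract mentions two kinds of measure for masked-word prediction: one based on how well the model recovers the target word, and one based on cosine similarity. The sketch below illustrates both in plain Python. It is not the authors' "predictability parameter", which the abstract does not define. The function names, the candidate list, and all numbers are hypothetical and invented for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def target_rank(candidates, target):
    """1-based rank of `target` among (word, probability) candidates,
    ordered by descending probability; None if the target is absent."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for position, (word, _prob) in enumerate(ranked, start=1):
        if word == target:
            return position
    return None


# Invented toy output of a masked-word model for one masked position.
# A frequent function word ("di") outranks the content words here,
# purely to mirror the kind of frequency effect the abstract describes.
toy_candidates = [("casa", 0.22), ("di", 0.41), ("mare", 0.05)]

if __name__ == "__main__":
    print(target_rank(toy_candidates, "casa"))
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

In a real setting the candidate list would come from a masked language model's output for the masked position, and the vectors would be embeddings of the target and predicted words; both are assumptions that go beyond what the abstract states.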