Collocation ranking: frequency vs semantics

Slovenščina 2.0: empirical, applied and interdisciplinary research. Pub Date: 2021-12-29. DOI: 10.4312/slo2.0.2021.2.41-70

Nikola Ljubesic, N. Logar, Iztok Kosem


Citations: 3



Collocation ranking: frequency vs semantics

Collocations play an important role in language description, especially in identifying the meanings of words. Lists of collocates ranked by some statistical measure are an indispensable part of meaning deduction in modern lexicography. In this paper, we compare two approaches to ranking collocates: (a) the logDice method, which is frequency-based and predominantly used, and (b) the fastText word embeddings method, which is new and semantics-based. The comparison was made on two Slovene datasets, one containing general-language headwords and their collocates, and the other containing headwords and their collocates extracted from a corpus of language for special purposes. The evaluation had two parts: in the quantitative part, we used supervised machine learning with the support-vector machine (SVM) algorithm and the area under the ROC curve (AUC) as the score, and in the qualitative part the rankings produced by the two methods were evaluated by lexicographers. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate rankings than the frequency-based one, lexicographers in most cases considered the collocate lists produced by the two methods very similar.
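To make the two kinds of score concrete, here is a minimal Python sketch. The `log_dice` function follows the standard logDice definition, 14 + log2(2·f_xy / (f_x + f_y)). Ranking by cosine similarity between embedding vectors is only one assumed way of using word embeddings and is not claimed to be the setup used in the paper. All frequencies, vectors, words and good/bad labels below are invented toy values, and `roc_auc` is a plain pairwise computation of the area under the ROC curve rather than any particular library's implementation.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y)); the maximum is 14."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for a single headword (NOT taken from the paper).
HEADWORD_FREQ = 1000
HEADWORD_VEC = [0.9, 0.1, 0.3]
CANDIDATES = {
    # collocate: (co-occurrence freq, collocate freq, toy embedding, 1 = judged good)
    "strong": (200, 600, [0.8, 0.2, 0.4], 1),
    "hot": (50, 400, [0.7, 0.0, 0.5], 1),
    "the": (300, 100000, [0.1, 0.9, 0.1], 0),
    "purple": (5, 300, [0.2, 0.3, 0.9], 0),
}


def logdice_score(word):
    """Frequency-based score of a candidate collocate."""
    f_xy, f_y, _, _ = CANDIDATES[word]
    return log_dice(f_xy, HEADWORD_FREQ, f_y)


def embedding_score(word):
    """Assumed semantic score: cosine similarity to the headword vector."""
    return cosine(HEADWORD_VEC, CANDIDATES[word][2])


def rank(score_fn):
    """Candidate collocates sorted from highest to lowest score."""
    return sorted(CANDIDATES, key=score_fn, reverse=True)


if __name__ == "__main__":
    labels = [CANDIDATES[w][3] for w in CANDIDATES]
    for name, fn in [("logDice", logdice_score), ("embedding", embedding_score)]:
        scores = [fn(w) for w in CANDIDATES]
        print(name, rank(fn), "AUC =", roc_auc(scores, labels))
```

The AUC here is computed directly on each raw score. In a supervised setting such as the one the abstract describes, the scores would instead come from a trained classifier.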


Source journal

Slovenščina 2.0: empirical, applied and interdisciplinary research

Self-citation rate

0.00%

Articles published