用分布式系统Apache Spark定义哈萨克语语义封闭词

IF 5.3 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Big Data and Cognitive Computing Pub Date : 2023-09-27 DOI:10.3390/bdcc7040160

Dauren Ayazbayev, Andrey Bogdanchikov, Kamila Orynbekova, Iraklis Varlamis

{"title":"用分布式系统Apache Spark定义哈萨克语语义封闭词","authors":"Dauren Ayazbayev, Andrey Bogdanchikov, Kamila Orynbekova, Iraklis Varlamis","doi":"10.3390/bdcc7040160","DOIUrl":null,"url":null,"abstract":"This work focuses on determining semantically close words and using semantic similarity in general in order to improve performance in information retrieval tasks. The semantic similarity of words is an important task with many applications from information retrieval to spell checking or even document clustering and classification. Although, in languages with rich linguistic resources, the methods and tools for this task are well established, some languages do not have such tools. The first step in our experiment is to represent the words in a collection in a vector form and then define the semantic similarity of the terms using a vector similarity method. In order to tame the complexity of the task, which relies on the number of word (and, consequently, of the vector) pairs that have to be combined in order to define the semantically closest word pairs, A distributed method that runs on Apache Spark is designed to reduce the calculation time by running comparison tasks in parallel. Three alternative implementations are proposed and tested using a list of target words and seeking the most semantically similar words from a lexicon for each one of them. In a second step, we employ pre-trained multilingual sentence transformers to capture the content semantics at a sentence level and a vector-based semantic index to accelerate the searches. The code is written in MapReduce, and the experiments and results show that the proposed methods can provide an interesting solution for finding similar words or texts in the Kazakh language.","PeriodicalId":36397,"journal":{"name":"Big Data and Cognitive Computing","volume":"81 1","pages":"0"},"PeriodicalIF":5.3000,"publicationDate":"2023-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Defining Semantically Close Words of Kazakh Language with Distributed System Apache Spark\",\"authors\":\"Dauren Ayazbayev, Andrey Bogdanchikov, Kamila Orynbekova, Iraklis Varlamis\",\"doi\":\"10.3390/bdcc7040160\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work focuses on determining semantically close words and using semantic similarity in general in order to improve performance in information retrieval tasks. The semantic similarity of words is an important task with many applications from information retrieval to spell checking or even document clustering and classification. Although, in languages with rich linguistic resources, the methods and tools for this task are well established, some languages do not have such tools. The first step in our experiment is to represent the words in a collection in a vector form and then define the semantic similarity of the terms using a vector similarity method. In order to tame the complexity of the task, which relies on the number of word (and, consequently, of the vector) pairs that have to be combined in order to define the semantically closest word pairs, A distributed method that runs on Apache Spark is designed to reduce the calculation time by running comparison tasks in parallel. Three alternative implementations are proposed and tested using a list of target words and seeking the most semantically similar words from a lexicon for each one of them. In a second step, we employ pre-trained multilingual sentence transformers to capture the content semantics at a sentence level and a vector-based semantic index to accelerate the searches. The code is written in MapReduce, and the experiments and results show that the proposed methods can provide an interesting solution for finding similar words or texts in the Kazakh language.\",\"PeriodicalId\":36397,\"journal\":{\"name\":\"Big Data and Cognitive Computing\",\"volume\":\"81 1\",\"pages\":\"0\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2023-09-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data and Cognitive Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/bdcc7040160\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data and Cognitive Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/bdcc7040160","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

本研究的重点是确定语义相近的词，并利用语义相似度来提高信息检索任务的性能。从信息检索到拼写检查，甚至是文档聚类和分类，单词的语义相似度是一项重要的任务。虽然在语言资源丰富的语言中，完成这项任务的方法和工具已经很好地建立起来，但有些语言没有这样的工具。我们实验的第一步是以向量形式表示集合中的单词，然后使用向量相似度方法定义术语的语义相似度。为了驯服任务的复杂性，它依赖于为了定义语义上最接近的词对而必须组合的词对的数量(以及向量的数量)，在Apache Spark上运行的分布式方法被设计为通过并行运行比较任务来减少计算时间。提出并测试了三种备选实现，使用目标单词列表，并从词典中为每个单词寻找语义上最相似的单词。在第二步中，我们使用预先训练的多语言句子转换器在句子级别捕获内容语义，并使用基于向量的语义索引来加速搜索。代码是用MapReduce编写的，实验和结果表明，所提出的方法可以为寻找哈萨克语中相似的单词或文本提供一个有趣的解决方案。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Defining Semantically Close Words of Kazakh Language with Distributed System Apache Spark

This work focuses on determining semantically close words and using semantic similarity in general in order to improve performance in information retrieval tasks. The semantic similarity of words is an important task with many applications from information retrieval to spell checking or even document clustering and classification. Although, in languages with rich linguistic resources, the methods and tools for this task are well established, some languages do not have such tools. The first step in our experiment is to represent the words in a collection in a vector form and then define the semantic similarity of the terms using a vector similarity method. In order to tame the complexity of the task, which relies on the number of word (and, consequently, of the vector) pairs that have to be combined in order to define the semantically closest word pairs, A distributed method that runs on Apache Spark is designed to reduce the calculation time by running comparison tasks in parallel. Three alternative implementations are proposed and tested using a list of target words and seeking the most semantically similar words from a lexicon for each one of them. In a second step, we employ pre-trained multilingual sentence transformers to capture the content semantics at a sentence level and a vector-based semantic index to accelerate the searches. The code is written in MapReduce, and the experiments and results show that the proposed methods can provide an interesting solution for finding similar words or texts in the Kazakh language.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊