PESTS: Persian_English cross lingual corpus for semantic textual similarity
Mohammad Abdous, Poorya Piroozfar, Behrouz MinaeiBidgoli
Language Resources and Evaluation, published 2024-08-03
DOI: 10.1007/s10579-024-09759-3 (https://doi.org/10.1007/s10579-024-09759-3)
Abstract
Semantic textual similarity, a subtask of natural language processing, has attracted significant research interest in recent years. Measuring semantic similarity between words or terms, sentences, paragraphs, and documents plays an important role in natural language processing and computational linguistics, with applications in question-answering systems, semantic search, fraud detection, machine translation, information retrieval, and more. Semantic similarity is the extent to which two texts (documents, paragraphs, or sentences) share meaning, whether they are written in the same language or in different languages. Cross-lingual semantic similarity requires corpora of sentence pairs that span the source and target languages and exhibit some degree of semantic similarity. Because such datasets are scarce, many current models in this field rely on machine translation, which can reduce model accuracy by propagating translation errors. For Persian, a low-resource language, little effort has gone into developing models that can comprehend the context of two languages, and the need for a model that bridges this gap is more pressing than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences is produced for the first time, in collaboration with linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. Various Transformer-based models have been fine-tuned on this dataset. On the PESTS dataset, using the XLM-RoBERTa model increases the Pearson correlation from 85.87% to 95.62%.
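The abstract reports results as a Pearson correlation. As a minimal sketch of how that metric is typically computed between gold similarity labels and model predictions, the example below uses only the Python standard library. The function name `pearson` and the score values are hypothetical illustrations; they are not taken from the PESTS corpus or from the authors' evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical gold labels and model predictions for five sentence pairs
# (illustrative values only, not data from the paper).
gold = [0.0, 1.5, 3.0, 4.0, 5.0]
pred = [0.4, 1.2, 2.7, 4.3, 4.6]
print(f"Pearson r = {pearson(gold, pred):.4f}")
```

The coefficient lies in [-1, 1]; a figure such as 95.62% presumably corresponds to the coefficient multiplied by 100.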
About the journal:
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use.
Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.