
Latest publications in Language Resources and Evaluation

Toxic comment classification and rationale extraction in code-mixed text leveraging co-attentive multi-task learning
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-01-13 · DOI: 10.1007/s10579-023-09708-6
Kiran Babu Nelatoori, Hima Bindu Kommanti

Detecting toxic comments and the rationale for a social media post's offensiveness promotes moderation of social media content. For this purpose, we propose a Co-Attentive Multi-task Learning (CA-MTL) model through transfer learning for low-resource Hindi-English (commonly known as Hinglish) toxic texts. Together, the cooperative tasks of rationale/span detection and toxic comment classification create a strong multi-task learning objective. A task collaboration module is designed to leverage the bi-directional attention between the classification and span prediction tasks. The combined loss function of the model is constructed from the individual loss functions of these two tasks. Although an English toxic span detection dataset exists, none exists for Hinglish code-mixed text to date. Hence, we developed a dataset with toxic span annotations for Hinglish code-mixed text. The proposed CA-MTL model is compared against single-task and multi-task learning models that lack the co-attention mechanism, using multilingual and Hinglish BERT variants. The F1 scores of the proposed CA-MTL model with the HingRoBERTa encoder are significantly higher than those of the baseline models on both tasks. Caution: this paper may contain words disturbing to some readers.
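
As a rough illustration of the kind of objective described here, the sketch below combines a classification loss and a span-tagging loss over a co-attended shared representation in PyTorch. The layer shapes, the single multi-head attention standing in for the task collaboration module, and the weighting term `alpha` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyCoAttentiveMTL(nn.Module):
    def __init__(self, hidden=768, num_classes=2, num_span_tags=3, alpha=0.5):
        super().__init__()
        self.alpha = alpha  # hypothetical weighting between the two task losses
        # one multi-head attention layer standing in for the task collaboration module
        self.co_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(hidden, num_classes)     # toxic vs. non-toxic
        self.span_head = nn.Linear(hidden, num_span_tags)  # BIO tags for rationale spans

    def forward(self, enc_out, cls_labels=None, span_labels=None):
        # enc_out: (batch, seq_len, hidden) from any pretrained encoder, e.g. HingRoBERTa
        attended, _ = self.co_attn(enc_out, enc_out, enc_out)
        cls_logits = self.cls_head(attended[:, 0])         # first-token pooling
        span_logits = self.span_head(attended)
        loss = None
        if cls_labels is not None and span_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)
            loss_cls = ce(cls_logits, cls_labels)
            loss_span = ce(span_logits.reshape(-1, span_logits.size(-1)),
                           span_labels.reshape(-1))
            # combined objective built from the two individual task losses
            loss = self.alpha * loss_cls + (1 - self.alpha) * loss_span
        return cls_logits, span_logits, loss

model = ToyCoAttentiveMTL()
x = torch.randn(2, 16, 768)   # dummy encoder output
_, _, loss = model(x, torch.tensor([1, 0]), torch.zeros(2, 16, dtype=torch.long))
print(loss)
```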

Citations: 0
Multi-layered semantic annotation and the formalisation of annotation schemas for the investigation of modality in a Latin corpus
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-01-06 · DOI: 10.1007/s10579-023-09706-8

This paper stems from the project A World of Possibilities. Modal pathways over an extra-long period of time: the diachrony of modality in the Latin language (WoPoss) which involves a corpus-based approach to the study of modality in the history of the Latin language. Linguistic annotation and, in particular, the semantic annotation of modality is a keystone of the project. Besides the difficulties intrinsic to any annotation task dealing with semantics, our annotation scheme involves multiple layers of annotation that are interconnected, adding complexity to the task. Considering the intricacies of our fine-grained semantic annotation, we needed to develop well-documented schemas in order to control the consistency of the annotation, but also to enable an efficient reuse of our annotated corpus. This paper presents the different elements involved in the annotation task, and how the description and the relations between the different linguistic components were formalised and documented, combining schema languages with XML documentation.
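
To make the idea of schema-controlled annotation concrete, here is a minimal sketch that validates a toy annotation layer against an inline RELAX NG schema with lxml. The element and attribute names (`annotation`, `modality`, `type`, `polarity`) are invented placeholders; the actual WoPoss schemas are richer and documented separately.

```python
from lxml import etree

# Inline RELAX NG schema: an <annotation> holding one or more <modality>
# elements whose @type is restricted to a closed vocabulary.
schema = etree.RelaxNG(etree.XML(b"""
<element name="annotation" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="modality">
      <attribute name="type">
        <choice>
          <value>epistemic</value>
          <value>deontic</value>
          <value>dynamic</value>
        </choice>
      </attribute>
      <attribute name="polarity"/>
      <text/>
    </element>
  </oneOrMore>
</element>
"""))

doc = etree.XML(b'<annotation><modality type="deontic" polarity="positive">'
                b'debet</modality></annotation>')
print(schema.validate(doc))   # True: the annotation conforms to the schema
```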

Citations: 0
AC-IQuAD: Automatically Constructed Indonesian Question Answering Dataset by Leveraging Wikidata
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-01-03 · DOI: 10.1007/s10579-023-09702-y
Kerenza Doxolodeo, Adila Alfa Krisnadhi

Constructing a question-answering dataset can be prohibitively expensive, making it difficult for researchers to build one for an under-resourced language such as Indonesian. We create a novel Indonesian question-answering dataset that is produced automatically end-to-end. The process uses a context-free grammar, the Indonesian Wikipedia corpus, and the concept of a proxy model. The dataset consists of 134 thousand simple questions and 60 thousand complex questions. It achieves grammatical and model accuracy competitive with a translated dataset, but suffers from some issues due to resource constraints.
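
As a toy illustration of grammar-driven question generation (not the authors' grammar), the snippet below enumerates simple Indonesian question patterns from a context-free grammar with NLTK; the production rules and vocabulary are invented.

```python
import nltk
from nltk.parse.generate import generate

# Invented toy grammar producing questions such as "siapa presiden indonesia ?"
grammar = nltk.CFG.fromstring("""
  Q   -> WH NP '?'
  WH  -> 'siapa' | 'apa'
  NP  -> N | N MOD
  N   -> 'presiden' | 'ibukota'
  MOD -> 'indonesia'
""")

for tokens in generate(grammar, depth=5):
    print(" ".join(tokens))   # e.g. "siapa presiden indonesia ?"
```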

Citations: 0
KurdiSent: a corpus for Kurdish sentiment analysis
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-01-02 · DOI: 10.1007/s10579-023-09716-6
Soran Badawi, Arefeh Kazemi, Vali Rezaie

Language is essential for communication and the expression of feelings and sentiments. As technology advances, language has become increasingly ubiquitous in our lives. One of the most critical research areas in natural language processing (NLP) is sentiment analysis, which aims to identify and extract opinions and attitudes from text. Sentiment analysis is particularly useful for understanding public opinion on products, services, and topics of interest. While sentiment analysis systems are well developed for English, the situation differs for other languages, such as Kurdish, because less-resourced languages have fewer NLP resources, including annotated datasets. To bridge this gap, this paper introduces KurdiSent, the first manually annotated dataset for Kurdish sentiment analysis. KurdiSent consists of over 12,000 instances labeled as positive, negative, or neutral. The corpus covers Sorani, the most widely spoken dialect of Kurdish. To ensure the quality of KurdiSent, machine learning and deep learning classifiers were trained on the dataset. The experimental results indicated that XLM-R outperformed all machine learning and deep learning classifiers, reaching an accuracy of 85%, compared to 81% for the best machine learning classifier. KurdiSent is a valuable resource for the NLP community, as it will enable researchers to develop and improve sentiment analysis systems for Kurdish. The corpus will facilitate a better understanding of public opinion in Kurdish-speaking communities.
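
A minimal sketch of the transformer side of such an experiment, assuming Hugging Face transformers and the public xlm-roberta-base checkpoint: a three-label sequence classifier with one illustrative training step. The example texts and labels are placeholders, not KurdiSent data, and the paper's actual training setup may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)          # positive / negative / neutral

texts = ["placeholder Sorani comment 1", "placeholder Sorani comment 2"]
labels = torch.tensor([0, 2])                  # invented gold labels

batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
out = model(**batch, labels=labels)            # out.loss is the cross-entropy
out.loss.backward()                            # one illustrative training step
print(out.logits.argmax(-1))                   # predicted classes (untrained head)
```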

Citations: 0
Syntactic annotation for Portuguese corpora: standards, parsers, and search interfaces
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-26 · DOI: 10.1007/s10579-023-09699-4
Pablo Faria, Charlotte Galves, Catarina Magro

In the last two decades, four Portuguese syntactically annotated corpora were built along the lines initially defined for the Penn Parsed Historical Corpora (Santorini, 2016). They cover the old, middle, classical and modern periods of European Portuguese, as well as nineteenth- and twentieth-century Brazilian Portuguese, and include different textual genres and oral discourse excerpts. Together they provide a fundamental resource for the study of variation and change in Portuguese. In recent years, an effort was made to maximally unify the annotation scheme applied to those corpora, so that searches done on one corpus can be done in exactly the same manner on the others. This effort resulted in the Portuguese Syntactic Annotation Manual (Magro & Galves, 2019). In this paper, we present the syntactic annotation for the Portuguese corpora. We describe the functioning of ParsPort, a rule-based parser which makes use of the revision mode of the query language CorpusSearch (Randall, 2005–2015). We argue that ParsPort is more efficient for our annotation efforts than the probabilistic parser developed by Bikel (2004), previously used for the syntactic annotation of the Portuguese corpora. Finally, we mention recent advances towards more user-friendly tools for syntactic searches.
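
ParsPort itself operates through CorpusSearch revision queries; purely as a loose analogy for what a rule-based revision does, the sketch below matches a structural condition in a Penn-style bracketed tree and relabels the matching node with NLTK. The rule and the example tree are invented, not taken from the Portuguese corpora.

```python
from nltk import Tree

def relabel_subject_np(tree):
    """Invented toy rule: an NP that is the first child of an IP becomes NP-SBJ."""
    for sub in tree.subtrees(lambda t: t.label() == "IP"):
        if len(sub) and isinstance(sub[0], Tree) and sub[0].label() == "NP":
            sub[0].set_label("NP-SBJ")
    return tree

t = Tree.fromstring("(IP (NP (N Maria)) (VP (V canta)))")
print(relabel_subject_np(t))   # (IP (NP-SBJ (N Maria)) (VP (V canta)))
```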

Citations: 0
Linguistic annotation of Byzantine book epigrams
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-13 · DOI: 10.1007/s10579-023-09703-x
Colin Swaelens, Ilse De Vos, Els Lefever

In this paper, we explore the feasibility of developing a part-of-speech tagger for non-normalised Byzantine Greek epigrams. To this end, we compared three different transformer-based models with embedding representations, which were then fine-tuned on a fine-grained part-of-speech tagging task. To train the language models, we compiled two data sets: the first consisting of Ancient and Byzantine Greek texts, the second of Ancient, Byzantine and Modern Greek. This allowed us to ascertain whether Modern Greek contributes to the modelling of Byzantine Greek. For the supervised task of part-of-speech tagging, we collected a training set of existing, annotated (Ancient) Greek texts. For evaluation, a gold standard containing 10,000 tokens of unedited Byzantine Greek poems was manually annotated and validated through an inter-annotator agreement study. The experimental results look very promising, with the BERT model trained on all Greek data achieving the best performance for fine-grained part-of-speech tagging.
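
For readers unfamiliar with the setup, the following sketch shows the generic token-classification pattern such experiments use, here with the public multilingual BERT checkpoint and an invented five-tag inventory standing in for the paper's fine-grained tagset; the untrained head yields arbitrary predictions until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

TAGS = ["NOUN", "VERB", "ADJ", "PART", "PUNCT"]       # invented tag subset
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(TAGS))

enc = tok("χαῖρε ὦ βίβλε", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                      # (1, seq_len, num_labels)
pred = [TAGS[i] for i in logits.argmax(-1)[0].tolist()]
print(list(zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), pred)))
```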

Citations: 0
Democratizing neural machine translation with OPUS-MT
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-13 · DOI: 10.1007/s10579-023-09704-w
Jörg Tiedemann, Mikko Aulamo, Daria Bakshandaeva, Michele Boggia, Stig-Arne Grönroos, Tommi Nieminen, Alessandro Raganato, Yves Scherrer, Raúl Vázquez, Sami Virpioja

This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our ongoing mission of increasing language coverage and translation quality, and also describe work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
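
Released OPUS-MT models can be run directly through the Hugging Face transformers MarianMT interface; the snippet below uses the public English-Finnish checkpoint as one example of end-user integration.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-fi"        # public English-to-Finnish model
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tok(["Open machine translation for everyone."],
            return_tensors="pt", padding=True)
out = model.generate(**batch)
print(tok.batch_decode(out, skip_special_tokens=True))
```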

Citations: 0
When MIPVU goes to no man’s land: a new language resource for hybrid, morpheme-based metaphor identification in Hungarian
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-09 · DOI: 10.1007/s10579-023-09705-9
Gábor Simon, Tímea Bajzát, Júlia Ballagó, Zsuzsanna Havasi, Emese K. Molnár, Eszter Szlávich

The aim of the article is to present a new language resource for metaphor analysis in corpora: (i) a MIPVU-inspired, morpheme-based process for identifying metaphor in Hungarian and (ii) a refined, innovative version of metaphor identification that extends the scope of the process to multi-word expressions. The elaboration of language-specific protocols for metaphor identification has become one of the central endeavors in contemporary cross-linguistic research on metaphor, but there is a gap in the field regarding languages with rich morphology, especially in the case of Hungarian. To fill this gap, we developed a hybrid, morpheme-based version of the original method, which can handle morphologically complex metaphorical expressions. Additional innovations of our protocol are the measurement and tagging of idiomaticity in metaphors based on collocation analysis, and the identification of semantic relationships between the components of metaphorical expressions. The present paper discusses both the theoretical motivation and the practical details of the adapted method for metaphor identification. In conclusion, the presented protocol can provide new answers to the questions of metaphor identification in languages with rich morphology and shed new light on the internal semantic organization of linguistic metaphors.
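
Since the protocol scores idiomaticity through collocation analysis, a standard building block is a bigram association measure such as pointwise mutual information; the sketch below computes it with NLTK over a toy token sequence. The tokens and any cut-off are placeholders, not the authors' actual measure.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Toy Hungarian-flavoured token sequence; real input would be a corpus.
tokens = ("a szívére vette a kritikát ő a szívére vette "
          "a döntést komolyan vette").split()

finder = BigramCollocationFinder.from_words(tokens)
for bigram, score in finder.score_ngrams(BigramAssocMeasures.pmi)[:5]:
    print(bigram, round(score, 2))    # higher PMI = stronger collocation
```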

Citations: 0
EmoTwiCS: a corpus for modelling emotion trajectories in Dutch customer service dialogues on Twitter
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-08 · DOI: 10.1007/s10579-023-09700-0
Sofie Labat, Thomas Demeester, Véronique Hoste

Due to the rise of user-generated content, social media is increasingly adopted as a channel to deliver customer service. Given the public character of online platforms, the automatic detection of emotions forms an important application in monitoring customer satisfaction and preventing negative word-of-mouth. This paper introduces EmoTwiCS, a corpus of 9489 Dutch customer service dialogues on Twitter that are annotated for emotion trajectories. In our business-oriented corpus, we view emotions as dynamic attributes of the customer that can change at each utterance of the conversation. The term ‘emotion trajectory’ refers therefore not only to the fine-grained emotions experienced by customers (annotated with 28 labels and valence-arousal-dominance scores), but also to the event happening prior to the conversation and the responses made by the human operator (both annotated with 8 categories). Inter-annotator agreement (IAA) scores on the resulting dataset are substantial and comparable with related research, underscoring its high quality. Given the interplay between the different layers of annotated information, we perform several in-depth analyses to investigate (i) static emotions in isolated tweets, (ii) dynamic emotions and their shifts in trajectory, and (iii) the role of causes and response strategies in emotion trajectories. We conclude by listing the advantages and limitations of our dataset, after which we give some suggestions on the different types of predictive modelling tasks and open research questions to which EmoTwiCS can be applied. The dataset is made publicly available at https://lt3.ugent.be/resources/emotwics.
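
One way to picture the annotation layers described above is as a per-turn record. The sketch below is a hypothetical data structure with invented field names, mirroring the fine-grained customer emotion label, the valence-arousal-dominance triple, the prior event, and the operator response category; it is not the corpus's actual release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Utterance:
    speaker: str                          # "customer" or "operator"
    text: str
    emotion: Optional[str] = None         # one of 28 fine-grained labels (customer turns)
    vad: Optional[Tuple[float, float, float]] = None   # valence, arousal, dominance
    response_type: Optional[str] = None   # one of 8 operator strategies

@dataclass
class Dialogue:
    event_cause: str                      # event category preceding the conversation
    turns: List[Utterance] = field(default_factory=list)

dlg = Dialogue(event_cause="delayed delivery", turns=[
    Utterance("customer", "Waar blijft mijn pakket?",
              emotion="annoyance", vad=(-0.6, 0.7, 0.3)),
    Utterance("operator", "We zoeken dit meteen voor je uit!",
              response_type="investigating"),
])
print(dlg.turns[0].emotion)
```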

Citations: 3
Resources building for sentiment analysis of content disseminated by Tunisian medias in social networks
IF 2.7 · CAS Region 3 (Computer Science) · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2023-12-02 · DOI: 10.1007/s10579-023-09697-6
Emna Fsih, Rahma Boujelbane, Lamia Hadrich Belguith

Nowadays, social networks play a fundamental role in promoting and diffusing television and radio programs to different categories of audiences. Political parties, influential groups and political activists have rapidly seized these new communication media to spread their ideas and express their sentiments concerning critical issues. In this context, Twitter, Facebook and YouTube have become very popular tools for sharing videos and communicating with users, who interact with each other to discuss problems, propose solutions and give viewpoints. This interaction on social media sites yields a huge amount of unstructured and noisy text, hence the need for automated analysis techniques to classify the sentiments conveyed in users' comments. In this work, we focus on opinions written in a less-resourced Arabic variety, the Tunisian dialect (TD), and present a process for building a sentiment analysis model for comments on Tunisian television broadcasts published in social media. Because the Tunisian dialect has no orthographic standard, these comments are written in idiosyncratic ways with varying spellings. To address this, we designed crucial resources, namely a sentiment lexicon and an annotated corpus, which we used to investigate machine learning and deep learning models in order to identify the best sentiment analysis model for the Tunisian dialect.
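
A deliberately simple example of what a sentiment lexicon enables is a polarity-sum baseline like the sketch below; the two lexicon entries are hypothetical stand-ins for Tunisian-dialect entries, and the paper's actual models are learned classifiers built on top of such resources.

```python
# Hypothetical lexicon entries; a real TD lexicon would be far larger.
TD_LEXICON = {"bahi": 1.0, "khayeb": -1.0}

def lexicon_polarity(tokens):
    score = sum(TD_LEXICON.get(t.lower(), 0.0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_polarity("el match bahi barcha".split()))   # positive
```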

Citations: 1