Language Resources and Evaluation: Latest Articles

A corpus of Persian literary text
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-23; DOI: 10.1007/s10579-023-09689-6
Shahab Raji, Malihe Alikhani, Gerard de Melo, Matthew Stone

Persian poetry has profoundly affected all periods of Persian literature and the literature of other countries as well. It is a fundamental vehicle for expressing Persian culture and political opinion. This paper presents a corpus of Persian literary text, mainly focusing on poetry, that covers the ninth to the twenty-first century and is annotated for century and style, with additional partial annotation of rhetorical figures. Our resource is the largest and most diverse corpus of Persian literary text available, with a particularly broad temporal scope. This allows us to conduct several computational experiments to analyze poetic styles, authors, and time periods, as well as context shifts over time, for which we rely both on supervised models and on Persian poetry-specific heuristics. The corpus, the tools, and the experiments described in this paper can be used not only for digital humanities studies of Persian literature but also for processing Persian texts in general, as well as in broader cross-linguistic applications.

Citations: 0
A corpus of English learners with Arabic and Hebrew backgrounds
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-20; DOI: 10.1007/s10579-023-09692-x
Omaima Abboud, Batia Laufer, Noam Ordan, Uliana Sentsova, Shuly Wintner

Learner corpora, datasets that reflect the language of non-native speakers, are instrumental for research on language learning and development, as well as for practical applications, mainly in teaching and education. Such corpora now exist for a plethora of native–foreign language pairs; but until recently, none of them reflected native Hebrew speakers, and very few reflected native Arabic speakers. We introduce a recently released corpus of English essays authored by learners in Israel. The corpus consists of two sub-corpora, one of Arabic native speakers and the other consisting mainly of Hebrew native speakers. We report on the composition and curation of the datasets; specifically, we processed the data so that both sub-corpora are now uniformly represented, facilitating seamless research and computational processing of the data. We provide statistical information on the corpora and outline a few research projects that have already used them. This is the first and only learner corpus in Israel that covers two major native languages of learners who follow the same English syllabus within one educational system. All the resources related to the corpus are freely available.

Citations: 0
The Reading Everyday Emotion Database (REED): a set of audio-visual recordings of emotions in music and language
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-20; DOI: 10.1007/s10579-023-09698-5
Jia Hoong Ong, Florence Yik Nam Leung, Fang Liu

Most audio-visual (AV) emotion databases consist of clips that do not reflect real-life emotion processing (e.g., professional actors in a bright studio-like environment) and contain only spoken clips; none has sung clips that express complex emotions. Here, we introduce a new AV database, the Reading Everyday Emotion Database (REED), which directly addresses those gaps. We recorded the faces of everyday adults with a diverse range of acting experience expressing 13 emotions: neutral, the six basic emotions (angry, disgusted, fearful, happy, sad, surprised), and six complex emotions (embarrassed, hopeful, jealous, proud, sarcastic, stressed). Recordings were made in two auditory domains (spoken and sung) using everyday recording devices (e.g., laptops and mobile phones). The recordings were validated by an independent group of raters. We found that intensity ratings of the recordings were positively associated with recognition accuracy, and that the basic emotions, as well as the neutral and sarcastic emotions, were recognised more accurately than the other complex emotions. Emotion recognition accuracy also differed by utterance. Exploratory analysis revealed that recordings of those with drama experience were better recognised than those of people without. Overall, this database will benefit those who need AV clips with natural variations in both emotion expression and recording environment.

Citations: 0
A multilingual, multimodal dataset of aggression and bias: the ComMA dataset
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-16; DOI: 10.1007/s10579-023-09696-7
Ritesh Kumar, Shyam Ratan, Siddharth Singh, Enakshi Nandi, Laishram Niranjana Devi, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Akanksha Bansal

In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also by the “type” of discursive role that the comment performs with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. This data has been collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags that have been used for marking the different discursive roles performed through the comments, such as attack, defend, and so on. We also present a statistical analysis of the dataset as well as the results of our baseline experiments on developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and ‘hard’ sets of instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.

Citations: 0
Automatic genre identification: a survey
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-16; DOI: 10.1007/s10579-023-09695-8
Taja Kuzman, Nikola Ljubešić

Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text categories defined by the author’s purpose, the common function of the text, and the text’s conventional form. Obtaining genre information has been shown to be beneficial for a wide range of disciplines, including linguistics, corpus linguistics, computational linguistics, natural language processing, information retrieval, and information security. Consequently, in the past 20 years, numerous researchers have collected genre datasets with the aim of developing an efficient genre classifier. However, their approaches to the definition of genre schemata, data collection, and manual annotation vary substantially, resulting in significantly different datasets. As most AGI experiments are dataset-dependent, a sufficient understanding of the differences between the available genre datasets is of great importance for researchers venturing into this area. In this paper, we present a detailed overview of different approaches to each of the steps of the AGI task, from the definition of the genre concept and the genre schema, to the dataset collection and annotation methods, and, finally, to machine learning strategies. Special focus is dedicated to the description of the most relevant genre schemata and datasets, and details on the availability of all of the datasets are provided. In addition, the paper presents recent advances in machine learning approaches to automatic genre identification, and concludes by proposing directions towards developing a stable multilingual genre classifier.

Citations: 1
Brazilian Portuguese corpora for teaching and translation: the CoMET project
IF 2.7; Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-16; DOI: 10.1007/s10579-023-09690-z
Stella E. O. Tagnin

This paper starts with an overview of corpora available for Brazilian Portuguese and then focuses mainly on the CoMET Project developed at the University of São Paulo. CoMET consists of three corpora: a comparable Portuguese-English technical corpus (CorTec), a Portuguese-English parallel (translation) corpus (CorTrad), and a multilingual learner corpus (CoMAprend), all available for online queries with specific tools. CorTec offers over fifty corpora in a variety of domains, from Health Sciences to Olympic Games. CorTrad is divided into three parts: Popular Science, Technical-Scientific, and Literary. Each one of CoMET’s corpora is presented in detail. Examples are also provided.

Citations: 0
Correction: The DELAD initiative for sharing language resources on speech disorders
Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-06; DOI: 10.1007/s10579-023-09701-z
Alice Lee, Nicola Bessell, Henk van den Heuvel, Katarzyna Klessa, Satu Saalasti
Citations: 0
LoNLI: An Extensible Framework for Testing Diverse Logical Reasoning Capabilities for NLI
Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-11-04; DOI: 10.1007/s10579-023-09691-y
Ishan Tarunesh, Somak Aditya, Monojit Choudhury
Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU). In this work, we propose an extensible framework to collectively yet categorically test diverse logical reasoning capabilities required for NLI (and, by extension, NLU). Motivated by behavioral testing, we create a semi-synthetic large test bench (363 templates, 363k examples) and an associated framework that offers the following utilities: (1) individually testing and analyzing reasoning capabilities along 17 reasoning dimensions (including pragmatic reasoning); (2) designing experiments to study cross-capability information content (leave one out or bring one in); and (3) controlling for artifacts and biases, which the synthetic nature of the data makes possible. We extend a publicly available framework of automated test case instantiation from free-form natural language templates (CheckList) and a well-defined taxonomy of capabilities to cover a wide range of increasingly harder test cases while varying the complexity of natural language. Through our analysis of state-of-the-art NLI systems, we observe that our benchmark is indeed hard (and non-trivial even with training on additional resources). Some capabilities stand out as harder. Further, fine-grained analysis and fine-tuning experiments reveal more insights about these capabilities and the models, supporting and extending previous observations and thus showing the utility of the proposed test bench.
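Template-based test case instantiation, as mentioned in the abstract above, can be pictured with a minimal sketch. It is not the LoNLI or CheckList implementation: the `instantiate` function, the template fields, and the lexicon are all hypothetical and were invented here for illustration. It assumes each template carries a premise, a hypothesis, a gold label, and a capability tag, and that every combination of slot fillers yields one example.

```python
from itertools import product
from string import Formatter


def instantiate(template: dict, lexicon: dict) -> list:
    """Expand one premise/hypothesis template into labelled NLI examples."""
    # Collect the slot names used in either sentence, in first-seen order.
    slots = []
    for text in (template["premise"], template["hypothesis"]):
        for _, name, _, _ in Formatter().parse(text):
            if name and name not in slots:
                slots.append(name)
    examples = []
    # Every combination of slot fillers yields one test case.
    for values in product(*(lexicon[s] for s in slots)):
        binding = dict(zip(slots, values))
        examples.append({
            "premise": template["premise"].format(**binding),
            "hypothesis": template["hypothesis"].format(**binding),
            "label": template["label"],
            "capability": template["capability"],
        })
    return examples


# Hypothetical template and lexicon (assumptions, not taken from the paper).
TEMPLATE = {
    "capability": "comparative",
    "premise": "{name} is taller than {other}.",
    "hypothesis": "{other} is taller than {name}.",
    "label": "contradiction",
}
LEXICON = {"name": ["Asha", "Ben"], "other": ["Carla", "Dev", "Eli"]}

if __name__ == "__main__":
    for ex in instantiate(TEMPLATE, LEXICON):
        print(ex["premise"], "=>", ex["hypothesis"], f"[{ex['label']}]")
```

Because the label is fixed by the template rather than by the fillers, the size and difficulty of such a test set can be varied by editing the lexicon alone, which is one way a synthetic benchmark keeps artifacts under control.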
Citations: 4
Building the VisSE Corpus of Spanish SignWriting
Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-10-26; DOI: 10.1007/s10579-023-09694-9
Antonio F. G. Sevilla, Alberto Díaz Esteban, José María Lahoz-Bengoechea
Citations: 0
Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements
Zone 3 (Computer Science); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2023-10-21; DOI: 10.1007/s10579-023-09682-z
Nikolay Babakov, Varvara Logacheva, Alexander Panchenko
Toxicity on the Internet is an acknowledged problem. It includes a wide range of actions from the use of obscene words to offenses and hate speech toward particular users or groups of people. However, there also exist other types of inappropriate messages which are usually not viewed as toxic as they do not contain swear words or explicit offenses. Such messages can contain covert toxicity or generalizations, incite harmful actions (crime, suicide, drug use), and provoke “heated” discussions. These messages are often related to particular sensitive topics, e.g. politics, sexual minorities, or social injustice. Such topics tend to yield toxic emotional reactions more often than other topics, e.g. cars or computing. At the same time, not all messages within “flammable” topics are inappropriate. This work focuses on automatically detecting inappropriate language in natural texts. This is crucial for monitoring user-generated content and developing dialogue systems and AI assistants. While many works focus on toxicity detection, we highlight the fact that texts can be harmful without being toxic or containing obscene language. Blind censorship based on keywords is a common approach to address these issues, but it limits a system’s functionality. This work proposes a safe and effective solution to serve broad user needs and develop necessary resources and tools. Thus, machinery for inappropriateness detection could be useful (i) for making communication on the Internet safer, more productive, and inclusive by flagging truly inappropriate content while not banning messages blindly by topic; (ii) for detection of inappropriate messages generated by automatic systems, e.g. neural chatbots, due to biases in training data; (iii) for debiasing training data for language models (e.g. BERT and GPT-2). 
Towards this end, in this work, we present two text collections labeled according to a binary notion of inappropriateness (124,597 samples) and a multinomial notion of sensitive topic (33,904 samples). Assuming that the notion of inappropriateness is common among people of the same culture, we base our approach on a human intuitive understanding of what is not acceptable and harmful. To devise an objective view of inappropriateness, we define it in a data-driven way through crowdsourcing. Namely, we run a large-scale annotation study asking workers if a given chatbot-generated utterance could harm the reputation of the company that created this chatbot. High values of inter-annotator agreement suggest that the notion of inappropriateness exists and can be uniformly understood by different people. To define the notion of a sensitive topic in an objective way we use guidelines suggested by specialists in the Legal and PR departments of a large company. We use the collected datasets to train inappropriateness and sensitive topic classifiers employing both classic and Transformer-based models.
Citations: 0