
Latest articles in Language Resources and Evaluation

Automatic construction of direction-aware sentiment lexicon using direction-dependent words
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-05-25 · DOI: 10.1007/s10579-024-09737-9
Jihye Park, Hye Jin Lee, Sungzoon Cho

Explainability, the degree to which an interested stakeholder can understand the key factors that led to a data-driven model’s decision, is regarded as essential in the financial domain. Accordingly, lexicons that achieve reasonable performance while providing clear explanations to users have been among the most popular resources in sentiment-based financial forecasting. Because deep learning-based techniques offer no clear basis for interpreting their results, lexicons have consistently attracted the community’s attention as a crucial tool in studies that demand explanations of the sentiment estimation process. One of the challenges in constructing a financial sentiment lexicon is the domain-specific phenomenon that a word’s sentiment orientation can change depending on the directional expression applied to it. For instance, the word “cost” typically conveys a negative sentiment; however, when it is juxtaposed with “decrease” to form the phrase “cost decrease,” the associated sentiment is positive. Several studies have manually built lexicons containing directional expressions, but such efforts are hindered by the intensive human labor and time that manual inspection inevitably requires. In this study, we propose to automatically construct a “sentiment lexicon composed of direction-dependent words,” which expresses each term as a pair consisting of a directional word and a direction-dependent word. Experimental results show that the proposed lexicon yields improved classification performance, demonstrating the effectiveness of our method for the automated construction of a direction-aware sentiment lexicon.
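The core representational idea, keying lexicon entries on a (direction-dependent word, directional word) pair rather than on a single word, can be illustrated with a minimal sketch. The entries and helper below are invented for illustration; they are not the authors’ automatically constructed lexicon.

```python
# Minimal sketch of a direction-aware sentiment lexicon lookup.
# Entries are illustrative toy examples, not the paper's resource.

# Each key pairs a direction-dependent word with a directional word.
DIRECTION_AWARE_LEXICON = {
    ("cost", "decrease"): "positive",
    ("cost", "increase"): "negative",
    ("revenue", "decrease"): "negative",
    ("revenue", "increase"): "positive",
}

def phrase_sentiment(dependent_word: str, directional_word: str) -> str:
    """Look up the sentiment of a (dependent, directional) pair,
    falling back to 'unknown' for unseen combinations."""
    key = (dependent_word.lower(), directional_word.lower())
    return DIRECTION_AWARE_LEXICON.get(key, "unknown")

# "cost" alone reads negative, but "cost decrease" flips to positive.
print(phrase_sentiment("cost", "decrease"))     # positive
print(phrase_sentiment("revenue", "decrease"))  # negative
```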

Citations: 0
Cross-linguistically consistent semantic and syntactic annotation of child-directed speech
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-05-15 · DOI: 10.1007/s10579-024-09734-y
Ida Szubert, Omri Abend, Nathan Schneider, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman

Corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, yet semantic annotation for such corpora is still scarce and lacks a uniform standard. Semantic annotation of CDS is particularly important for understanding the nature of the input children receive and for developing computational models of child language acquisition. For example, under the assumption that children are able to infer meaning representations for (at least some of) the utterances they hear, the acquisition task is to learn a grammar that can map novel adult utterances onto their corresponding meaning representations, in the face of noise and distraction by other contextually possible meanings. To study this problem and to develop computational models of it, we need corpora that provide both adult utterances and their meaning representations, ideally using annotation that is consistent across a range of languages in order to facilitate cross-linguistic comparative studies. This paper proposes a methodology for constructing such corpora of CDS paired with sentential logical forms, and uses this method to create two such corpora, in English and Hebrew. The approach enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. Specifically, the approach involves two steps. First, we annotate the corpora using the Universal Dependencies (UD) scheme for syntactic annotation, which has been developed to apply consistently to a wide variety of domains and typologically diverse languages. Next, we further annotate these data by applying an automatic method for transducing sentential logical forms (LFs) from UD structures. The UD and LF representations have complementary strengths: UD structures are language-neutral and support consistent and reliable annotation by multiple annotators, whereas LFs are neutral as to their syntactic derivation and transparently encode semantic relations. Using this approach, we provide syntactic and semantic annotation for two corpora from CHILDES: Brown’s Adam corpus (English; we annotate approximately 80% of its child-directed utterances) and all child-directed utterances from Berman’s Hagar corpus (Hebrew). We verify the quality of the UD annotation using an inter-annotator agreement study, and manually evaluate the transduced meaning representations. We then demonstrate the utility of the compiled corpora through (1) a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena in the CDS, and (2) applying an existing computational model of language acquisition to the two corpora and briefly comparing the results across languages.
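The UD-to-LF transduction step can be pictured with a toy example: read core dependency relations off a parse and emit a flat neo-Davidsonian formula. The relation-to-role mapping and output format below are illustrative assumptions, far simpler than the transducer the paper applies.

```python
# Toy transduction from UD-style dependency triples to a logical form.
from typing import List, Tuple

# (dependent, relation, head) triples for "Adam eats a cookie"
# (function words omitted for brevity).
parse: List[Tuple[str, str, str]] = [
    ("Adam", "nsubj", "eats"),
    ("cookie", "obj", "eats"),
]

ROLE_MAP = {"nsubj": "agent", "obj": "theme"}  # simplistic mapping

def to_logical_form(triples: List[Tuple[str, str, str]]) -> str:
    """Build a flat neo-Davidsonian formula from core UD relations."""
    predicates = []
    heads = {head for _, rel, head in triples if rel in ROLE_MAP}
    for head in heads:
        predicates.append(f"{head}(e)")           # event predicate
    for dep, rel, head in triples:
        if rel in ROLE_MAP:
            predicates.append(f"{ROLE_MAP[rel]}(e, {dep})")
    return " & ".join(predicates)

print(to_logical_form(parse))  # eats(e) & agent(e, Adam) & theme(e, cookie)
```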

Citations: 0
Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-05-10 · DOI: 10.1007/s10579-024-09738-8
Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov

Given the increase in data production in the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one such IE task, useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset, generating an augmented dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers on the newly obtained datasets. The best-performing NER system achieved an F-1 score of 86.39% on the development set. The novel methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications across under-resourced languages.
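The 20% entity-replacement augmentation can be sketched roughly as below. The BIO-style data layout, the replacement pool, and the single-token simplification are assumptions; real CRAFT entities often span several tokens.

```python
# Rough sketch of augmenting an annotated NER corpus by swapping a
# fraction of entity mentions for other mentions of the same type.
import random

def augment_by_entity_swap(sentences, entity_pool, rate=0.2, seed=13):
    """sentences: list of (tokens, labels) with BIO-style labels;
    entity_pool: dict mapping entity type -> replacement strings."""
    rng = random.Random(seed)
    augmented = []
    for tokens, labels in sentences:
        new_tokens = list(tokens)
        for i, label in enumerate(labels):
            # Only single-token B- entities are handled, for brevity.
            if label.startswith("B-") and rng.random() < rate:
                pool = entity_pool.get(label[2:], [])
                if pool:
                    new_tokens[i] = rng.choice(pool)
        augmented.append((new_tokens, list(labels)))
    return augmented

sents = [(["p53", "regulates", "apoptosis"], ["B-GENE", "O", "B-PROCESS"])]
pool = {"GENE": ["BRCA1", "TP53"], "PROCESS": ["autophagy"]}
print(augment_by_entity_swap(sents, pool, rate=1.0))
```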

Citations: 0
Features in extractive supervised single-document summarization: case of Persian news
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-05-08 · DOI: 10.1007/s10579-024-09739-7
Hosein Rezaei, Seyed Amid Moeinzadeh Mirhosseini, Azar Shahgholian, Mohamad Saraee

Text summarization has been one of the most challenging areas of research in NLP. Much effort has been made to overcome this challenge using either abstractive or extractive methods. Extractive methods are preferable due to their simplicity compared with the more elaborate abstractive methods. In extractive supervised single-document approaches, the system does not generate sentences. Instead, via supervised learning, it learns how to score sentences within the document based on textual features and subsequently selects those with the highest rank. The core objective is therefore ranking, which depends heavily on document structure and context. These dependencies have gone unnoticed in many state-of-the-art solutions. In this work, document-related features such as topic and relative length are integrated into the vector of every sentence to enhance the quality of summaries. Our experimental results show that the system takes contextual and structural patterns into account, which increases the precision of the learned model. Consequently, our method produces more comprehensive and concise summaries.
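As a sketch of that integration, document-level signals (here a topic id and the sentence’s relative length) can be concatenated onto each sentence’s base feature vector before a supervised ranker scores it. The specific features and the stand-in linear scorer are assumptions, not the paper’s exact design.

```python
# Sketch: append document-level features to sentence vectors, then rank.
import numpy as np

def sentence_features(sent_vec, doc_topic_id, sent_len, doc_len, n_topics=5):
    """Concatenate a one-hot document topic and the sentence's
    relative length onto the base sentence vector."""
    topic_onehot = np.zeros(n_topics)
    topic_onehot[doc_topic_id] = 1.0
    relative_length = sent_len / max(doc_len, 1)
    return np.concatenate([sent_vec, topic_onehot, [relative_length]])

def rank_sentences(vectors, weights):
    """Score each sentence with a (stand-in) linear model and
    return indices sorted from best to worst."""
    scores = [float(v @ weights) for v in vectors]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

v1 = sentence_features(np.ones(4), doc_topic_id=2, sent_len=12, doc_len=40)
v2 = sentence_features(np.ones(4) * 0.5, doc_topic_id=2, sent_len=8, doc_len=40)
print(v1.shape)                               # (10,)
print(rank_sentences([v1, v2], np.ones(10)))  # [0, 1]
```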

Citations: 0
Mismatching-aware unsupervised translation quality estimation for low-resource languages
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-05-05 · DOI: 10.1007/s10579-024-09727-x
Fatemeh Azadi, Heshaam Faili, Mohammad Javad Dousti

Translation Quality Estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. This task has gained increasing attention as an important component in practical applications of MT. In this paper, we first propose XLMRScore, a cross-lingual counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric can be used as a simple unsupervised QE method, yet it faces two issues: first, untranslated tokens lead to unexpectedly high translation scores; second, greedy matching in XLMRScore produces mismatching errors between source and hypothesis tokens. To mitigate these issues, we suggest replacing untranslated words with the unknown token, and cross-lingually aligning the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on four low-resource language pairs of the WMT21 QE shared task, as well as a new English→Persian (En-Fa) test dataset introduced in this paper. Experiments show that our method achieves results comparable to the supervised baseline in two zero-shot scenarios, i.e., with less than a 0.01 difference in Pearson correlation, while outperforming unsupervised rivals across all the low-resource language pairs by more than 8% on average.
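The greedy-matching core of a BERTScore-style metric is compact enough to sketch. The version below runs on toy vectors rather than XLM-RoBERTa hidden states, shows only the precision-style direction, and leaves out the UNK replacement and alignment fixes described above.

```python
# Sketch of BERTScore-style greedy matching over token embeddings.
import numpy as np

def greedy_score(src_emb: np.ndarray, hyp_emb: np.ndarray) -> float:
    """Average, over hypothesis tokens, of each token's best cosine
    similarity to any source token (precision-style direction)."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    h = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    sim = h @ s.T                    # (hyp_len, src_len) cosine matrix
    return float(sim.max(axis=1).mean())

rng = np.random.default_rng(0)
src = rng.normal(size=(6, 16))       # 6 source tokens, toy 16-dim states
hyp = rng.normal(size=(5, 16))       # 5 hypothesis tokens
print(round(greedy_score(src, hyp), 3))
```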

Citations: 0
Improving Arabic sentiment analysis across context-aware attention deep model based on natural language processing
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-04-27 · DOI: 10.1007/s10579-024-09741-z
Abubakr H. Ombabi, Wael Ouarda, Adel M. Alimi

With the enormous growth of social data in recent years, sentiment analysis has gained increasing research attention and has been widely explored in various languages. The nature of the Arabic language imposes several challenges, such as its complicated morphological structure and limited resources; hence, current state-of-the-art methods for sentiment analysis remain to be enhanced. This inspired us to explore the application of emerging deep-learning architectures to Arabic text classification. In this paper, we present an ensemble model that integrates a convolutional neural network, bidirectional long short-term memory (Bi-LSTM), and an attention mechanism to predict the sentiment orientation of Arabic sentences. The convolutional layer is used for feature extraction from the higher-level sentence representation layer, and the Bi-LSTM is integrated to further capture contextual information from the produced set of features. Two attention mechanism units are incorporated to highlight the critical information in the contextual feature vectors produced by the Bi-LSTM hidden layers. The context-related vectors generated by the attention layers are then concatenated and passed into a classifier to predict the final label. To disentangle the influence of these components, the proposed model is validated as three variant architectures on a multi-domain corpus as well as four benchmarks. Experimental results show that incorporating the Bi-LSTM and attention mechanism improves the model’s performance, yielding an accuracy of 96.08%. Consequently, this architecture consistently outperforms other state-of-the-art approaches, with improvements of up to +14.47%, +20.38%, and +18.45% in accuracy, precision, and recall, respectively. These results demonstrate the strengths of this model in addressing the challenges of text classification tasks.
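A minimal PyTorch sketch of the described pipeline, convolution for feature extraction, a Bi-LSTM for context, and attention pooling before the classifier, is given below. Layer sizes and the single attention unit are simplifying assumptions; the paper uses two attention units and a larger configuration.

```python
# Minimal CNN -> Bi-LSTM -> attention classifier sketch in PyTorch.
import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, vocab=5000, emb=128, channels=64, hidden=64, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)      # per-token attention score
        self.out = nn.Linear(2 * hidden, classes)

    def forward(self, ids):                        # ids: (batch, seq)
        x = self.embed(ids).transpose(1, 2)        # (batch, emb, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)  # (batch, seq, channels)
        h, _ = self.bilstm(x)                      # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, seq)
        ctx = (w.unsqueeze(-1) * h).sum(dim=1)     # attention-weighted sum
        return self.out(ctx)                       # (batch, classes)

logits = CnnBiLstmAttention()(torch.randint(0, 5000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])
```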

Citations: 0
Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-04-16 · DOI: 10.1007/s10579-024-09732-0
Shankar Biradar, Sunil Saumya, Arun Chauhan

Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently dominated by fake narratives. It has therefore become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD), comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate the relation between the presence of fake narratives and the intensity of hate, this study presents a statistical analysis using the chi-square test. The statistical findings indicate that the calculated χ² value is greater than the critical value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study presents baseline methods for categorizing the multi-class and multi-label data set, utilizing syntactic and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms other models, with accuracies of 71% and 58% for binary fake–hate and severity prediction, respectively.
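The chi-square test itself is standard and easy to reproduce. The 2x2 contingency table below uses invented counts purely to illustrate the procedure the authors report, not FHMLD’s actual label distribution.

```python
# Chi-square test of independence between fakeness and hatefulness,
# on an invented 2x2 contingency table.
from scipy.stats import chi2_contingency

table = [[900, 300],   # fake comments: hateful, non-hateful
         [400, 800]]   # real comments: hateful, non-hateful
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.2g}, dof={dof}")
# If chi2 exceeds the critical value (p < 0.05), independence is
# rejected, i.e., fake narratives and hatefulness are associated.
```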

Citations: 0
Depression symptoms modelling from social media text: an LLM driven semi-supervised learning approach
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-04-04 · DOI: 10.1007/s10579-024-09720-4
Nawshad Farruque, Randy Goebel, Sudhakar Sivapalan, Osmar R. Zaïane

A fundamental component of clinical depression modelling based on user-level social media language is depression symptoms detection (DSD). Unfortunately, no DSD dataset exists that reflects both clinical insights and the distribution of depression symptoms in samples of the self-disclosed depressed population. In this work, we describe a semi-supervised learning (SSL) framework which uses an initial supervised learning model that leverages (1) a state-of-the-art language model pre-trained on a large mental health forum text corpus and further fine-tuned on a clinician-annotated DSD dataset, and (2) a zero-shot learning model for DSD, and couples them together to harvest depression-symptom-related samples from our large self-curated depressive tweets repository (DTR). Our clinician-annotated dataset is the largest of its kind. Furthermore, DTR is created from samples of tweets in self-disclosed depressed users’ Twitter timelines, drawn from two datasets, including one of the largest benchmark datasets for user-level depression detection from Twitter. This further helps preserve the depression symptoms distribution of self-disclosed tweets. Subsequently, we iteratively retrain our initial DSD model with the harvested data. We discuss the stopping criteria and limitations of this SSL process, and elaborate the underlying constructs which play a vital role in the overall SSL process. We show that we can produce a final dataset which is the largest of its kind. Furthermore, a DSD model and a Depression Post Detection model trained on it achieve significantly better accuracy than their initial versions.
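The harvest-and-retrain control flow can be sketched as below. The two scoring callables are trivial stand-ins for the fine-tuned and zero-shot DSD models, and the agreement threshold and stopping rule are assumptions rather than the paper’s exact criteria.

```python
# Runnable sketch of the semi-supervised harvest-and-retrain loop.
from typing import Callable, List, Tuple

def ssl_harvest(score_sup: Callable[[str], float],
                score_zs: Callable[[str], float],
                unlabelled: List[str],
                train: List[Tuple[str, int]],
                threshold: float = 0.9,
                rounds: int = 3) -> List[Tuple[str, int]]:
    for _ in range(rounds):
        # Harvest tweets both models label as symptom-bearing with
        # high confidence.
        kept = [t for t in unlabelled
                if min(score_sup(t), score_zs(t)) >= threshold]
        if not kept:               # stopping criterion: nothing harvested
            break
        train.extend((t, 1) for t in kept)
        unlabelled = [t for t in unlabelled if t not in set(kept)]
        # A real system would retrain score_sup on `train` here.
    return train

demo = ["i can't sleep and feel hopeless", "great game last night"]
toy_sup = lambda t: 0.95 if "hopeless" in t else 0.2   # stand-in model
toy_zs = lambda t: 0.92 if "sleep" in t else 0.3       # stand-in model
print(ssl_harvest(toy_sup, toy_zs, demo, []))
```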

Citations: 0
A longitudinal multi-modal dataset for dementia monitoring and diagnosis
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-03-30 · DOI: 10.1007/s10579-023-09718-4


Dementia affects cognitive functions of adults, including memory, language, and behaviour. Standard diagnostic biomarkers such as MRI are costly, whilst neuropsychological tests suffer from sensitivity issues in detecting dementia onset. The analysis of speech and language has emerged as a promising and non-intrusive technology to diagnose and monitor dementia. Currently, most work in this direction ignores the multi-modal nature of human communication and interactive aspects of everyday conversational interaction. Moreover, most studies ignore changes in cognitive status over time due to the lack of consistent longitudinal data. Here we introduce a novel fine-grained longitudinal multi-modal corpus collected in a natural setting from healthy controls and people with dementia over two phases, each spanning 28 sessions. The corpus consists of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We present the data collection process and describe the corpus in detail. Furthermore, we establish baselines for capturing longitudinal changes in language across different modalities for two cohorts, healthy controls and people with dementia, outlining future research directions enabled by the corpus.
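One kind of longitudinal baseline such a corpus enables, tracking a lexical measure across sessions and fitting a trend line, can be sketched with invented session texts. The measure used here (type-token ratio) is illustrative only and is not claimed to be one of the paper’s reported baselines.

```python
# Sketch: track type-token ratio across sessions and fit a trend.
import numpy as np

def type_token_ratio(text: str) -> float:
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

sessions = [  # invented stand-ins for per-session transcripts
    "we went to the market and bought fresh bread",
    "we went to the market and we bought the bread",
    "the man went to the market and the man bought bread",
]
ttrs = [type_token_ratio(s) for s in sessions]
slope = np.polyfit(range(len(ttrs)), ttrs, deg=1)[0]
print([round(t, 2) for t in ttrs], f"trend slope = {slope:.3f}")
# A persistently negative slope would flag declining lexical diversity.
```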

Citations: 0
DILLo: an Italian lexical database for speech-language pathologists
IF 2.7 · CAS Tier 3, Computer Science · Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2024-03-23 · DOI: 10.1007/s10579-024-09722-2
Federica Beccaria, Angela Cristiano, Flavio Pisciotta, Noemi Usardi, Elisa Borgogni, Filippo Prayer Galletti, Giulia Corsi, Lorenzo Gregori, Gloria Gagliardi

We present DILLo (Database Italiano del Lessico per Logopedisti, i.e., Italian Database for Speech-Language Pathologists), a novel lexical resource for treating speech impairments from childhood to senility. DILLo is a free online web application that allows extraction of filtered wordlists for flexible rehabilitative purposes. Its major aim is to provide Italian speech-language pathologists (SLPs) with a resource that takes advantage of Information and Communication Technologies for language in a healthcare setting. DILLo’s design adopts an integrated approach that envisages fruitful cooperation between clinical and linguistic professionals. The 7690 Italian words in the database have been selected based on phonological, phonotactic, and morphological properties, and their frequency of use. These linguistic features are encoded in the tool, which includes the orthographic and phonological transcriptions and the phonotactic structure of each word. Moreover, most of the entries are associated with their respective ARASAAC pictogram, providing an additional and inclusive tool for treating speech impairments. The user-friendly interface is structured to allow for different and adaptable search options. DILLo thus allows SLPs to obtain a rich, tailored, and varied selection of suitable linguistic stimuli. It can be used to customize the treatment of many impairments, e.g., Speech Sound Disorders, Childhood Apraxia of Speech, Specific Learning Disabilities, aphasia, dysarthria, dysphonia, and the auditory training that follows cochlear implantation.
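The kind of filtered extraction DILLo supports can be pictured as a query over records carrying phonotactic, frequency, and pictogram fields. The records and field names below are invented for illustration and do not reflect DILLo’s actual schema.

```python
# Sketch of filtering a lexical database for rehabilitative wordlists.
LEXICON = [  # invented records with a CV skeleton and frequency
    {"word": "casa", "cv": "CVCV", "freq": 950, "pictogram": True},
    {"word": "strada", "cv": "CCCVCV", "freq": 610, "pictogram": True},
    {"word": "mela", "cv": "CVCV", "freq": 480, "pictogram": False},
]

def filter_wordlist(entries, cv_pattern=None, min_freq=0,
                    needs_pictogram=False):
    """Return words matching the requested phonotactic structure,
    minimum frequency, and pictogram availability."""
    out = []
    for e in entries:
        if cv_pattern and e["cv"] != cv_pattern:
            continue
        if e["freq"] < min_freq or (needs_pictogram and not e["pictogram"]):
            continue
        out.append(e["word"])
    return out

# e.g. high-frequency CVCV items for an articulation drill:
print(filter_wordlist(LEXICON, cv_pattern="CVCV", min_freq=400))
# ['casa', 'mela']
```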

Citations: 0