
Computer Speech and Language: Latest Articles

On improving conversational interfaces in educational systems
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-23. DOI: 10.1016/j.csl.2024.101693

Conversational Intelligent Tutoring Systems (CITS) have drawn increasing interest in education because of their capacity to tailor learning experiences, improve user engagement, and contribute to the effective transfer of knowledge. Conversational agents employ advanced natural language techniques to engage in a convincing human-like tutorial conversation. In solving math word problems, a significant challenge arises in enabling the system to understand user utterances and accurately map extracted entities to the essential problem quantities required for problem-solving, despite the inherent ambiguity of human natural language. In this study, we propose two possible approaches to enhance the performance of a particular CITS designed to teach learners to solve arithmetic–algebraic word problems. Firstly, we propose an ensemble approach to intent classification and entity extraction, which combines the predictions made by two distinct individual models that use constraints defined by human experts. This approach leverages the intertwined nature of the intents and entities to yield a comprehensive understanding of the user’s utterance, ultimately aiming to enhance semantic accuracy. Secondly, we introduce an adapted Term Frequency-Inverse Document Frequency technique to associate entities with problem quantity descriptions. The evaluation was conducted on the AWPS and MATH-HINTS datasets, containing conversational data and a collection of arithmetical and algebraic math problems, respectively. The results demonstrate that the proposed ensemble approach outperforms individual models, and the proposed method for entity–quantity matching surpasses the performance of typical text semantic embedding models.
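The entity–quantity matching step can be illustrated with a small sketch. The abstract does not specify how the TF-IDF technique was adapted, so the code below uses plain TF-IDF with cosine similarity; the whitespace tokenizer, the smoothed IDF formula, and the example quantity descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's preprocessing is not described here.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the IDF table (smoothed IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_entity(entity, descriptions):
    """Return the index of the quantity description most similar to the entity text."""
    vectors, idf = tfidf_vectors(descriptions)
    toks = tokenize(entity)
    tf = Counter(toks)
    # Terms unseen in the descriptions get weight 0 and do not influence the match.
    query = {t: (tf[t] / len(toks)) * idf.get(t, 0.0) for t in tf}
    scores = [cosine(query, v) for v in vectors]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical quantity descriptions for one word problem.
quantities = [
    "price of one apple",
    "number of apples bought",
    "total money spent",
]
best = match_entity("the apples bought", quantities)
print(quantities[best])
```

An exact-token scheme like this one cannot relate "apple" to "apples"; a real system would likely need stemming or similar normalization, which is left out here.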

Citations: 0
A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-22. DOI: 10.1016/j.csl.2024.101691

Introduction

Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline, as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. The main innovations and strengths of the work are the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); the use of Italian (there are few corpora for Italian); the size of the analyzed data (more than 200 conversations were considered); and the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models, for analyses such as perplexity, sentiment (polarity) and emotions.

Methods

Analyzing real-world interactions not designed with computational analysis in mind, as is the case with the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools was employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values.
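For readers unfamiliar with the perplexity metric mentioned above, the following minimal sketch shows the standard definition: the exponential of the average negative log-probability a language model assigns to each token. The per-token probabilities here are hypothetical values; in the study they would come from a transformer-based language model that the abstract does not name.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities for two utterances of equal length.
predictable = [0.5, 0.5, 0.5, 0.5]
surprising = [0.5, 0.05, 0.5, 0.05]

print(perplexity(predictable))
print(perplexity(surprising))
```

The utterance containing low-probability tokens receives the higher perplexity, which is the property that makes the metric a candidate signal for atypical language.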

Results and discussion

Correlations between MMSE and individual DLBs were weak, reaching at most 0.19 for positive and -0.21 for negative correlation values. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguistic register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods and modal verbs, as well as a flattening in the use of tenses towards the present to the detriment of the past. The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. In the categorization tasks, the classifiers based on DLBs achieved an F1 score of 0.79 in the binary classification between "severe" and "mild", and an F1 score of 0.61 in the multi-label classification. Sentiment and emotion analysis showed that joy trends inversely with MMSE score: people with milder impairment appear less happy, or more "negative", than the others. Given the real-world context, this is consistent with the hypothesis that awareness progressively diminishes in people affected by dementia. Finally, combining the various analyses proved effective in providing broader information about language and communication impairment, as well as more precise data on the progression of dementia.
Citations: 0
PaSCoNT - Parallel Speech Corpus of Northern-central Thai for automatic speech recognition
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-22. DOI: 10.1016/j.csl.2024.101692

This paper presents the Parallel Speech Corpus of Northern-central Thai (PaSCoNT). The purpose of this research is not only to understand the differing linguistic characteristics of Northern and Central Thai, but also to use the corpus for automatic speech recognition (ASR). The corpus is composed of speech data from everyday dialogues among Northern Thai people. We designed 2,000 Northern Thai sentences covering all phonemes, in collaboration with linguists specializing in the Northern Thai dialect. The participants were 200 Northern Thai dialect speakers who had been living in Chiang Mai province for more than 18 years. The speech was recorded in both open and closed environments. Each speaker read 100 pairs of Northern-Central Thai sentences, to ensure that the paired speech data comes from the same speaker. In total, 100 h of speech were recorded: 50 h of Northern Thai and 50 h of Central Thai. Overall, PaSCoNT consists of 907,832 words and 6,279 vocabulary items. Statistical analysis of the corpus revealed that 49.64% of the words in the lexicon belong to the Northern Thai dialect and 50.36% to the Central Thai dialect, and that 1,621 vocabulary items appear in both. Statistical analysis was also used to examine differences in speech tempo between Northern and Central Thai, measured as time per phoneme (TTP) and syllables per minute (SPM). The results revealed statistically significant differences in speech tempo: the TTP speaking and articulation rates of Central Thai are lower than those of Northern Thai, whereas its SPM speaking and articulation rates are higher. The results also showed that an ASR model trained on the Northern Thai speech corpus yields a lower WER% when tested on Northern Thai speech and a higher WER% when tested on Central Thai speech, and vice versa. However, an ASR model trained on the PaSCoNT speech corpus yields a lower WER% on both Northern Thai and Central Thai test data.
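The WER% figures above refer to word error rate. A minimal sketch of the usual definition follows: the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch assumes the text is already split into words by whitespace; Thai is written without spaces between words, so a real evaluation would need a word segmenter first, and the placeholder tokens below are purely illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference word
                            curr[j - 1] + 1,       # insertion of a hypothesis word
                            prev[j - 1] + cost))   # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)

# Placeholder tokens: one substitution (b -> x) and one deletion (d) over four reference words.
print(wer("a b c d", "a x c"))
```

Multiplying the returned value by 100 gives the percentage form reported in the abstract.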

Citations: 0
Generalizing Hate Speech Detection Using Multi-Task Learning: A Case Study of Political Public Figures
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-17. DOI: 10.1016/j.csl.2024.101690

Automatic identification of hateful and abusive content is vital in combating the spread of harmful online content and its damaging effects. Most existing works evaluate models by examining the generalization error on train–test splits of hate speech datasets. These datasets often differ in their definitions and labeling criteria, leading to poor generalization performance when predicting across new domains and datasets. This work proposes a new Multi-task Learning (MTL) pipeline that trains simultaneously across multiple hate speech datasets to construct a more encompassing classification model. Using a dataset-level leave-one-out evaluation (designating one dataset for testing and jointly training on all others), we trial the MTL detection on new, previously unseen datasets. Our results consistently outperform a large sample of existing work. We show strong results when examining the generalization error in train–test splits and substantial improvements when predicting on previously unseen datasets. Furthermore, we assemble a novel dataset, dubbed PubFigs, focusing on the problematic speech of American Public Political Figures. We crowdsource labels for more than 20,000 tweets using Amazon MTurk and machine-label problematic speech in all 305,235 tweets in PubFigs. We find that abusive and hateful tweeting mainly originates from right-leaning figures and relates to six topics, including Islam, women, ethnicity, and immigrants. We show that MTL builds embeddings that can simultaneously separate abusive from hate speech and identify its topics.
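The dataset-level leave-one-out protocol described above can be sketched as a simple split generator: each dataset in turn is held out for testing while all remaining datasets are pooled for training. The dataset names and toy examples below are hypothetical, and the sketch covers only the splitting logic, not the MTL model itself.

```python
def leave_one_dataset_out(datasets):
    """Yield (held_out_name, train_examples, test_examples) for each dataset in turn."""
    for held_out in datasets:
        # Pool the examples of every dataset except the held-out one.
        train = [
            example
            for name, examples in datasets.items()
            if name != held_out
            for example in examples
        ]
        yield held_out, train, list(datasets[held_out])

# Hypothetical dataset names with toy (text, label) pairs.
datasets = {
    "dataset_a": [("text 1", 0), ("text 2", 1)],
    "dataset_b": [("text 3", 1)],
    "dataset_c": [("text 4", 0), ("text 5", 0)],
}

for name, train, test in leave_one_dataset_out(datasets):
    print(name, len(train), len(test))
```

Because the held-out dataset never contributes training examples, each fold measures performance on a labeling scheme the model has not seen.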

Citations: 0
Improving text classification via computing category correlation matrix from text graph
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-09. DOI: 10.1016/j.csl.2024.101688

In text classification tasks, models have shown remarkable accuracy across various datasets. However, confusion often arises when certain categories within a dataset are too similar, causing some samples to be misclassified. This paper proposes an improved method for this problem: a three-layer text graph is built for the corpus and used to calculate the Category Correlation Matrix (CCM). Additionally, this paper introduces category-adaptive contrastive learning for the text embeddings produced by the encoder, enhancing the model's ability to distinguish between samples in easily confused categories. Soft labels are generated using this matrix to guide the classifier, preventing the model from becoming overconfident with one-hot vectors. The efficacy of this approach was demonstrated through experimental evaluations on three text encoders and six different datasets.
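One way to picture the soft-label step is to blend the one-hot target with the correlation row of the true class, so that confusable categories receive a small share of the probability mass. The abstract does not give the actual formula, so the linear blend, the `alpha` mixing weight, and the example matrix below are assumptions for illustration only.

```python
def soft_label(true_class, ccm, alpha=0.1):
    """Blend a one-hot target with the normalized correlation row of the true class.

    Assumed formula: (1 - alpha) * one_hot + alpha * row / sum(row).
    """
    row = ccm[true_class]
    total = sum(row)
    soft = []
    for k, value in enumerate(row):
        one_hot = 1.0 if k == true_class else 0.0
        soft.append((1 - alpha) * one_hot + alpha * value / total)
    return soft

# Hypothetical 3-class correlation matrix: classes 0 and 1 are easily confused.
ccm = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

print(soft_label(0, ccm))
```

Under this assumed blend the target still sums to 1, the true class keeps most of the mass, and the class most correlated with it receives more than an unrelated class.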

Citations: 0
C-KGE: Curriculum learning-based Knowledge Graph Embedding
IF 3.1, Zone 3 (Computer Science), Q2 in COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-08. DOI: 10.1016/j.csl.2024.101689
Diange Zhou, Shengwen Li, Lijun Dong, Renyao Chen, Xiaoyue Peng, Hong Yao

Knowledge graph embedding (KGE) aims to embed entities and relations in knowledge graphs (KGs) into a continuous, low-dimensional vector space. It has been shown to be an effective tool for integrating knowledge graphs to improve various intelligent applications, such as question answering and information extraction. However, previous KGE models ignore the hidden natural order of knowledge learning when learning the embeddings of entities and relations, leaving room for improvement in their performance. Inspired by the easy-to-hard pattern used in human knowledge learning, this paper proposes a Curriculum learning-based KGE (C-KGE) model, which learns the embeddings of entities and relations from "basic knowledge" to "domain knowledge". Specifically, a seed set representing the basic knowledge and several knowledge subsets are identified from the KG. Then, entity overlap is employed to score the learning difficulty of each subset. Finally, C-KGE trains the entities and relations in each subset according to that subset's learning difficulty score. C-KGE leverages trained embeddings of the seed set as prior knowledge and learns knowledge subsets iteratively to transfer knowledge between the seed set and subsets, smoothing the learning process of knowledge facts. Experimental results on real-world datasets demonstrate that the proposed model improves embedding performance and reduces training time. Our code and data will be released later.
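The difficulty-scoring idea can be sketched as follows. The abstract says only that entity overlap is used to score each subset; the specific score below (the fraction of a subset's entities not already present in the seed set, with lower meaning easier) and the toy triples are assumptions, not the authors' formulation.

```python
def entities(triples):
    """Collect head and tail entities from (head, relation, tail) triples."""
    return {e for h, _, t in triples for e in (h, t)}

def difficulty(subset, known_entities):
    """Assumed score: fraction of the subset's entities not yet seen (0 = easiest)."""
    ents = entities(subset)
    return 1.0 - len(ents & known_entities) / len(ents)

def curriculum_order(seed, subsets):
    """Order subset names from easiest to hardest relative to the seed set."""
    known = entities(seed)
    return sorted(subsets, key=lambda name: difficulty(subsets[name], known))

# Hypothetical seed set ("basic knowledge") and two knowledge subsets.
seed = [("paris", "capital_of", "france"), ("france", "part_of", "europe")]
subsets = {
    "chemistry": [("water", "contains", "hydrogen")],
    "geography": [("paris", "located_in", "europe"), ("lyon", "located_in", "france")],
}

print(curriculum_order(seed, subsets))
```

A training loop would then visit the subsets in this order, initializing each stage from the embeddings learned so far.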

Citations: 0
Seq2Seq dynamic planning network for progressive text generation
IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-06 DOI: 10.1016/j.csl.2024.101687

Long text generation is an active research topic in natural language processing. To address insufficient semantic representation and incoherent text generation in existing long text models, the Seq2Seq dynamic planning network progressive text generation model (DPPG-BART) is proposed. In the data pre-processing stage, a lexical division sorting algorithm is used. To obtain hierarchical sequences of keywords with clear information content, word weight values are calculated and ranked by TF-IDF of word embedding. To enhance the input representation, a dynamic planning progressive generation network is constructed. Positional features and word embedding vector features are integrated at the input side of the model. At the same time, to enrich the semantic information and expand the content of the text, relevant concept words are generated by a concept expansion module. A scoring network and feedback mechanism are used to adjust the concept expansion module. Experimental results show that DPPG-BART outperforms the GPT2-S, GPT2-L, BART and ProGen-2 models on the MSJ, B-BLEU and FBD metrics on long text datasets from two different domains, CNN and Writing Prompts.
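
The keyword-ranking step mentioned in this abstract can be illustrated with a minimal sketch. It uses plain token-level TF-IDF, not the paper's "TF-IDF of word embedding" variant, whose details the abstract does not give. The function name `rank_keywords`, the whitespace tokenizer, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_keywords(docs, doc_index, top_k=3):
    """Rank the words of docs[doc_index] by TF-IDF weight, highest first.

    Hypothetical helper: plain TF-IDF over whitespace tokens, used here only
    to illustrate ranking keywords by weight.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    toks = tokenized[doc_index]
    tf = Counter(toks)
    scores = {
        w: (count / len(toks)) * math.log(n_docs / df[w])
        for w, count in tf.items()
    }
    # Sort by descending weight; break ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy corpus (invented for this example).
docs = [
    "the storm hit the coast",
    "the team won the match",
    "the storm delayed the match",
]
print(rank_keywords(docs, 0, top_k=2))
```

Words that occur in every document (here "the") receive weight zero and fall to the bottom of the ranking, which is the property that makes such weights usable for selecting informative keywords.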

Citations: 0
Modified R-BERT with global semantic information for relation classification task
IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-06 DOI: 10.1016/j.csl.2024.101686

The objective of the relation classification task is to extract relations between entities. Recent studies have found that R-BERT (Wu and He, 2019), which is based on pre-trained BERT (Devlin et al., 2019), achieves very good results on this task. However, the method takes into account neither the semantic differences between different kinds of entities nor global semantic information. In this paper, we use two different fully connected layers to account for the semantic difference between subject and object entities. We also build a new module, the Concat Module, to fully fuse the semantic information of the subject entity vector, the object entity vector, and the representation vector of the whole sample sentence. In addition, we apply average pooling to obtain a better representation of each entity, and add an activation operation with a new fully connected layer after the Concat Module. By modifying R-BERT in this way, we propose a new model named BERT with Global Semantic Information (GSR-BERT) for relation classification tasks. We evaluate our approach on two datasets: the SemEval-2010 Task 8 dataset and the Chinese character relationship classification dataset. Our approach achieves a significant improvement on both datasets, which indicates that it transfers across different datasets. Furthermore, we show that the strategies used in our approach are also applicable to the named entity recognition task.
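
The pooling-and-fusion idea in this abstract can be sketched roughly as follows. This is not the GSR-BERT implementation: the random `hidden` matrix stands in for encoder outputs, `W_subj` and `W_obj` are hypothetical weights for the two separate fully connected layers, and using position 0 as the sentence vector (the usual `[CLS]` convention) is an assumption. The activation and final classification layer are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 4

# Stand-in for per-token encoder outputs (assumption: random values).
hidden = rng.normal(size=(seq_len, dim))
# Two separate (hypothetical) fully connected layers, one per entity role.
W_subj = rng.normal(size=(dim, dim))
W_obj = rng.normal(size=(dim, dim))


def fuse(hidden, subj_span, obj_span, W_subj, W_obj):
    """Average-pool each entity span, project it with its own layer, and
    concatenate both with the sentence vector. Spans are (start, end) with
    end exclusive."""
    subj = hidden[subj_span[0]:subj_span[1]].mean(axis=0) @ W_subj
    obj = hidden[obj_span[0]:obj_span[1]].mean(axis=0) @ W_obj
    sent = hidden[0]  # assumption: position 0 holds the sentence-level vector
    return np.concatenate([sent, subj, obj])


fused = fuse(hidden, (1, 3), (4, 6), W_subj, W_obj)
print(fused.shape)
```

The fused vector has three times the hidden size, since it stacks the sentence, subject, and object representations; a classifier would then operate on this combined vector.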

Citations: 0
Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge
IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-06 DOI: 10.1016/j.csl.2024.101685
Simon Leglaive , Matthieu Fraticelli , Hend ElGhazaly , Léonie Borne , Mostafa Sadeghi , Scott Wisdom , Manuel Pariente , John R. Hershey , Daniel Pressnitzer , Jon P. Barker

Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.
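
The kind of analysis described here, checking how well an objective metric tracks subjective ratings, is commonly done with a correlation coefficient. The sketch below computes a Pearson correlation from scratch. The abstract does not say which correlation measure the authors used, and the rating and metric values are invented placeholders, not results from the challenge.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical per-system scores, made up purely for illustration.
subjective_ratings = [2.1, 2.8, 3.0, 3.6]
objective_metric = [1.9, 2.2, 3.1, 3.3]
print(round(pearson(subjective_ratings, objective_metric), 3))
```

A value close to 1 would indicate that the objective metric ranks systems much as listeners do, while a value near 0 would correspond to the limited correlation the abstract reports for some non-intrusive metrics.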

Citations: 0
Multilingual non-intrusive binaural intelligibility prediction based on phone classification
IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-07-03 DOI: 10.1016/j.csl.2024.101684
Jana Roßbach , Kirsten C. Wagener , Bernd T. Meyer

Speech intelligibility (SI) prediction models are a valuable tool for the development of speech processing algorithms for hearing aids or consumer electronics. For use in realistic environments, it is desirable that the SI model is non-intrusive (does not require separate input of original and degraded speech, transcripts or a-priori knowledge about the signals) and performs binaural processing of the audio signals. Most of the existing SI models do not fulfill all of these criteria. In this study, we propose an SI model based on phone probabilities obtained from a deep neural net. The model comprises a binaural enhancement stage for prediction of the speech recognition threshold (SRT) in realistic acoustic scenes. In the first part of the study, SRT predictions in different spatial configurations are compared to the results from normal-hearing listeners. On average, our approach produces lower errors and higher correlations compared to three intrusive baseline models. In the second part, we explore whether measures relevant in spatial hearing, i.e., the intelligibility level difference (ILD) and the binaural ILD (BILD), can be predicted with our modeling approach. We also investigate whether a language mismatch between training and testing the model plays a role when predicting ILD and BILD. This point is especially important for low-resource languages, for which thousands of hours of language material are not available for training. Binaural benefits are predicted by our model with an error of 1.5 dB. This is slightly higher than the error of the competitive baseline MBSTOI (1.1 dB), but our model does not require separate input of original and degraded speech. We also find that good binaural predictions can be obtained with models that are not specifically trained with the target language.

Citations: 0