
Latest Publications in Computer Speech and Language

Entrainment detection using DNN
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-26 | DOI: 10.1016/j.csl.2025.101930
Jay Kejriwal, Štefan Beňuš, Lina M. Rojas-Barahona
During conversation, speakers adjust their linguistic characteristics to become more similar to their partners. This complex phenomenon is known as entrainment, and speakers dynamically entrain as well as disentrain on different linguistic features. Researchers have utilized a range of computational methods to explore entrainment. Recent technological advancements have facilitated the use of deep learning, which offers a systematic quantification of acoustic entrainment dynamics. In this study, we investigate the capability of deep learning architectures to extract and leverage textual features for the efficient representation and learning of entrainment. By adjusting the architecture of an acoustic-based DNN entrainment model, we present an unsupervised deep learning framework that derives representations from textual features containing relevant information for identifying entrainment at three linguistic levels: lexical, syntactic, and semantic. To investigate the performance of each model within the proposed framework, various text-based and speech features were extracted. Entrainment was quantified using different distance measures in the representation space. The performance of the trained models was evaluated by distinguishing real and sham conversations using the proposed distances. Our results suggest that acoustic-based DNN models outperform text-based DNN models and that distance measures affect the models’ performance.
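To make the distance-based evaluation concrete, the sketch below compares a real conversation pair with a sham pair in a shared representation space, where a smaller distance for the real pair is taken as evidence of entrainment. The 128-dimensional embeddings, the function name, and the choice of metrics are illustrative assumptions, not the paper's exact DNN representations or distance measures.

```python
import numpy as np

def entrainment_distance(emb_a, emb_b, metric="cosine"):
    """Distance between two partners' turn-level representations.
    Smaller values indicate stronger entrainment under this convention."""
    a, b = np.asarray(emb_a, dtype=float), np.asarray(emb_b, dtype=float)
    if metric == "cosine":
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    if metric == "l1":
        return float(np.abs(a - b).mean())
    return float(np.linalg.norm(a - b))  # Euclidean fallback

# Real pair vs. sham pair (partner drawn from a different conversation):
# a model "detects" entrainment when the real-pair distance is smaller.
rng = np.random.default_rng(0)
real_a, real_b = rng.normal(size=128), rng.normal(size=128)
sham_b = rng.normal(size=128)
print(entrainment_distance(real_a, real_b), entrainment_distance(real_a, sham_b))
```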
Citations: 0
Pitch-Aware multi-feature fusion for classifying statements, questions, and exclamations in low-resource languages
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-07 | DOI: 10.1016/j.csl.2026.101941
Ayub Othman Abdulrahman
Automatic classification of statements, questions, and exclamations is important for dialogue systems, speech analytics, language documentation, and other human-computer interaction tasks. Speech pitch and prosody are central cues for these categories, but pitch-based classification remains challenging due to speaker variability, recording conditions, and overlapping prosodic patterns across classes, especially in low-resource settings. We present an innovative multi-feature fusion architecture that combines pretrained wav2vec 2.0 raw-waveform embeddings (transfer learning), 40-dimensional Mel-Frequency Cepstral Coefficient (MFCC) features, and Mel-spectrogram representations into an integrated framework. Our work explicitly depends on pitch-related cues (captured primarily by the waveform embeddings and spectrogram branch) together with complementary MFCC spectral features, which jointly improve robustness. The model concatenates 128-dimensional representations from each branch and refines the fused vector with fully connected layers. This study leverages SQEBSP, a recently published pitch-annotated Kurdish speech dataset collected by the authors, comprising 12,660 utterances from 431 speakers, to evaluate statement, question, and exclamation classification. The proposed method achieves approximately 97% accuracy on the training/validation data, and about 88% accuracy on a separate held-out test set comprising 20% of the dataset, substantially outperforming single-feature baselines (58.8–79.3%) and prior three-class systems (68.0%). Ablation experiments confirm that the pitch-related inputs contribute substantially to classification accuracy, while MFCC features provide complementary spectral/timbre information. Our research indicates that the combination of pretrained wav2vec 2.0 representations with multi-feature fusion and supervised fine-tuning provides an efficient method for pitch-informed speech classification in low-resource scenarios.
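A minimal sketch of the fusion head described above: three branch representations (wav2vec 2.0, MFCC, Mel-spectrogram), each assumed to have been reduced to 128 dimensions upstream, are concatenated and refined by fully connected layers into a three-way classifier. The hidden size, dropout, and class ordering are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Late-fusion head over three 128-d branch embeddings
    (wav2vec 2.0, MFCC, Mel-spectrogram branches assumed upstream)."""
    def __init__(self, branch_dim=128, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(3 * branch_dim, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, n_classes),
        )

    def forward(self, wav2vec_emb, mfcc_emb, melspec_emb):
        fused = torch.cat([wav2vec_emb, mfcc_emb, melspec_emb], dim=-1)
        return self.head(fused)

model = FusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 3]): statement / question / exclamation
```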
Citations: 0
QuAVA: A privacy-aware architecture for conversational desktop Content Retrieval systems
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-02-02 | DOI: 10.1016/j.csl.2026.101950
Nikolaos Malamas, Andreas L. Symeonidis, John B. Theocharis
Question Answering (QA) and Content Retrieval (CR) systems have experienced a boost in performance in recent years by leveraging state-of-the-art Transformer models to process user expressions and to retrieve and extract the requested information. Despite the constant improvements in language understanding, very little effort has been put into the design of such systems for personal desktop use, where data are kept locally rather than sent to cloud services, and where decisions and outputs are transparent and explainable to the user. To that end, we present QuAVA, a conversational desktop content retrieval assistant designed on four pillars: privacy and security, explainability, low-resource requirements, and multi-source data fusion. QuAVA is a data- and privacy-preserving assistant that enables users to access their private data, such as files, emails, and message exchanges, conversationally and transparently. The proposed architecture automatically extracts and preprocesses content from various sources and organizes it in a 3-layered hierarchical structure, namely a topic, a subtopic, and a content layer, by employing ML algorithms for clustering and labeling. This way, users can navigate and access information via a set of conversation rules embedded in the assistant. We conduct a qualitative comparison of the QuAVA architecture with other well-established QA and CR architectures against the four pillars defined, as well as privacy tests, and conclude that QuAVA is, to our knowledge, the only virtual assistant that successfully satisfies them.
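The topic/subtopic/content hierarchy can be approximated with standard clustering, as in the toy sketch below; the TF-IDF features, KMeans clustering, and the example documents are assumptions made for illustration and are not necessarily the ML algorithms QuAVA actually employs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["quarterly budget spreadsheet", "budget review email thread",
        "holiday photos from trip", "flight booking confirmation"]

# Layer 1: cluster all locally stored documents into coarse topics.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
topics = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Layer 2: within each topic, a second clustering pass would yield subtopics;
# Layer 3 keeps the raw content items themselves.
hierarchy = {}
for t, doc in zip(topics, docs):
    hierarchy.setdefault(int(t), []).append(doc)
print(hierarchy)
```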
Citations: 0
Enhanced audio-visual speech enhancement with posterior sampling methods in recurrent variational autoencoders
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-06 | DOI: 10.1016/j.csl.2025.101923
Z. Foroushi, R.M. Dansereau
Recovering intelligible speech in noise is essential for robust communication. This work presents an audio-visual speech enhancement framework based on a Recurrent Variational Autoencoder (AV-RVAE), where posterior inference is extended using sampling-based methods including the Metropolis-Adjusted Langevin Algorithm (MALA), Langevin Dynamics EM (LDEM), Hamiltonian Monte Carlo (HMC), Barker sampling, and a hybrid MALA+Barker variant. To isolate the contribution of visual cues, an audio-only baseline (A-RVAE) is trained and evaluated under identical data and inference conditions.
Performance is assessed using Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI), along with anytime convergence curves (metric versus wall-clock time) and the Real-Time Factor (RTF; ratio of runtime to audio duration) to measure computational efficiency.
Experimental results show that the hybrid MALA+Barker sampler achieves the best overall performance; while LDEM and step-size-optimized MALA exhibit the lowest RTFs, the MALA+Barker sampler offers the most favorable balance between efficiency and enhancement quality. Across all sampling strategies, the AV-RVAE consistently surpasses the audio-only baseline, particularly at low SNRs, confirming the benefit of visual fusion combined with advanced posterior sampling for robust speech enhancement in challenging acoustic environments.
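For readers unfamiliar with MALA, the sketch below shows one Metropolis-Adjusted Langevin step on a latent vector given an arbitrary unnormalized log-posterior; in the paper that posterior would come from the AV-RVAE decoder and noise model, which are not reproduced here. The step size and the toy Gaussian target are illustrative assumptions.

```python
import torch

def mala_step(z, log_post, step=1e-2):
    """One Metropolis-Adjusted Langevin step on a latent z.
    log_post: callable returning a scalar (unnormalized) log-posterior."""
    z = z.detach().requires_grad_(True)
    lp = log_post(z)
    grad = torch.autograd.grad(lp, z)[0]
    mean_fwd = z + step * grad
    prop = mean_fwd + (2 * step) ** 0.5 * torch.randn_like(z)

    prop = prop.detach().requires_grad_(True)
    lp_prop = log_post(prop)
    grad_prop = torch.autograd.grad(lp_prop, prop)[0]
    mean_bwd = prop + step * grad_prop

    # Gaussian proposal densities: log q(z | prop) and log q(prop | z).
    log_q_fwd = -((prop - mean_fwd) ** 2).sum() / (4 * step)
    log_q_bwd = -((z - mean_bwd) ** 2).sum() / (4 * step)
    log_alpha = lp_prop + log_q_bwd - lp - log_q_fwd
    accept = bool(torch.rand(()).log() < log_alpha)
    return (prop if accept else z).detach(), accept

# Toy target: standard Gaussian posterior over a 16-d latent.
z = torch.zeros(16)
z, ok = mala_step(z, lambda v: -0.5 * (v ** 2).sum())
print(z.shape, ok)
```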
Citations: 0
Leveraging saliency-based pre-trained foundation model representations to uncover breathing patterns in speech
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-20 | DOI: 10.1016/j.csl.2025.101926
Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi
The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure RR (the number of breaths one takes in a minute) rely on specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measure this vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning-based approach to estimate RR from speech segments obtained from subjects speaking into a close-talking microphone device. Data were collected from N=26 individuals, where the ground-truth RR was obtained through commercial-grade chest belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time series with low root-mean-squared error and high correlation coefficient when compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) of approximately 1.6 breaths/min.
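One simple way to turn a predicted respiration waveform into an RR estimate is peak counting, sketched below with a synthetic signal; the sampling rate, peak-distance constraint, and synthetic waveform are assumptions and do not reproduce the paper's Conv-LSTM output or post-processing.

```python
import numpy as np
from scipy.signal import find_peaks

def respiratory_rate(resp_signal, fs):
    """Estimate breaths/min from a (model-predicted) respiration waveform
    by counting inhalation peaks; fs is the waveform sampling rate in Hz."""
    # Require peaks to be at least ~1.5 s apart (i.e. < 40 breaths/min).
    peaks, _ = find_peaks(resp_signal, distance=int(1.5 * fs))
    duration_min = len(resp_signal) / fs / 60.0
    return len(peaks) / duration_min

# Synthetic 60 s waveform at 25 Hz with a 0.27 Hz breathing cycle (~16 bpm).
fs = 25
t = np.arange(0, 60, 1 / fs)
sig = np.sin(2 * np.pi * 0.27 * t) + 0.05 * np.random.randn(t.size)
print(round(respiratory_rate(sig, fs), 1))
```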
Citations: 0
Decoding phone pairs from MEG signals across speech modalities
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-05 | DOI: 10.1016/j.csl.2026.101939
Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Mathieu Bourguignon, Nicola Molinaro
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography (MEG) signals to perform binary phone-pair classification from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 18 participants, we performed pairwise phone classification, extending our analysis to 20 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (73.4%) compared to passive listening and playback modalities (approximately 51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. Besides, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2 Hz to 3 Hz) and Theta (4 Hz to 7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contribute, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
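As a rough, self-contained stand-in for the winning decoder, the snippet below fits an elastic-net-penalized logistic regression on standardized, flattened features, which is how scikit-learn exposes Elastic Net classification; the random data, dimensionality, and hyperparameters are placeholders and do not reflect the study's MEG preprocessing or configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for flattened MEG trials (n_trials x n_channels*n_times);
# real dimensionality would be far higher than in this toy example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)          # binary phone-pair labels

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
print(cross_val_score(clf, X, y, cv=5).mean())
```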
Citations: 0
Artificial protozoa lotus effect algorithm enabled cognitive brain optimal model for sentiment analysis utilizing multimodal data
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-12-22 | DOI: 10.1016/j.csl.2025.101929
Sanjeevkumar Angadi, Saili Hemant Sable, Tejaswini Zope, Rajani Amol Hemade, Vaibhavi Umesh Avachat
Understanding public sentiment derived from online data is a challenging research problem with numerous applications, including contextual analysis and opinion assessment on specific events. Traditionally, sentiment analysis has concentrated on a single modality, such as text or images. However, utilizing multimodal information such as images, text, and audio can enhance model accuracy. Despite this advantage, combining visual and textual features often leads to decreased performance, mainly because models fail to efficiently capture the intricate relationships amongst diverse modalities. To confront these challenges, a new technique named the Artificial Protozoa Lotus Effect Algorithm Cognitive Brain Optimal Model (APLEA_CBO) has been developed for sentiment analysis using multimodal data. Initially, feature extraction is performed on audio data to obtain feature vector outcome-1. Similarly, feature extraction is conducted on the input text to extract suitable features, which are considered outcome-2. Both feature sets are then processed for sentiment analysis using the Cognitive Brain Optimal Model (CBOM), which is developed by employing Recurrent Denoising Long Short-Term Memory (RD-LSTM). The CBOM is trained using the Artificial Protozoa Lotus Effect Algorithm (APLEA), which integrates Artificial Protozoa Optimization (APO) and the Lotus Effect Algorithm (LEA). The APLEA_CBO model achieves an FPR of 7.17%, a recall of 92.76%, a precision of 90.62%, and an accuracy of 90.60%.
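For reference, the reported figures are standard confusion-matrix quantities; the snippet below computes accuracy, precision, recall, and FPR for a binary labeling, with toy labels used purely as placeholders (it does not reproduce the APLEA_CBO pipeline).

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and FPR from binary labels
    (1 = positive sentiment class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```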
Citations: 0
Emotion-guided cross-modal alignment for multimodal depression detection
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.csl.2026.101951
Wenzhe Jia, Yuhang Wang, Yahui Kang
Depression detection from multimodal data is crucial for early intervention and mental health monitoring. Existing systems, however, face three challenges: (i) capturing subtle affective cues that distinguish depressive states from normal emotional variations, (ii) establishing reliable correspondence between heterogeneous speech and text modalities, and (iii) handling severe class imbalance in real-world corpora. To address these challenges, we propose a framework that integrates explicit emotion supervision, cross-modal alignment, and metric-oriented optimization for robust multimodal depression detection. Acoustic and lexical features are augmented with emotion-category embeddings derived from supervision signals to provide affective context, while semantic correspondence is reinforced through a contrastive alignment objective. To mitigate imbalance, we directly optimize macro-F1 with the Lovász loss. On the Emotional Audio-Textual Depression Corpus (EATD-Corpus), our framework achieves 87.40% ± 0.46% macro-F1 with dataset-provided emotions and 83.15% with predicted emotions, compared to 71.82% without emotion information. Cross-dataset evaluation on the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) shows consistent gains, including a 12.34% F1 improvement with emotion augmentation. This integrated approach—combining emotion supervision, cross-modal alignment, and metric-oriented optimization—represents a novel contribution to depression detection. Our framework provides a practical and robust solution for real-world multimodal depression detection.
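A contrastive alignment objective of this kind is typically close in spirit to a symmetric InfoNCE loss over paired speech and text embeddings, sketched below; the temperature, embedding size, and symmetric formulation are assumptions rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style objective: the i-th speech embedding should
    match the i-th text embedding of the same utterance and no other."""
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```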
Citations: 0
On the use of DiaPer models and matching algorithm for RTVE speaker diarization 2024 dataset
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.csl.2026.101948
Juan Ignacio Alvarez-Trejos, Sara Barahona, Laura Herrera-Alarcon, Jérémie Touati, Alicia Lozano-Diez
Speaker diarization in broadcast media presents significant challenges due to long-duration recordings, numerous speakers, and complex acoustic conditions. End-to-end neural diarization models like DiaPer (Diarization with Perceiver), which directly predict speaker activity from audio features without intermediate clustering steps, have shown promising results. However, their application to extended recordings remains computationally prohibitive due to quadratic complexity with respect to input length. This paper addresses these limitations by proposing a framework that applies DiaPer to short audio chunks and subsequently reconciles speaker identities across segments using a matching algorithm. We systematically analyze optimal chunk durations for DiaPer processing and introduce an enhanced chunk-matching algorithm leveraging state-of-the-art speaker embeddings, comparing Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN), Residual Networks (ResNet), and Reshape Dimensions Network (ReDimNet) architectures. Our experimental evaluation on the challenging Radio Televisión Española (RTVE) datasets shows that ReDimNet embeddings consistently outperform alternatives, achieving substantial improvements in speaker identity consistency across segments. The proposed approach yields a Diarization Error Rate (DER) of 17.34% on the RTVE 2024 test set, which is competitive with state-of-the-art systems while achieving a 63.6% relative improvement over the baseline DiaPer model applied directly to complete audio recordings. This demonstrates that end-to-end neural approaches can be successfully extended to hour-long recordings while maintaining computational efficiency.
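Reconciling per-chunk speaker labels with already-seen global speakers can be cast as an assignment problem over embedding similarities, as in the sketch below; the cosine-similarity cost, the Hungarian assignment, and the 192-dimensional vectors are illustrative assumptions, not necessarily the paper's exact matching algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_chunk_speakers(global_centroids, chunk_embeddings):
    """Map local speaker labels of one chunk to already-seen global speakers
    by maximizing cosine similarity with the Hungarian algorithm."""
    G = global_centroids / np.linalg.norm(global_centroids, axis=1, keepdims=True)
    C = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    cost = -C @ G.T                     # negate similarity to get an assignment cost
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))  # local index -> global index

rng = np.random.default_rng(0)
global_centroids = rng.normal(size=(3, 192))          # speaker-embedding-sized vectors
chunk_embeddings = global_centroids[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 192))
print(match_chunk_speakers(global_centroids, chunk_embeddings))  # {0: 2, 1: 0, 2: 1}
```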
Citations: 0
Keyword Mamba: Spoken keyword spotting with state space models
IF 3.4 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-07-01 | Epub Date: 2025-11-27 | DOI: 10.1016/j.csl.2025.101909
Hanyu Ding, Wenlong Dong, Qirong Mao
Keyword spotting (KWS) is an essential task in speech processing. It is widely used in voice assistants and smart devices. Deep learning models like CNNs, RNNs, and Transformers have performed well in KWS. However, they often struggle to handle long-term patterns and stay efficient at the same time. In this work, we present Keyword Mamba, a new architecture for KWS. It uses a neural state space model (SSM) called Mamba. We apply Mamba along the time axis and also explore how it can replace the self-attention part in Transformer models. We test our model on the Google Speech Commands datasets. The results show that Keyword Mamba reaches strong accuracy with fewer parameters and lower computational cost. To our knowledge, this is the first time a state space model has been used for KWS. These results suggest that Mamba has strong potential in speech-related tasks.
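A state space model in this setting is a learned linear recurrence scanned along time; the sketch below shows the basic discrete recurrence with fixed matrices, whereas selective-scan models such as Mamba make the parameters input-dependent. The dimensions and random parameters are placeholders, not Keyword Mamba's configuration.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discrete state-space recurrence x_t = A x_{t-1} + B u_t,
    y_t = C x_t, scanned along the time axis of an acoustic feature sequence."""
    T, _ = u.shape
    N = A.shape[0]
    x = np.zeros(N)
    ys = []
    for t in range(T):
        x = A @ x + B @ u[t]
        ys.append(C @ x)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, D, N = 50, 40, 16                 # frames, feature dim (e.g. MFCCs), state dim
u = rng.normal(size=(T, D))
A = 0.9 * np.eye(N)                  # stable state transition
B = rng.normal(size=(N, D)) * 0.1
C = rng.normal(size=(D, N)) * 0.1
print(ssm_scan(u, A, B, C).shape)    # (50, 40)
```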
Citations: 0