Computer Speech and Language最新文献_第3页

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-19 DOI: 10.1016/j.csl.2024.101752

Wei Xiang , Cheng Liu , Bang Wang

Document-level event causality identification (ECI) aims to detect causal relations in between event mentions in a document. Existing approaches for document-level event causality identification detect the causal relation for each pair of event mentions independently, while ignoring latent correlated cause–effect structure in a document, i.e., one cause (effect) with multiple effects (causes). We argue that identifying the causal relation of one event pair may facilitate the causality identification for other event pairs. In light of such considerations, we propose to model the correlated causal-effect structure by a hypergraph and jointly identify multiple causal relations with the same cause (effect). In particular, we propose an event-hypergraph neural encoding model, called EHNEM, for document-level event causality identification. In EHNEM, we first initialize event mentions’ embeddings via a pre-trained language model and obtain potential causal relation of each event pair via a multilayer perceptron. To capture causal correlations, we construct a hypergraph by integrating potential causal relations for the same event as a hyperedge. On the constructed event-hypergraph, we use a hypergraph convolutional network to learn the representation of each event node for final causality identification. Experiments on both EventStoryLine corpus and English-MECI corpus show that our EHNEM model significantly outperforms the state-of-the-art algorithms.

文档级事件因果关系识别（ECI）旨在检测文档中事件提及之间的因果关系。现有的文档级事件因果关系识别方法是独立检测每一对事件的因果关系，而忽略了文档中潜在的相关因果结构，即一因（果）多果（因）。我们认为，识别一个事件对的因果关系可能有助于识别其他事件对的因果关系。鉴于这种考虑，我们建议用超图来模拟相关的因果结构，并联合识别具有相同原因（结果）的多个因果关系。具体而言，我们提出了一种用于文档级事件因果关系识别的事件超图神经编码模型，称为 EHNEM。在 EHNEM 中，我们首先通过预训练的语言模型初始化事件提及的嵌入，然后通过多层感知器获取每个事件对的潜在因果关系。为了捕捉因果相关性，我们将同一事件的潜在因果关系整合为一个超边，从而构建一个超图。在构建的事件超图上，我们使用超图卷积网络来学习每个事件节点的表示法，以进行最终的因果关系识别。在 EventStoryLine 语料库和 English-MECI 语料库上进行的实验表明，我们的 EHNEM 模型明显优于最先进的算法。

{"title":"Modeling correlated causal-effect structure with a hypergraph for document-level event causality identification","authors":"Wei Xiang , Cheng Liu , Bang Wang","doi":"10.1016/j.csl.2024.101752","DOIUrl":"10.1016/j.csl.2024.101752","url":null,"abstract":"<div><div>Document-level event causality identification (ECI) aims to detect causal relations in between event mentions in a document. Existing approaches for document-level event causality identification detect the causal relation for each pair of event mentions independently, while ignoring latent correlated cause–effect structure in a document, i.e., one cause (effect) with multiple effects (causes). We argue that identifying the causal relation of one event pair may facilitate the causality identification for other event pairs. In light of such considerations, we propose to model the correlated causal-effect structure by a hypergraph and jointly identify multiple causal relations with the same cause (effect). In particular, we propose an event-hypergraph neural encoding model, called EHNEM, for document-level event causality identification. In EHNEM, we first initialize event mentions’ embeddings via a pre-trained language model and obtain potential causal relation of each event pair via a multilayer perceptron. To capture causal correlations, we construct a hypergraph by integrating potential causal relations for the same event as a hyperedge. On the constructed event-hypergraph, we use a hypergraph convolutional network to learn the representation of each event node for final causality identification. Experiments on both EventStoryLine corpus and English-MECI corpus show that our EHNEM model significantly outperforms the state-of-the-art algorithms.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101752"},"PeriodicalIF":3.1,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142699470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

You Are What You Write: Author re-identification privacy attacks in the era of pre-trained language models 你就是你写的东西预训练语言模型时代的作者再识别隐私攻击

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-16 DOI: 10.1016/j.csl.2024.101746

Richard Plant , Valerio Giuffrida , Dimitra Gkatzia

The widespread use of pre-trained language models has revolutionised knowledge transfer in natural language processing tasks. However, there is a concern regarding potential breaches of user trust due to the risk of re-identification attacks, where malicious users could extract Personally Identifiable Information (PII) from other datasets. To assess the extent of extractable personal information on popular pre-trained models, we conduct the first wide coverage evaluation and comparison of state-of-the-art privacy-preserving algorithms on a large multi-lingual dataset for sentiment analysis annotated with demographic information (including location, age, and gender). Our results suggest a link between model complexity, pre-training data volume, and the efficacy of privacy-preserving embeddings. We found that privacy-preserving methods demonstrate greater effectiveness when applied to larger and more complex models, with improvements exceeding

> 20 %

over non-private baselines. Additionally, we observe that local differential privacy imposes serious performance penalties of

\approx 20 %

in our test setting, which can be mitigated using hybrid or metric-DP techniques.

预训练语言模型的广泛使用彻底改变了自然语言处理任务中的知识转移。然而，由于存在重新识别攻击的风险，恶意用户可能会从其他数据集中提取个人身份信息（PII），因此存在潜在的破坏用户信任的问题。为了评估流行的预训练模型可提取个人信息的程度，我们首次在一个标注了人口统计信息（包括位置、年龄和性别）的大型多语言情感分析数据集上对最先进的隐私保护算法进行了广泛的评估和比较。我们的研究结果表明，模型复杂度、预训练数据量与隐私保护嵌入的功效之间存在联系。我们发现，当应用于更大型、更复杂的模型时，隐私保护方法表现出更大的有效性，与非隐私基线相比，改进幅度超过 20%。此外，我们还观察到，在我们的测试环境中，局部差分隐私会带来≈20%的严重性能损失，而使用混合或度量-隐私嵌入技术则可以减轻这种损失。

{"title":"You Are What You Write: Author re-identification privacy attacks in the era of pre-trained language models","authors":"Richard Plant , Valerio Giuffrida , Dimitra Gkatzia","doi":"10.1016/j.csl.2024.101746","DOIUrl":"10.1016/j.csl.2024.101746","url":null,"abstract":"<div><div>The widespread use of pre-trained language models has revolutionised knowledge transfer in natural language processing tasks. However, there is a concern regarding potential breaches of user trust due to the risk of re-identification attacks, where malicious users could extract Personally Identifiable Information (PII) from other datasets. To assess the extent of extractable personal information on popular pre-trained models, we conduct the first wide coverage evaluation and comparison of state-of-the-art privacy-preserving algorithms on a large multi-lingual dataset for sentiment analysis annotated with demographic information (including location, age, and gender). Our results suggest a link between model complexity, pre-training data volume, and the efficacy of privacy-preserving embeddings. We found that privacy-preserving methods demonstrate greater effectiveness when applied to larger and more complex models, with improvements exceeding <span><math><mrow><mo>></mo><mn>20</mn><mtext>%</mtext></mrow></math></span> over non-private baselines. Additionally, we observe that local differential privacy imposes serious performance penalties of <span><math><mrow><mo>≈</mo><mn>20</mn><mtext>%</mtext></mrow></math></span> in our test setting, which can be mitigated using hybrid or metric-DP techniques.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101746"},"PeriodicalIF":3.1,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142699468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

End-to-End Speech-to-Text Translation: A Survey 端到端语音到文本翻译：调查

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-14 DOI: 10.1016/j.csl.2024.101751

Nivedita Sethiya, Chandresh Kumar Maurya

Speech-to-Text (ST) translation pertains to the task of converting speech signals in one language to text in another language. It finds its application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR), as well as Machine Translation(MT) models, play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such integrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the works in this direction. We have attempted to provide a comprehensive review of models employed, metrics, and datasets used for ST tasks, providing challenges and future research direction with new insights. We believe this review will be helpful to researchers working on various applications of ST models.

语音到文本（ST）翻译是指将一种语言的语音信号转换成另一种语言的文本。它可应用于各种领域，如免提通信、听写、视频讲座转录和翻译等。自动语音识别（ASR）和机器翻译（MT）模型在传统的 ST 翻译中发挥着至关重要的作用，可将口语的原始形式转换为书面文本，促进无缝跨语言交流。ASR 识别口语单词，而 MT 则将转录文本翻译成目标语言。这种集成模型存在级联错误传播以及资源和培训成本高的问题。因此，研究人员一直在探索 ST 翻译的端到端（E2E）模型。然而，据我们所知，目前还没有关于 E2E ST 的全面综述。因此，本调查报告将讨论这方面的工作。我们试图对 ST 任务所使用的模型、度量标准和数据集进行全面评述，并提供具有新见解的挑战和未来研究方向。我们相信，这篇综述将对研究 ST 模型各种应用的研究人员有所帮助。

{"title":"End-to-End Speech-to-Text Translation: A Survey","authors":"Nivedita Sethiya, Chandresh Kumar Maurya","doi":"10.1016/j.csl.2024.101751","DOIUrl":"10.1016/j.csl.2024.101751","url":null,"abstract":"<div><div>Speech-to-Text (ST) translation pertains to the task of converting speech signals in one language to text in another language. It finds its application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR), as well as Machine Translation(MT) models, play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such integrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the works in this direction. We have attempted to provide a comprehensive review of models employed, metrics, and datasets used for ST tasks, providing challenges and future research direction with new insights. We believe this review will be helpful to researchers working on various applications of ST models.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101751"},"PeriodicalIF":3.1,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142699469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction 语料库和无监督基准：实现他加禄语语法错误校正

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-14 DOI: 10.1016/j.csl.2024.101750

Nankai Lin , Hongbin Zhang , Menglan Shen , Yu Wang , Shengyi Jiang , Aimin Yang

Grammatical error correction (GEC) is a challenging task for natural language processing techniques. Many efforts to address GEC have been made for high-resource languages such as English or Chinese. However, limited work has been done for low-resource languages because of the lack of large annotated corpora. In low-resource languages, the current unsupervised GEC based on language model scoring performs well. However, the pre-trained language model is still to be explored in this context. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, where GEC is viewed as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detecting and correcting module. We propose a novel scoring method for pseudo-perplexity to evaluate a sentence’s probable correctness and construct a Tagalog corpus for Tagalog GEC research. It obtains competitive performance on the self-constructed Tagalog corpus and the open-source Indonesian corpus, and it demonstrates that our framework is complementary to the baseline methods for low-resource GEC tasks. Our corpus can be obtained from https://github.com/GKLMIP/TagalogGEC.

语法纠错（GEC）是自然语言处理技术的一项具有挑战性的任务。许多针对语法纠错的工作都是针对英语或汉语等高资源语言的。然而，由于缺乏大型注释语料库，针对低资源语言所做的工作十分有限。在低资源语言中，目前基于语言模型评分的无监督 GEC 性能良好。然而，在这种情况下，预训练语言模型仍有待探索。本研究提出了一种基于 BERT 的无监督 GEC 框架，该框架主要解决词级错误，GEC 被视为一种多类分类任务。该框架包含三个模块：数据流构建模块、句子易读性评分模块和错误检测与纠正模块。我们提出了一种新颖的伪易错性评分方法来评估句子的可能正确性，并为他加禄语 GEC 研究构建了他加禄语语料库。它在自建的他加禄语语料库和开源印尼语语料库上取得了具有竞争力的性能，并证明了我们的框架是低资源 GEC 任务基准方法的补充。我们的语料库可从 https://github.com/GKLMIP/TagalogGEC 获取。

{"title":"Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction","authors":"Nankai Lin , Hongbin Zhang , Menglan Shen , Yu Wang , Shengyi Jiang , Aimin Yang","doi":"10.1016/j.csl.2024.101750","DOIUrl":"10.1016/j.csl.2024.101750","url":null,"abstract":"<div><div>Grammatical error correction (GEC) is a challenging task for natural language processing techniques. Many efforts to address GEC have been made for high-resource languages such as English or Chinese. However, limited work has been done for low-resource languages because of the lack of large annotated corpora. In low-resource languages, the current unsupervised GEC based on language model scoring performs well. However, the pre-trained language model is still to be explored in this context. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, where GEC is viewed as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detecting and correcting module. We propose a novel scoring method for pseudo-perplexity to evaluate a sentence’s probable correctness and construct a Tagalog corpus for Tagalog GEC research. It obtains competitive performance on the self-constructed Tagalog corpus and the open-source Indonesian corpus, and it demonstrates that our framework is complementary to the baseline methods for low-resource GEC tasks. Our corpus can be obtained from <span><span>https://github.com/GKLMIP/TagalogGEC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"91 ","pages":"Article 101750"},"PeriodicalIF":3.1,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142722464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TR-Net: Token Relation Inspired Table Filling Network for Joint Entity and Relation Extraction TR-Net：用于联合实体和关系提取的令牌关系启发填表网络

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-09 DOI: 10.1016/j.csl.2024.101749

Yongle Kong , Zhihao Yang , Zeyuan Ding , Wenfei Liu , Shiqi Zhang , Jianan Xu , Hongfei Lin

Recently, table filling models have achieved promising performance in jointly extracting relation triplets from complex sentences, leveraging their inherent structural advantage of delineating entities and relations as table cells. Nonetheless, these models predominantly concentrate on the cells corresponding to entity pairs within the predicted tables, neglecting the interrelations among other token pairs. This oversight can potentially lead to the exclusion of essential token information. To address these challenges, we introduce the Token Relation-Inspired Network (TR-Net), a novel framework for the joint extraction of entities and relations. It encompasses a token relation generator that adaptively constructs a token relation table, concentrating on the prominent token cells. Moreover, it also uses a structure-enhanced encoder that integrates the structural and sequential data of sentences via a highway gate mechanism. Our experimental analysis demonstrates that TR-Net delivers considerable enhancements and achieves state-of-the-art performance on four public datasets.

最近，表格填充模型利用其固有的结构优势，将实体和关系划分为表格单元，在联合提取复杂句子中的关系三元组方面取得了可喜的成绩。然而，这些模型主要集中于预测表格中实体对对应的单元格，而忽略了其他标记对之间的相互关系。这种疏忽有可能导致重要的标记信息被排除在外。为了应对这些挑战，我们引入了令牌关系启发网络（TR-Net），这是一个联合提取实体和关系的新型框架。它包括一个令牌关系生成器，该生成器能自适应地构建令牌关系表，并集中于突出的令牌单元。此外，它还使用了结构增强编码器，通过高速公路门机制整合句子的结构和顺序数据。我们的实验分析表明，TR-Net 在四个公共数据集上实现了相当大的提升，并达到了最先进的性能。

{"title":"TR-Net: Token Relation Inspired Table Filling Network for Joint Entity and Relation Extraction","authors":"Yongle Kong , Zhihao Yang , Zeyuan Ding , Wenfei Liu , Shiqi Zhang , Jianan Xu , Hongfei Lin","doi":"10.1016/j.csl.2024.101749","DOIUrl":"10.1016/j.csl.2024.101749","url":null,"abstract":"<div><div>Recently, table filling models have achieved promising performance in jointly extracting relation triplets from complex sentences, leveraging their inherent structural advantage of delineating entities and relations as table cells. Nonetheless, these models predominantly concentrate on the cells corresponding to entity pairs within the predicted tables, neglecting the interrelations among other token pairs. This oversight can potentially lead to the exclusion of essential token information. To address these challenges, we introduce the <em>Token Relation-Inspired Network (TR-Net)</em>, a novel framework for the joint extraction of entities and relations. It encompasses a token relation generator that adaptively constructs a token relation table, concentrating on the prominent token cells. Moreover, it also uses a structure-enhanced encoder that integrates the structural and sequential data of sentences via a highway gate mechanism. Our experimental analysis demonstrates that TR-Net delivers considerable enhancements and achieves state-of-the-art performance on four public datasets.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101749"},"PeriodicalIF":3.1,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023 完善语音合成评估：2023 年暴风雪挑战赛总结

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-08 DOI: 10.1016/j.csl.2024.101747

Olivier Perrotin , Brooke Stephenson , Silvain Gerber , Gérard Bailly , Simon King

The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high quality of synthetic speech generated by the latest TTS systems has thus revealed limitations with ITU-T P.800.1 Mean Opinion Score (MOS) in detecting the remaining differences between synthetic and natural speech. Yet, it was the only method used in previous Challenges and is still the most popular method in the field for speech synthesis evaluation. In the 2023 Challenge, we addressed observed limitations of past Challenges by incorporating state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity and intelligibility. For speech quality, a relative comparison of the systems receiving the best MOS was able to discover a greater number of significant differences between systems. Regarding speaker similarity, we demonstrated that there is a strong bias depending on whether the listeners are familiar with the target voice or not. As for intelligibility, the evaluation of language-specific phenomena, such as the pronunciation of homographs, better highlighted system limits compared to global transcription tasks of synthesised utterances. In addition to reporting results for the 18 entries to the 2023 Challenge, we extend the results analysis to type of TTS module to provide some insights on the most recent advances in model design. Overall, this year’s results demonstrate the need for a shift towards new methods for refining TTS evaluation to shed light on increasingly smaller and localised differences between synthesised and natural speech.

自 2005 年以来，"暴风雪挑战赛 "一直在为文本到语音（TTS）领域的进展设定基准。该挑战赛见证了一个个重要的里程碑，其结果表明，到 2021 年，合成语音在可懂度方面与自然语音无异，到同年，合成语音在自然度方面甚至可能与自然语音无异。因此，最新 TTS 系统生成的高质量合成语音暴露了 ITU-T P.800.1 平均意见分（MOS）在检测合成语音与自然语音之间的其余差异方面的局限性。然而，它是前几届挑战赛中使用的唯一方法，目前仍是语音合成评估领域最流行的方法。在 2023 年挑战赛中，我们通过采用最先进的语音合成评估技术来完善对语音质量、说话人相似度和可懂度的评估，从而解决了以往挑战赛中观察到的局限性。在语音质量方面，通过对获得最佳 MOS 的系统进行相对比较，我们发现了系统之间更多的显著差异。在说话者相似度方面，我们证明了听者是否熟悉目标声音会产生很大的偏差。至于可懂度，与合成语篇的全局转录任务相比，对特定语言现象（如同音字的发音）的评估更能突出系统的局限性。除了报告 2023 年挑战赛 18 个参赛项目的结果外，我们还将结果分析扩展到 TTS 模块的类型，以提供有关模型设计最新进展的一些见解。总之，今年的结果表明，有必要向改进 TTS 评估的新方法转变，以揭示合成语音与自然语音之间越来越小的局部差异。

{"title":"Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023","authors":"Olivier Perrotin , Brooke Stephenson , Silvain Gerber , Gérard Bailly , Simon King","doi":"10.1016/j.csl.2024.101747","DOIUrl":"10.1016/j.csl.2024.101747","url":null,"abstract":"<div><div>The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high quality of synthetic speech generated by the latest TTS systems has thus revealed limitations with ITU-T P.800.1 Mean Opinion Score (MOS) in detecting the remaining differences between synthetic and natural speech. Yet, it was the only method used in previous Challenges and is still the most popular method in the field for speech synthesis evaluation. In the 2023 Challenge, we addressed observed limitations of past Challenges by incorporating state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity and intelligibility. For speech quality, a relative comparison of the systems receiving the best MOS was able to discover a greater number of significant differences between systems. Regarding speaker similarity, we demonstrated that there is a strong bias depending on whether the listeners are familiar with the target voice or not. As for intelligibility, the evaluation of language-specific phenomena, such as the pronunciation of homographs, better highlighted system limits compared to global transcription tasks of synthesised utterances. In addition to reporting results for the 18 entries to the 2023 Challenge, we extend the results analysis to type of TTS module to provide some insights on the most recent advances in model design. Overall, this year’s results demonstrate the need for a shift towards new methods for refining TTS evaluation to shed light on increasingly smaller and localised differences between synthesised and natural speech.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101747"},"PeriodicalIF":3.1,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142699471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification CLIPMulti：探索用于零镜头文本分类的多模态增强型 CLIP 的性能

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-11-07 DOI: 10.1016/j.csl.2024.101748

Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang

Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, which shows good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention is paid to the help of multimodality for text classification. In order to incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to find the corresponding text based on the classified images automatically. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.

零镜头文本分类不需要大量标注数据，旨在处理缺乏标注训练数据的文本分类任务。现有的零镜头文本分类使用文本-文本匹配范式或文本-图像匹配范式，在不同的基准数据集上显示出良好的性能。然而，现有的分类范式只考虑了文本匹配的单一模态，很少关注多模态对文本分类的帮助。为了将多模态纳入零镜头文本分类，我们提出了一种多模态增强型 CLIP 框架（CLIPMulti），它采用文本-图像&文本匹配范式来增强零镜头文本分类的有效性。我们测试了三种不同的图像和文本组合对零镜头文本分类的影响，并进一步提出了一种匹配方法（Match-CLIPMulti），以根据分类图像自动查找相应的文本。我们在七个公开的零镜头文本分类数据集上进行了实验，并取得了具有竞争力的性能。此外，我们还分析了不同参数对 Match-CLIPMulti 实验的影响。我们希望这项工作能为语言任务中的多模态融合带来更多思考和探索。

{"title":"CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification","authors":"Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang","doi":"10.1016/j.csl.2024.101748","DOIUrl":"10.1016/j.csl.2024.101748","url":null,"abstract":"<div><div>Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, which shows good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention is paid to the help of multimodality for text classification. In order to incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to find the corresponding text based on the classified images automatically. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101748"},"PeriodicalIF":3.1,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

UniKDD: A Unified Generative model for Knowledge-driven Dialogue UniKDD：知识驱动对话的统一生成模型

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-10-30 DOI: 10.1016/j.csl.2024.101740

Qian Wang , Yan Chen , Yang Wang , Xu Wang

knowledge-driven dialogue (KDD) is to introduce an external knowledge base, generating an informative and fluent response. However, previous works employ different models to conduct the sub-tasks of KDD, ignoring the connection between sub-tasks and resulting in a difficulty of training and inference. To solve those issues above, we propose the UniKDD, a unified generative model for KDD, which models all sub-tasks into a generation task, enhancing the connection between tasks and facilitating the training and inference. Specifically, UniKDD simplifies the complex KDD tasks into three main sub-tasks, i.e., entity prediction, attribute prediction, and dialogue generation. These tasks are transformed into a text generation task and trained by an end-to-end way. In the inference phase, UniKDD first predicts a set of entities used for current turn dialogue according to the dialogue history. Then, for each predicted entity, UniKDD predicts the corresponding attributes by the dialogue history. Finally, UniKDD generates a high-quality and informative response using the dialogue history and predicted knowledge triplets. The experimental results show that our proposed UniKDD can perform KDD task well and outperform the baseline on the evaluation of knowledge selection and response generation. The code is available at https://github.com/qianandfei/UniKDD.git.

知识驱动对话（KDD）的目的是引入外部知识库，生成信息丰富且流畅的回应。然而，以往的研究采用不同的模型来完成 KDD 的子任务，忽略了子任务之间的联系，导致训练和推理困难。为了解决上述问题，我们提出了统一的 KDD 生成模型 UniKDD，它将所有子任务建模为一个生成任务，加强了任务之间的联系，方便了训练和推理。具体来说，UniKDD 将复杂的 KDD 任务简化为三个主要子任务，即实体预测、属性预测和对话生成。这些任务被转化为文本生成任务，并通过端到端的方式进行训练。在推理阶段，UniKDD 首先根据对话历史记录预测一组用于当前回合对话的实体。然后，对于每个预测的实体，UniKDD 根据对话历史预测相应的属性。最后，UniKDD 利用对话历史和预测的知识三元组生成高质量和信息丰富的回复。实验结果表明，我们提出的 UniKDD 可以很好地完成 KDD 任务，在知识选择和响应生成的评估方面优于基线。代码见 https://github.com/qianandfei/UniKDD.git。

{"title":"UniKDD: A Unified Generative model for Knowledge-driven Dialogue","authors":"Qian Wang , Yan Chen , Yang Wang , Xu Wang","doi":"10.1016/j.csl.2024.101740","DOIUrl":"10.1016/j.csl.2024.101740","url":null,"abstract":"<div><div>knowledge-driven dialogue (KDD) is to introduce an external knowledge base, generating an informative and fluent response. However, previous works employ different models to conduct the sub-tasks of KDD, ignoring the connection between sub-tasks and resulting in a difficulty of training and inference. To solve those issues above, we propose the UniKDD, a unified generative model for KDD, which models all sub-tasks into a generation task, enhancing the connection between tasks and facilitating the training and inference. Specifically, UniKDD simplifies the complex KDD tasks into three main sub-tasks, i.e., entity prediction, attribute prediction, and dialogue generation. These tasks are transformed into a text generation task and trained by an end-to-end way. In the inference phase, UniKDD first predicts a set of entities used for current turn dialogue according to the dialogue history. Then, for each predicted entity, UniKDD predicts the corresponding attributes by the dialogue history. Finally, UniKDD generates a high-quality and informative response using the dialogue history and predicted knowledge triplets. The experimental results show that our proposed UniKDD can perform KDD task well and outperform the baseline on the evaluation of knowledge selection and response generation. The code is available at <span><span>https://github.com/qianandfei/UniKDD.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101740"},"PeriodicalIF":3.1,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploring the ability of LLMs to classify written proficiency levels 探索法律硕士划分书面能力水平的能力

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-10-29 DOI: 10.1016/j.csl.2024.101745

Susanne DeVore

This paper tests the ability of LLMs to classify language proficiency ratings of texts written by learners of English and Mandarin, taking a benchmarking research design approach. First, the impact of five variables (LLM model, prompt version, prompt language, grading scale, and temperature) on rating accuracy are tested using a basic instruction-only prompt. Second, the consistency of results is tested. Third, the top performing consistent conditions emerging from the first and second tests are used to test the impact of adding examples and/or proficiency guidelines and the use of zero-, one-, and few-shot chain-of-thought prompting techniques on accuracy rating. While performance does not meet levels necessary for real-world use cases, the results can inform ongoing development of LLMs and prompting techniques to improve accuracy. This paper highlights recent research on prompt engineering outside of the field of linguistics and selects prompt variables and techniques that are theoretically relevant to proficiency rating. Finally, it discusses key takeaways from these tests that can inform future development and why approaches that have been effective in other contexts were not as effective for proficiency rating.

本文采用基准研究设计方法，测试了 LLM 对英语和普通话学习者所写文章的语言水平评分进行分类的能力。首先，使用纯基础教学提示语测试了五个变量（LLM 模型、提示语版本、提示语、评分标准和温度）对评分准确性的影响。其次，测试结果的一致性。第三，利用第一次和第二次测试中表现最好的一致条件，测试添加示例和/或能力指南以及使用零、一和少量思维链提示技术对准确性评级的影响。虽然测试结果没有达到实际应用所需的水平，但可以为 LLM 和提示技术的持续开发提供参考，从而提高准确率。本文重点介绍了语言学领域之外有关提示工程的最新研究，并选择了理论上与能力评级相关的提示变量和技术。最后，本文讨论了从这些测试中获得的关键启示，这些启示可以为未来的开发提供参考，以及为什么在其他情况下有效的方法在熟练程度评级中却不那么有效。

{"title":"Exploring the ability of LLMs to classify written proficiency levels","authors":"Susanne DeVore","doi":"10.1016/j.csl.2024.101745","DOIUrl":"10.1016/j.csl.2024.101745","url":null,"abstract":"<div><div>This paper tests the ability of LLMs to classify language proficiency ratings of texts written by learners of English and Mandarin, taking a benchmarking research design approach. First, the impact of five variables (LLM model, prompt version, prompt language, grading scale, and temperature) on rating accuracy are tested using a basic instruction-only prompt. Second, the consistency of results is tested. Third, the top performing consistent conditions emerging from the first and second tests are used to test the impact of adding examples and/or proficiency guidelines and the use of zero-, one-, and few-shot chain-of-thought prompting techniques on accuracy rating. While performance does not meet levels necessary for real-world use cases, the results can inform ongoing development of LLMs and prompting techniques to improve accuracy. This paper highlights recent research on prompt engineering outside of the field of linguistics and selects prompt variables and techniques that are theoretically relevant to proficiency rating. Finally, it discusses key takeaways from these tests that can inform future development and why approaches that have been effective in other contexts were not as effective for proficiency rating.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101745"},"PeriodicalIF":3.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142657092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Entity and relationship extraction based on span contribution evaluation and focusing framework 基于跨度贡献评估和聚焦框架的实体和关系提取

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-10-29 DOI: 10.1016/j.csl.2024.101744

Qibin Li , Nianmin Yao , Nai Zhou , Jian Zhao

Entity and relationship extraction involves identifying named entities and extracting relationships between them. Existing research focuses on enhancing span representations, yet overlooks the impact of non-target spans(ie, the span is non-entity or the span pair has no relationship) on model training. In this work, we propose a span contribution evaluation and focusing framework named CEFF, which assigns a contribution score to each non-target span in a sentence through pre-training, which reflects the contribution of span to model performance improvement. To a certain extent, this method considers the impact of different spans on model training, making the training more targeted. Additionally, leveraging the contribution scores of non-target spans, we introduce a simplified variant of the model, termed CEFF

_{s}

, which achieves comparable performance to models trained with all spans while utilizing fewer spans. This approach reduces training costs and improves training efficiency. Through extensive validation, we demonstrate that our contribution scores accurately reflect span contributions and achieve state-of-the-art results on five benchmark datasets.

实体和关系提取包括识别命名实体和提取实体之间的关系。现有研究侧重于增强跨度表示，但忽略了非目标跨度（即跨度为非实体或跨度对没有关系）对模型训练的影响。在这项工作中，我们提出了一个名为 CEFF 的跨度贡献评估和聚焦框架，通过预训练为句子中的每个非目标跨度分配一个贡献分值，以反映跨度对模型性能提升的贡献。这种方法在一定程度上考虑了不同跨度对模型训练的影响，使训练更有针对性。此外，利用非目标跨度的贡献分数，我们还引入了一种简化的模型变体，称为 CEFFs，它可以在利用较少跨度的情况下达到与所有跨度训练的模型相当的性能。这种方法降低了训练成本，提高了训练效率。通过广泛的验证，我们证明了我们的贡献分数能准确反映跨度贡献，并在五个基准数据集上取得了最先进的结果。

{"title":"Entity and relationship extraction based on span contribution evaluation and focusing framework","authors":"Qibin Li , Nianmin Yao , Nai Zhou , Jian Zhao","doi":"10.1016/j.csl.2024.101744","DOIUrl":"10.1016/j.csl.2024.101744","url":null,"abstract":"<div><div>Entity and relationship extraction involves identifying named entities and extracting relationships between them. Existing research focuses on enhancing span representations, yet overlooks the impact of non-target spans(ie, the span is non-entity or the span pair has no relationship) on model training. In this work, we propose a span contribution evaluation and focusing framework named CEFF, which assigns a contribution score to each non-target span in a sentence through pre-training, which reflects the contribution of span to model performance improvement. To a certain extent, this method considers the impact of different spans on model training, making the training more targeted. Additionally, leveraging the contribution scores of non-target spans, we introduce a simplified variant of the model, termed CEFF<span><math><msub><mrow></mrow><mrow><mi>s</mi></mrow></msub></math></span>, which achieves comparable performance to models trained with all spans while utilizing fewer spans. This approach reduces training costs and improves training efficiency. Through extensive validation, we demonstrate that our contribution scores accurately reflect span contributions and achieve state-of-the-art results on five benchmark datasets.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101744"},"PeriodicalIF":3.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0