Computer Speech and Language: Latest Articles (Page 2)

End-to-End Speech-to-Text Translation: A Survey

IF 3.1 | CAS Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence

Computer Speech and Language

Pub Date : 2024-11-14 DOI: 10.1016/j.csl.2024.101751

Nivedita Sethiya, Chandresh Kumar Maurya

Speech-to-Text (ST) translation pertains to the task of converting speech signals in one language to text in another language. It finds application in various domains, such as hands-free communication, dictation, video lecture transcription, and translation, to name a few. Automatic Speech Recognition (ASR) and Machine Translation (MT) models play crucial roles in traditional ST translation, enabling the conversion of spoken language in its original form to written text and facilitating seamless cross-lingual communication. ASR recognizes spoken words, while MT translates the transcribed text into the target language. Such integrated models suffer from cascaded error propagation and high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models for ST translation. However, to our knowledge, there is no comprehensive review of existing works on E2E ST. The present survey, therefore, discusses the works in this direction. We have attempted to provide a comprehensive review of the models employed and the metrics and datasets used for ST tasks, and to present challenges and future research directions with new insights. We believe this review will be helpful to researchers working on various applications of ST models.
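
The cascaded design described in the abstract can be illustrated with a minimal sketch. The two stages below are toy lookup tables standing in for real ASR and MT models (all names, clip identifiers, and strings are hypothetical and not taken from the survey); the only point is that the MT stage sees nothing but the ASR transcript, so a recognition error is carried straight into the translation.

```python
from typing import Callable


def make_cascade(asr: Callable[[str], str], mt: Callable[[str], str]) -> Callable[[str], str]:
    """Compose ASR and MT; the MT stage only ever sees the ASR transcript."""
    def translate(audio_id: str) -> str:
        transcript = asr(audio_id)  # stage 1: speech -> source-language text
        return mt(transcript)       # stage 2: source text -> target text
    return translate


# Hypothetical toy stand-ins for trained models.
def toy_asr(audio_id: str) -> str:
    transcripts = {
        "clip_clean": "good morning",
        "clip_noisy": "good mourning",  # simulated recognition error
    }
    return transcripts[audio_id]


def toy_mt(text: str) -> str:
    translations = {
        "good morning": "guten Morgen",
        "good mourning": "gute Trauer",
    }
    return translations[text]


cascade = make_cascade(toy_asr, toy_mt)
print(cascade("clip_clean"))
print(cascade("clip_noisy"))  # the ASR error propagates into the translation
```

An end-to-end model would replace both stages with a single function from audio to target text, which removes the intermediate transcript as a point of failure.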

Volume 90, Article 101751

Citations: 0

Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction

Pub Date : 2024-11-14 DOI: 10.1016/j.csl.2024.101750

Nankai Lin , Hongbin Zhang , Menglan Shen , Yu Wang , Shengyi Jiang , Aimin Yang

Grammatical error correction (GEC) is a challenging task for natural language processing techniques. Many efforts to address GEC have been made for high-resource languages such as English or Chinese. However, limited work has been done for low-resource languages because of the lack of large annotated corpora. In low-resource languages, the current unsupervised GEC based on language model scoring performs well. However, the pre-trained language model is still to be explored in this context. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, where GEC is viewed as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detecting and correcting module. We propose a novel scoring method for pseudo-perplexity to evaluate a sentence’s probable correctness and construct a Tagalog corpus for Tagalog GEC research. It obtains competitive performance on the self-constructed Tagalog corpus and the open-source Indonesian corpus, and it demonstrates that our framework is complementary to the baseline methods for low-resource GEC tasks. Our corpus can be obtained from https://github.com/GKLMIP/TagalogGEC.
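
As a rough illustration of sentence scoring with a masked language model, the sketch below computes the standard pseudo-perplexity, exp(-(1/N) Σ log P(w_i | all other tokens)), from per-token log-probabilities. It assumes those log-probabilities have already been obtained by masking each token in turn; the function names and the toy numbers are hypothetical, and this is the textbook formulation, not the novel scoring method the paper proposes.

```python
import math
from typing import Dict, Sequence


def pseudo_perplexity(masked_log_probs: Sequence[float]) -> float:
    """Standard pseudo-perplexity from per-token masked log-probabilities.

    masked_log_probs[i] is assumed to be log P(token_i | all other tokens),
    obtained by masking position i and reading the model's prediction.
    """
    if not masked_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))


def pick_most_plausible(candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate sentence with the lowest pseudo-perplexity."""
    return min(candidates, key=lambda s: pseudo_perplexity(candidates[s]))


# Toy log-probabilities (hypothetical values, not model output).
candidates = {
    "candidate_a": [math.log(0.5)] * 4,
    "candidate_b": [math.log(0.1)] * 4,
}
print(pick_most_plausible(candidates))
```

A lower pseudo-perplexity means the model finds the sentence more plausible, so a correction candidate can be accepted when it lowers the score of the original sentence.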

Volume 91, Article 101750

Citations: 0

TR-Net: Token Relation Inspired Table Filling Network for Joint Entity and Relation Extraction

Pub Date : 2024-11-09 DOI: 10.1016/j.csl.2024.101749

Yongle Kong , Zhihao Yang , Zeyuan Ding , Wenfei Liu , Shiqi Zhang , Jianan Xu , Hongfei Lin

Recently, table filling models have achieved promising performance in jointly extracting relation triplets from complex sentences, leveraging their inherent structural advantage of delineating entities and relations as table cells. Nonetheless, these models predominantly concentrate on the cells corresponding to entity pairs within the predicted tables, neglecting the interrelations among other token pairs. This oversight can potentially lead to the exclusion of essential token information. To address these challenges, we introduce the Token Relation-Inspired Network (TR-Net), a novel framework for the joint extraction of entities and relations. It encompasses a token relation generator that adaptively constructs a token relation table, concentrating on the prominent token cells. Moreover, it also uses a structure-enhanced encoder that integrates the structural and sequential data of sentences via a highway gate mechanism. Our experimental analysis demonstrates that TR-Net delivers considerable enhancements and achieves state-of-the-art performance on four public datasets.
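
To make the table-filling formulation concrete, here is a minimal sketch of the generic label table that such models predict: an n × n grid over tokens in which cells inside an entity span carry the entity type and cells pairing a head entity with a tail entity carry the relation. The label scheme, function name, and example sentence are assumptions for illustration; TR-Net's token relation table is constructed adaptively by a learned generator and is not reproduced here.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def build_label_table(
    n_tokens: int,
    entities: List[Tuple[int, int, str]],
    relations: List[Tuple[Span, Span, str]],
) -> List[List[str]]:
    """Build a generic table-filling target; "O" marks cells with no label."""
    table = [["O"] * n_tokens for _ in range(n_tokens)]
    # Cells inside an entity span get the entity type.
    for start, end, entity_type in entities:
        for i in range(start, end + 1):
            for j in range(start, end + 1):
                table[i][j] = entity_type
    # Cells pairing head-entity tokens with tail-entity tokens get the relation.
    for (h_start, h_end), (t_start, t_end), relation in relations:
        for i in range(h_start, h_end + 1):
            for j in range(t_start, t_end + 1):
                table[i][j] = relation
    return table


tokens = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
table = build_label_table(
    len(tokens),
    entities=[(0, 1, "PER"), (5, 5, "LOC")],
    relations=[((0, 1), (5, 5), "born_in")],
)
for token, row in zip(tokens, table):
    print(f"{token:>7}", row)
```

In this toy table most cells are "O"; these are the non-entity token pairs whose interrelations, according to the abstract, earlier table-filling models neglect.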

Volume 90, Article 101749

Citations: 0

Refining the evaluation of speech synthesis: A summary of the Blizzard Challenge 2023

Pub Date : 2024-11-08 DOI: 10.1016/j.csl.2024.101747

Olivier Perrotin , Brooke Stephenson , Silvain Gerber , Gérard Bailly , Simon King

The Blizzard Challenge has benchmarked progress in Text-to-Speech (TTS) since 2005. The Challenge has seen important milestones passed, with results suggesting that synthetic speech was indistinguishable from natural speech in terms of intelligibility in 2021 and that by that same year it was perhaps even indistinguishable in naturalness. The high quality of synthetic speech generated by the latest TTS systems has thus revealed limitations with ITU-T P.800.1 Mean Opinion Score (MOS) in detecting the remaining differences between synthetic and natural speech. Yet, it was the only method used in previous Challenges and is still the most popular method in the field for speech synthesis evaluation. In the 2023 Challenge, we addressed observed limitations of past Challenges by incorporating state-of-the-art speech synthesis evaluation techniques to refine the evaluation of speech quality, speaker similarity and intelligibility. For speech quality, a relative comparison of the systems receiving the best MOS was able to discover a greater number of significant differences between systems. Regarding speaker similarity, we demonstrated that there is a strong bias depending on whether the listeners are familiar with the target voice or not. As for intelligibility, the evaluation of language-specific phenomena, such as the pronunciation of homographs, better highlighted system limits compared to global transcription tasks of synthesised utterances. In addition to reporting results for the 18 entries to the 2023 Challenge, we extend the results analysis to type of TTS module to provide some insights on the most recent advances in model design. Overall, this year’s results demonstrate the need for a shift towards new methods for refining TTS evaluation to shed light on increasingly smaller and localised differences between synthesised and natural speech.
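
For readers unfamiliar with MOS, the sketch below shows how a Mean Opinion Score and a normal-approximation confidence interval could be computed from listener ratings on a 1 to 5 scale. The function name and ratings are made up, and the normal approximation is an assumption chosen for brevity; it is not the statistical analysis used in the Challenge.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Hypothetical listener ratings for two systems.
system_a = [4, 5, 4, 3, 4]
system_b = [4, 4, 5, 4, 4]
for name, ratings in (("A", system_a), ("B", system_b)):
    mean, half_width = mos_with_ci(ratings)
    print(f"system {name}: MOS = {mean:.2f} +/- {half_width:.2f}")
```

When the intervals of two high-quality systems overlap, as they do for these toy ratings, absolute MOS cannot separate them, which is the limitation that motivates the relative comparisons described in the abstract.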

Volume 90, Article 101747

Citations: 0

CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification

Pub Date : 2024-11-07 DOI: 10.1016/j.csl.2024.101748

Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang

Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, which shows good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention is paid to the help of multimodality for text classification. In order to incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to find the corresponding text based on the classified images automatically. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.
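
A matching-based zero-shot classifier can be sketched as follows: the input text embedding is compared with a text embedding and an image embedding for each label, and the label with the highest combined similarity wins. The embeddings here are toy two-dimensional vectors, and the weighted sum controlled by `alpha` is an assumption made for illustration; the abstract does not state how CLIPMulti combines the two modalities.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(
    text_vec: Sequence[float],
    label_text_vecs: Dict[str, Sequence[float]],
    label_image_vecs: Dict[str, Sequence[float]],
    alpha: float = 0.5,
) -> Tuple[str, Dict[str, float]]:
    """Score each label by a weighted sum of text-text and text-image similarity."""
    scores = {}
    for label, label_text_vec in label_text_vecs.items():
        text_score = cosine(text_vec, label_text_vec)
        image_score = cosine(text_vec, label_image_vecs[label])
        scores[label] = alpha * text_score + (1 - alpha) * image_score
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical; a real system would use CLIP encoders).
label_text_vecs = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
label_image_vecs = {"sports": [0.9, 0.2], "politics": [0.1, 1.0]}
label, scores = classify([1.0, 0.1], label_text_vecs, label_image_vecs)
print(label, scores)
```

Setting `alpha` to 1.0 recovers pure text-text matching and 0.0 recovers pure text-image matching, the two single-modality paradigms the abstract contrasts with the multimodal one.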

Volume 90, Article 101748

Citations: 0

UniKDD: A Unified Generative model for Knowledge-driven Dialogue

Pub Date : 2024-10-30 DOI: 10.1016/j.csl.2024.101740

Qian Wang , Yan Chen , Yang Wang , Xu Wang

Knowledge-driven dialogue (KDD) aims to introduce an external knowledge base to generate informative and fluent responses. However, previous works employ different models for the sub-tasks of KDD, ignoring the connections between sub-tasks and making training and inference difficult. To solve these issues, we propose UniKDD, a unified generative model for KDD, which casts all sub-tasks as a single generation task, strengthening the connections between tasks and simplifying training and inference. Specifically, UniKDD simplifies the complex KDD task into three main sub-tasks, i.e., entity prediction, attribute prediction, and dialogue generation. These sub-tasks are transformed into a text generation task and trained end to end. In the inference phase, UniKDD first predicts the set of entities used for the current dialogue turn according to the dialogue history. Then, for each predicted entity, UniKDD predicts the corresponding attributes from the dialogue history. Finally, UniKDD generates a high-quality and informative response using the dialogue history and the predicted knowledge triplets. The experimental results show that UniKDD performs the KDD task well and outperforms the baseline on the evaluation of knowledge selection and response generation. The code is available at https://github.com/qianandfei/UniKDD.git.
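
The three inference steps can be sketched as a single loop around one text generator. Everything concrete below is an assumption for illustration: the prompt prefixes, the `[SEP]`, `[ENT]`, and `[KNOW]` markers, the semicolon-separated output format, the dictionary knowledge base, and the stub generator are hypothetical and are not taken from the UniKDD repository.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity, attribute, value)


def split_items(text: str) -> List[str]:
    """Split a semicolon-separated generation into clean items."""
    return [item.strip() for item in text.split(";") if item.strip()]


def respond(
    history: List[str],
    generate: Callable[[str], str],
    kb: Dict[Tuple[str, str], str],
) -> Tuple[List[Triple], str]:
    """Run entity prediction, attribute prediction, then response generation."""
    context = " [SEP] ".join(history)
    triples: List[Triple] = []
    # Step 1: predict entities for the current turn from the dialogue history.
    for entity in split_items(generate(f"predict entities: {context}")):
        # Step 2: predict attributes for each predicted entity.
        attributes = generate(f"predict attributes: {context} [ENT] {entity}")
        for attribute in split_items(attributes):
            value = kb.get((entity, attribute))
            if value is not None:
                triples.append((entity, attribute, value))
    # Step 3: generate the reply from the history plus the selected triplets.
    facts = " ; ".join(" | ".join(triple) for triple in triples)
    reply = generate(f"generate response: {context} [KNOW] {facts}")
    return triples, reply


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for one shared generative model."""
    if prompt.startswith("predict entities:"):
        return "Inception"
    if prompt.startswith("predict attributes:"):
        return "director"
    return "It was directed by Christopher Nolan."


kb = {("Inception", "director"): "Christopher Nolan"}
triples, reply = respond(["Who directed Inception?"], stub_generate, kb)
print(triples, reply)
```

Because all three steps call the same `generate` function with different prompts, the sketch mirrors the abstract's idea of casting every sub-task as text generation in one model.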

Volume 90, Article 101740

Citations: 0

Exploring the ability of LLMs to classify written proficiency levels

Pub Date : 2024-10-29 DOI: 10.1016/j.csl.2024.101745

Susanne DeVore

This paper tests the ability of LLMs to classify language proficiency ratings of texts written by learners of English and Mandarin, taking a benchmarking research design approach. First, the impact of five variables (LLM model, prompt version, prompt language, grading scale, and temperature) on rating accuracy is tested using a basic instruction-only prompt. Second, the consistency of results is tested. Third, the top-performing consistent conditions emerging from the first and second tests are used to test the impact on rating accuracy of adding examples and/or proficiency guidelines and of using zero-, one-, and few-shot chain-of-thought prompting techniques. While performance does not meet levels necessary for real-world use cases, the results can inform ongoing development of LLMs and prompting techniques to improve accuracy. This paper highlights recent research on prompt engineering outside of the field of linguistics and selects prompt variables and techniques that are theoretically relevant to proficiency rating. Finally, it discusses key takeaways from these tests that can inform future development and why approaches that have been effective in other contexts were not as effective for proficiency rating.
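
The first test amounts to a factorial grid over the five variables, with accuracy computed per condition. The sketch below builds such a grid and a simple accuracy function; the variable values and rating labels are placeholders invented for illustration, since the abstract does not list the models, scales, or temperatures actually used.

```python
from itertools import product
from typing import Dict, List, Sequence


def build_conditions(grid: Dict[str, Sequence]) -> List[dict]:
    """Expand a variable grid into one dict per experimental condition."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of texts whose predicted rating matches the gold rating."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Placeholder values (hypothetical): two levels per variable.
grid = {
    "model": ["model_a", "model_b"],
    "prompt_version": ["v1", "v2"],
    "prompt_language": ["English", "Mandarin"],
    "grading_scale": ["scale_1", "scale_2"],
    "temperature": [0.0, 1.0],
}
conditions = build_conditions(grid)
print(len(conditions), conditions[0])
print(accuracy(["low", "mid", "high"], ["low", "high", "high"]))
```

Repeating each condition several times and comparing the resulting accuracies would correspond to the consistency check described as the second test.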

Volume 90, Article 101745

Citations: 0

Entity and relationship extraction based on span contribution evaluation and focusing framework

Pub Date : 2024-10-29 DOI: 10.1016/j.csl.2024.101744

Qibin Li , Nianmin Yao , Nai Zhou , Jian Zhao

Entity and relationship extraction involves identifying named entities and extracting relationships between them. Existing research focuses on enhancing span representations, yet overlooks the impact of non-target spans (i.e., spans that are not entities, or span pairs that have no relationship) on model training. In this work, we propose a span contribution evaluation and focusing framework named CEFF, which assigns a contribution score to each non-target span in a sentence through pre-training; the score reflects the span's contribution to improving model performance. To a certain extent, this method considers the impact of different spans on model training, making the training more targeted. Additionally, leveraging the contribution scores of non-target spans, we introduce a simplified variant of the model, termed CEFF$_{s}$, which achieves comparable performance to models trained with all spans while utilizing fewer spans. This approach reduces training costs and improves training efficiency. Through extensive validation, we demonstrate that our contribution scores accurately reflect span contributions and achieve state-of-the-art results on five benchmark datasets.
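
The idea of training on fewer non-target spans can be sketched as span enumeration followed by score-based filtering. The top-k selection rule, the `keep_ratio` parameter, and the scores below are assumptions made for illustration; the abstract does not specify how the simplified variant chooses which spans to keep.

```python
import math
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def enumerate_spans(n_tokens: int, max_len: int) -> List[Span]:
    """List every span of at most max_len tokens."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start, min(start + max_len, n_tokens))
    ]


def select_training_spans(
    spans: List[Span],
    target_spans: Set[Span],
    scores: Dict[Span, float],
    keep_ratio: float,
) -> List[Span]:
    """Keep all target spans plus the highest-scoring share of non-target spans."""
    targets = [s for s in spans if s in target_spans]
    non_targets = [s for s in spans if s not in target_spans]
    non_targets.sort(key=lambda s: scores.get(s, 0.0), reverse=True)
    n_keep = math.ceil(keep_ratio * len(non_targets))
    return targets + non_targets[:n_keep]


spans = enumerate_spans(n_tokens=3, max_len=2)
# Hypothetical contribution scores for the non-target spans.
scores = {(0, 0): 0.9, (1, 1): 0.1, (1, 2): 0.5, (2, 2): 0.3}
selected = select_training_spans(spans, {(0, 1)}, scores, keep_ratio=0.5)
print(spans)
print(selected)
```

With `keep_ratio=1.0` the function returns every span, which corresponds to training on all spans; lower ratios trade coverage for cheaper training.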

Volume 90, Article 101744

Citations: 0

Taking relations as known conditions: A tagging based method for relational triple extraction

IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-10-24 DOI: 10.1016/j.csl.2024.101734

Guanqing Kong , Qi Lei

Relational triple extraction refers to extracting entities and relations from natural texts, which is a crucial task in the construction of knowledge graphs. Recently, tagging-based methods have received increasing attention because of their simple and effective structural form. Among them, the two-step extraction method tends to cause category imbalance. To address this issue, we propose a novel two-step extraction method, which first extracts subjects, generates a fixed-size embedding for each relation, and then regards these relations as known conditions to extract the objects directly with the identified subjects. In order to eliminate the influence of irrelevant relations when predicting objects, we use a relation-special attention mechanism and a gate unit to select appropriate relations. In addition, most current models do not account for two-way interaction between tasks, so we design a feature interactive network to achieve bidirectional interaction between subject and object extraction tasks and enhance their connection. Experimental results on the NYT, WebNLG, NYT$^{\star}$ and WebNLG$^{\star}$ datasets show that our model is competitive among joint extraction models.

Volume 90, Article 101734

Citations: 0

What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures

IF 3.1 Zone 3 Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language

Pub Date : 2024-10-22 DOI: 10.1016/j.csl.2024.101738

Julian Linke , Bernhard C. Geiger , Gernot Kubin , Barbara Schuppler

Highly performing speech recognition is important for more fluent human–machine interaction (e.g., dialogue systems). Modern ASR architectures achieve human-level recognition performance on read speech but still perform sub-par on conversational speech, which arguably is or, at least, will be instrumental for human–machine interaction. Understanding the factors behind this shortcoming of modern ASR systems may suggest directions for improving them. In this work, we compare the performances of HMM- vs. transformer-based ASR architectures on a corpus of Austrian German conversational speech. Specifically, we investigate how strongly utterance length, prosody, pronunciation, and utterance complexity as measured by perplexity affect different ASR architectures. Among other findings, we observe that single-word utterances – which are characteristic of conversational speech and constitute roughly 30% of the corpus – are recognized more accurately if their F0 contour is flat; for longer utterances, the effects of the F0 contour tend to be weaker. We further find that zero-shot systems require longer utterance lengths and are less robust to pronunciation variation, which indicates that pronunciation lexicons and fine-tuning on the respective corpus are essential ingredients for the successful recognition of conversational speech.

Volume 90, Article 101738

Citations: 0