Pub Date: 2024-07-11 | DOI: 10.1109/TASLP.2024.3423652
Xincheng Yu;Dongyue Guo;Jianwei Zhang;Yi Lin
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain that degrades speech quality and, in turn, reduces automatic speech recognition (ASR) accuracy. In this work, a time-domain recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and also advance ASR accuracy. Built on a convolutional encoder-decoder U-Net architecture, ROSE serves as a plug-and-play tool in ATC scenarios and does not require retraining of the ASR model. Specifically, 1) in the U-Net architecture, an attention-based skip-fusion (ABSF) module mines shared features from the encoders using an attention mask, enabling the model to fuse hierarchical features effectively; 2) a channel and sequence attention (CSAtt) module is designed to guide the model to focus on informative features along dual parallel attention paths, enhancing effective representations and suppressing interference noise; 3) based on handcrafted features, ASR-oriented optimization targets are designed to improve recognition performance in the ATC environment by learning robust feature representations. By incorporating both SE-oriented and ASR-oriented losses, ROSE is trained in a multi-objective learning manner, optimizing representations shared across the two task objectives. Experimental results show that ROSE significantly outperforms other state-of-the-art methods on both the SE and ASR tasks, and all the proposed improvements are confirmed by dedicated experiments. In addition, the proposed approach achieves the desired performance improvements on public datasets.
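The two key mechanics in this abstract, an attention mask gating the skip connection and a combined SE/ASR loss over shared representations, can be sketched as below. This is a minimal illustration, not the paper's implementation: the sigmoid gating form, the weight names, and the loss weighting `alpha` are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def absf_fuse(enc_feat, dec_feat, w_enc, w_dec):
    # Attention-based skip fusion (sketch): an attention mask computed from
    # both the encoder skip features and the decoder features gates the
    # skip connection before fusing the two paths.
    mask = sigmoid(enc_feat @ w_enc + dec_feat @ w_dec)  # values in (0, 1)
    return mask * enc_feat + dec_feat

def multi_objective_loss(enhanced, clean, asr_feat_enh, asr_feat_clean, alpha=0.5):
    # SE-oriented time-domain loss plus an ASR-oriented feature-distance
    # loss, combined so the shared representations serve both objectives.
    se_loss = np.mean(np.abs(enhanced - clean))               # waveform L1
    asr_loss = np.mean((asr_feat_enh - asr_feat_clean) ** 2)  # feature L2
    return se_loss + alpha * asr_loss
```

With zero weights the mask is 0.5 everywhere, so the fused output is half the encoder features plus the decoder features; in training, the mask weights would be learned jointly with both losses.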
ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3365-3378.
This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies to four distinct tasks in dialog systems: 1) task-oriented dialog modeling with unstructured knowledge access, 2) multi-domain task-oriented dialog, 3) interactive evaluation of dialog, and 4) situated interactive multimodal dialog. This paper describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight general trends in the state-of-the-art technologies for these tasks.
Pub Date: 2024-07-11 | DOI: 10.1109/TASLP.2024.3426331
Overview of the Ninth Dialog System Technology Challenge: DSTC9
Chulaka Gunasekara;Seokhwan Kim;Luis Fernando D'Haro;Abhinav Rastogi;Yun-Nung Chen;Mihail Eric;Behnam Hedayatnia;Karthik Gopalakrishnan;Yang Liu;Chao-Wei Huang;Dilek Hakkani-Tür;Jinchao Li;Qi Zhu;Lingxiao Luo;Lars Liden;Kaili Huang;Shahin Shayandeh;Runze Liang;Baolin Peng;Zheng Zhang;Swadheen Shukla;Minlie Huang;Jianfeng Gao;Shikib Mehri;Yulan Feng;Carla Gordon;Seyed Hossein Alavi;David Traum;Maxine Eskenazi;Ahmad Beirami;Eunjoon Cho;Paul A. Crook;Ankita De;Alborz Geramifard;Satwik Kottur;Seungwhan Moon;Shivani Poddar;Rajen Subba
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4066-4076. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10595468
Pub Date: 2024-07-10 | DOI: 10.1109/TASLP.2024.3426287
Yinlong Xiao;Zongcheng Ji;Jianqiang Li;Mei Han
Integrating lexical knowledge in Chinese named entity recognition (NER) has been proven effective. Among the existing methods, Flat-LAttice Transformer (FLAT) has achieved great success in both performance and efficiency. FLAT performs lexical enhancement for each sentence by constructing a flat lattice (i.e., a sequence of tokens including the characters in a sentence and the matched words in a lexicon) and calculating self-attention with a fully-connected structure. However, the different interactions between tokens, which can bring different aspects of semantic information for Chinese NER, cannot be well captured by self-attention with a fully-connected structure. In this paper, we propose a novel Multi-View Transformer (MVT) to effectively capture the different interactions between tokens. We first define four views to capture four different token interaction structures. We then construct a view-aware visible matrix for each view according to the corresponding structure and introduce a view-aware dot-product attention for each view to limit the attention scope by incorporating the corresponding visible matrix. Finally, we design three different MVT variants to fuse the multi-view features at different levels of the Transformer architecture. Experimental results conducted on four public Chinese NER datasets show the effectiveness of the proposed method. Specifically, on the most challenging dataset Weibo, which is in an informal text style, MVT outperforms FLAT in F1 score by 2.56%, and when combined with BERT, MVT outperforms FLAT in F1 score by 3.03%.
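The core mechanism described above, dot-product attention whose scope is limited by a per-view visible matrix, can be sketched as follows. The dimensions and the masking constant are illustrative assumptions; the paper defines four concrete views over the flat lattice, which this sketch does not reproduce.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def view_aware_attention(q, k, v, visible):
    # View-aware dot-product attention: token i may only attend to token j
    # where visible[i, j] == 1; invisible pairs get a large negative score
    # so they receive (effectively) zero attention weight.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(visible.astype(bool), scores, -1e9)
    return softmax(scores, axis=-1) @ v
```

In a flat lattice of characters plus matched lexicon words, each view would build its own visible matrix (for instance, characters attending only to the words that contain them), and the per-view outputs would then be fused at some level of the Transformer, as the three MVT variants do.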
MVT: Chinese NER Using Multi-View Transformer
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3656-3668.
The present paper explores the use of several deep neural network architectures to carry out grapheme-to-phoneme (G2P) conversion, aiming to find a universal, language-independent approach to the task. The explored models are trained on whole sentences in order to automatically capture cross-word context (such as voicing assimilation) where it exists in the given language. Four languages, English, Czech, Russian, and German, were chosen for their differing natures and requirements for the G2P task. Ultimately, the Text-to-Text Transfer Transformer (T5) based model achieved very high conversion accuracy on all the tested languages. When trained on the public LibriSpeech database, it also exceeded the accuracy reached by a similar system.
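The whole-sentence, text-to-text framing and a typical way to score conversion accuracy can be sketched as below. The `"g2p: "` task prefix and the function names are illustrative assumptions, not the paper's code; the error metric is a standard token-level Levenshtein rate, not necessarily the paper's exact measure.

```python
def make_g2p_example(sentence, phonemes, prefix="g2p: "):
    # Format one whole-sentence training pair in the T5 text-to-text style,
    # so cross-word context (e.g. voicing assimilation) stays visible.
    return {"input": prefix + sentence, "target": phonemes}

def phoneme_error_rate(ref, hyp):
    # Levenshtein distance over phoneme tokens, normalized by reference
    # length -- a common way to score G2P conversion accuracy.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(r), 1)
```

Training on pairs like these (rather than isolated words) is what lets a seq2seq model learn sandhi-like effects that cross word boundaries.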
Pub Date: 2024-07-10 | DOI: 10.1109/TASLP.2024.3426332
T5G2P: Text-to-Text Transfer Transformer Based Grapheme-to-Phoneme Conversion
Markéta Řezáčková;Daniel Tihelka;Jindřich Matoušek
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3466-3476.
Pub Date: 2024-07-10 | DOI: 10.1109/TASLP.2024.3426292
Ho-Lam Chung;Ying-Hong Chan;Yao-Chung Fan
In recent years, Question Generation (QG) has gained significant attention as a research topic, particularly in the context of its potential to support automatic reading comprehension assessment preparation. However, current QG models are mostly trained on factoid-type datasets, which tend to produce questions that are too simple for assessing advanced abilities. One promising alternative is to train QG models on exam-type datasets, which contain questions that require content reasoning. Unfortunately, there is a shortage of such training data compared to factoid-type questions. To address this issue and improve the quality of QG for generating advanced questions, we propose the Handover QG