
Latest articles from the Journal on Audio Speech and Music Processing

A survey of technologies for automatic Dysarthric speech recognition
Computer Science (CAS Tier 3) | Pub Date: 2023-11-11 | DOI: 10.1186/s13636-023-00318-2
Zhaopeng Qian, Kejing Xiao, Chongchong Yu
Abstract Speakers with dysarthria often struggle to pronounce words accurately and communicate effectively with others. Automatic speech recognition (ASR) is a powerful tool for extracting the content of speech from speakers with dysarthria. However, the narrow concept of ASR typically covers only technologies that process acoustic-modality signals. In this paper, we broaden this concept to a generalized notion of ASR for dysarthric speech. Our survey discusses systems encompassing acoustic modality processing, articulatory movement processing, and audio-visual modality fusion as applied to recognizing dysarthric speech. In contrast to previous surveys on dysarthric speech recognition, we have conducted a systematic review of the advancements in this field. In particular, we introduce state-of-the-art technologies to supplement the survey of recent research in the era of multi-modality fusion in dysarthric speech recognition. Our survey finds that audio-visual fusion technologies perform better than traditional ASR technologies in the task of dysarthric speech recognition. However, training audio-visual fusion models requires more computing resources, and the available data corpora for dysarthric speech are limited. Despite these challenges, state-of-the-art technologies show promising potential for further improving the accuracy of dysarthric speech recognition in the future.
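To make the abstract's claim about audio-visual fusion concrete, the following is a minimal late-fusion sketch in which audio and visual frame features are encoded separately and concatenated before classification. All dimensions, the GRU encoders, and the concatenation strategy are illustrative assumptions, not the architecture of any system covered by the survey.

```python
import torch
import torch.nn as nn

class LateFusionASR(nn.Module):
    """Toy audio-visual fusion: encode each modality, concatenate, classify.

    Dimensions and the concatenation strategy are illustrative assumptions;
    surveyed systems use far richer encoders and alignment schemes.
    """

    def __init__(self, n_audio_feats=40, n_visual_feats=68, hidden=128, n_tokens=32):
        super().__init__()
        self.audio_enc = nn.GRU(n_audio_feats, hidden, batch_first=True)
        self.visual_enc = nn.GRU(n_visual_feats, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_tokens)

    def forward(self, audio, visual):
        # audio: (batch, frames, n_audio_feats), visual: (batch, frames, n_visual_feats)
        a, _ = self.audio_enc(audio)
        v, _ = self.visual_enc(visual)
        fused = torch.cat([a, v], dim=-1)   # frame-level late fusion
        return self.classifier(fused)       # per-frame token logits

model = LateFusionASR()
audio = torch.randn(2, 100, 40)    # e.g. 100 frames of 40-dim filterbanks
visual = torch.randn(2, 100, 68)   # e.g. 68 lip/landmark coordinates per frame
logits = model(audio, visual)
print(logits.shape)                # torch.Size([2, 100, 32])
```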
Citations: 0
Improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling
Computer Science (CAS Tier 3) | Pub Date: 2023-11-04 | DOI: 10.1186/s13636-023-00313-7
Kavya Manohar, Jayan A R, Rajeev Rajan
Abstract This article presents research on improving speech recognition systems for the morphologically complex Malayalam language using subword tokens for language modeling. The speech recognition system is built using deep neural network–hidden Markov model (DNN-HMM)-based automatic speech recognition (ASR). We propose a novel method, syllable-byte pair encoding (S-BPE), that combines linguistically informed syllable tokenization with the data-driven tokenization method of byte pair encoding (BPE). The proposed method ensures words are always segmented at valid pronunciation boundaries. On a text corpus that has been divided into tokens using the proposed method, we construct statistical n-gram language models and assess the modeling effectiveness in terms of both information-theoretic and corpus linguistic metrics. A comparative study of the proposed method with other data-driven (BPE, Morfessor, and Unigram), linguistic (Syllable), and baseline (Word) tokenization algorithms is also presented. Pronunciation lexicons of subword tokenized units are built with pronunciations described as graphemes. We develop ASR systems employing the subword tokenized language models and pronunciation lexicons. The resulting ASR models are comprehensively evaluated to answer the research questions regarding the impact of subword tokenization algorithms on language modeling complexity and on ASR performance. Our study highlights the strong performance of the hybrid S-BPE tokens, achieving a notable 10.6% word error rate (WER), which represents a substantial 16.8% improvement over the baseline word-level ASR system. The ablation study reveals that the performance of S-BPE segmentation, which initially underperformed compared to syllable tokens at lower amounts of textual data for language modeling, improves steadily with the increase in LM training data. The extensive ablation study indicates that there is limited advantage in raising the n-gram order of the language model beyond $n=3$. Such an increase results in considerable model size growth without significant improvements in WER. The implementation of the algorithm and all associated experiments are available under an open license, allowing for reproduction, adaptation, and reuse.
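As an illustration of the data-driven half of S-BPE, here is a minimal sketch of byte-pair-encoding merge learning over words that have already been split into base units. In the paper those units would be linguistically valid Malayalam syllables; the toy corpus, character-level base units, and merge count below are assumptions for demonstration only.

```python
from collections import Counter

def learn_bpe_merges(corpus, n_merges=10):
    """Learn BPE merges over words already split into base units.

    `corpus` is a list of words, each a tuple of starting units (here characters;
    in S-BPE they would be linguistically valid syllables). Illustrative sketch only.
    """
    words = Counter(corpus)
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# Toy corpus: pretend each character is a syllable-level base unit.
corpus = [tuple("malayalam"), tuple("malayali"), tuple("kerala"), tuple("malayalam")]
print(learn_bpe_merges(corpus, n_merges=5))
```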
Citations: 0
Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios
Computer Science (CAS Tier 3) | Pub Date: 2023-10-31 | DOI: 10.1186/s13636-023-00310-w
Stijn Kindt, Jenthe Thienpondt, Luca Becker, Nilesh Madhu
Abstract Speaker embeddings, from the ECAPA-TDNN speaker verification network, were recently introduced as features for the task of clustering microphones in ad hoc arrays. Our previous work demonstrated that, in comparison to signal-based Mod-MFCC features, using speaker embeddings yielded a more robust and logical clustering of the microphones around the sources of interest. This work aims to further establish speaker embeddings as a robust feature for ad hoc microphone clustering by addressing open and additional questions of practical interest, arising from our prior work. Specifically, whereas our initial work made use of simulated data based on shoe-box acoustics models, we now present a more thorough analysis in more realistic settings. Furthermore, we investigate additional important considerations such as the choice of the distance metric used in the fuzzy C-means clustering; the minimal time range across which data need to be aggregated to obtain robust clusters; and the performance of the features in increasingly more challenging situations, and with multiple speakers. We also contrast the results on the basis of several metrics for quantifying the quality of such ad hoc clusters. Results indicate that the speaker embeddings are robust to short inference times, and deliver logical and useful clusters, even when the sources are very close to each other.
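For readers unfamiliar with the clustering step, the sketch below implements textbook fuzzy C-means over per-microphone embedding vectors using a Euclidean distance; the distance metric, fuzziness exponent, and toy embeddings are illustrative assumptions, not the exact configuration studied in the paper.

```python
import numpy as np

def fuzzy_c_means(x, n_clusters=2, m=2.0, n_iters=100, seed=0):
    """Standard fuzzy C-means with Euclidean distance (illustrative sketch).

    x: (n_samples, dim) array, e.g. one speaker embedding per microphone.
    Returns (membership matrix U of shape (n_samples, n_clusters), centroids).
    """
    rng = np.random.default_rng(seed)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)                 # rows are soft memberships
    for _ in range(n_iters):
        um = u ** m
        centroids = (um.T @ x) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1) + 1e-12
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)      # update memberships
    return u, centroids

# Toy "embeddings" for six microphones around two well-separated speakers.
x = np.vstack([np.random.default_rng(1).normal(0, 0.1, (3, 4)),
               np.random.default_rng(2).normal(3, 0.1, (3, 4))])
u, c = fuzzy_c_means(x, n_clusters=2)
print(u.round(2))   # soft memberships; each row sums to 1
```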
Citations: 0
W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision
Computer Science (CAS Tier 3) | Pub Date: 2023-10-28 | DOI: 10.1186/s13636-023-00312-8
Hao Huang, Lin Wang, Jichen Yang, Ying Hu, Liang He
Abstract Non-parallel data voice conversion (VC) has achieved considerable breakthroughs in recent years owing to the use of self-supervised pre-trained representations (SSPR). Features extracted by the pre-trained model are expected to contain more content information. However, in common VC with SSPR, there is no dedicated mechanism to remove speaker information during content representation extraction, which prevents the speaker information from being further purged from the SSPR representation. Moreover, in conventional VC, the Mel-spectrogram is often selected as the reconstructed acoustic feature, which is not consistent with the input of the content encoder and results in some information loss. Motivated by the above, we propose W2VC to address these issues. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text context at the phoneme level, and a content encoder plus a gradient reversal layer (GRL)-based speaker classifier is used to remove speaker information during content representation extraction; (3) a WLMR-based HiFi-GAN is trained to convert WLMR to waveform speech. VC experimental results show that GRL can effectively purify the content information of the self-supervised model. The GRL purification and CTC supervision on the content encoder are complementary in improving the VC performance. Moreover, the speech synthesized using the WLMR-retrained vocoder achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It achieves 8.901 in objective MCD, 4.45 in speech naturalness, and 3.62 in speaker similarity of the subjective MOS score, which is superior to the baseline.
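The gradient reversal layer (GRL) used for speaker-information removal can be sketched in a few lines of PyTorch: it passes features through unchanged in the forward pass and flips (and scales) the gradient in the backward pass, so a speaker classifier trained on top of it pushes the upstream content encoder to discard speaker cues. The feature dimensions and the toy classifier below are assumptions for illustration.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy usage: the speaker classifier sees ordinary gradients, while the content
# encoder upstream receives reversed ones, pushing it to discard speaker information.
content = torch.randn(8, 256, requires_grad=True)      # assumed content-encoder output
speaker_logits = torch.nn.Linear(256, 10)(grad_reverse(content))
loss = torch.nn.functional.cross_entropy(speaker_logits, torch.randint(0, 10, (8,)))
loss.backward()
print(content.grad.shape)   # gradients w.r.t. content are sign-flipped by the GRL
```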
Citations: 0
YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation
Computer Science (CAS Tier 3) | Pub Date: 2023-10-19 | DOI: 10.1186/s13636-023-00306-6
Le Ma, Xinda Wu, Ruiyuan Tang, Chongjun Zhong, Kejun Zhang
Abstract Appropriate background music in e-commerce advertisements can help stimulate consumption and build product image. However, many factors such as emotion and product category must be taken into account, which makes manually selecting music time-consuming and dependent on professional knowledge; it therefore becomes crucial to automatically recommend music for video. Since no e-commerce advertisement dataset exists, we first establish a large-scale e-commerce advertisement dataset, Commercial-98K, which covers the major e-commerce categories. Then, we propose a video-music retrieval model, YuYin, to learn the correlation between video and music. We introduce a weighted fusion module (WFM) to fuse emotion features and audio features from music to obtain a more fine-grained music representation. Considering the similarity of music within the same product category, YuYin is trained by multi-task learning to explore the correlation between video and music through cross-matching of video, music, and tags as well as a category prediction task. We conduct extensive experiments to show that YuYin achieves a remarkable improvement in video-music retrieval on Commercial-98K.
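The abstract describes the weighted fusion module (WFM) only at a high level; one plausible reading is a learned gate that blends projected emotion and audio features, sketched below. The gating form and all dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Blend emotion and audio features with a learned gate (illustrative assumption)."""

    def __init__(self, emo_dim=64, audio_dim=128, out_dim=128):
        super().__init__()
        self.proj_emo = nn.Linear(emo_dim, out_dim)
        self.proj_audio = nn.Linear(audio_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, emo, audio):
        e, a = self.proj_emo(emo), self.proj_audio(audio)
        w = self.gate(torch.cat([e, a], dim=-1))   # per-dimension fusion weights in (0, 1)
        return w * e + (1.0 - w) * a               # weighted music representation

fusion = WeightedFusion()
music_repr = fusion(torch.randn(4, 64), torch.randn(4, 128))
print(music_repr.shape)   # torch.Size([4, 128])
```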
Citations: 0
Transformer-based autoencoder with ID constraint for unsupervised anomalous sound detection
Computer Science (CAS Tier 3) | Pub Date: 2023-10-13 | DOI: 10.1186/s13636-023-00308-4
Jian Guan, Youde Liu, Qiuqiang Kong, Feiyang Xiao, Qiaoxi Zhu, Jiantong Tian, Wenwu Wang
Abstract Unsupervised anomalous sound detection (ASD) aims to detect unknown anomalous sounds of devices when only normal sound data is available. Autoencoder (AE)-based and self-supervised learning-based methods are the two mainstream approaches. However, the AE-based methods can be limited, as the features learned from normal sounds may also fit anomalous sounds, reducing the model's ability to detect anomalies from sound. The self-supervised methods are not always stable and perform differently, even for machines of the same type. In addition, the anomalous sound may be short-lived, making it even harder to distinguish from normal sound. This paper proposes an ID-constrained Transformer-based autoencoder (IDC-TransAE) architecture with weighted anomaly score computation for unsupervised ASD. The machine ID is employed to constrain the latent space of the Transformer-based autoencoder (TransAE) by introducing a simple ID classifier to learn the differences in distribution for the same machine type and to enhance the model's ability to distinguish anomalous sound. Moreover, weighted anomaly score computation is introduced to highlight the anomaly scores of anomalous events that appear only briefly. Experiments performed on the DCASE 2020 Challenge Task 2 development dataset demonstrate the effectiveness and superiority of our proposed method.
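The idea behind weighted anomaly score computation, emphasising frames whose reconstruction error is largest so that short-lived anomalies are not averaged away, can be illustrated as follows. The top-fraction weighting and the toy spectrograms are assumptions; the paper's exact weighting scheme may differ.

```python
import numpy as np

def weighted_anomaly_score(x, x_hat, top_frac=0.1):
    """Frame-weighted reconstruction-error score (illustrative reading of the idea).

    x, x_hat: (frames, mel_bins) spectrogram and its autoencoder reconstruction.
    Emphasising the worst-reconstructed frames keeps short-lived anomalies from
    being averaged away; the top-fraction weighting here is an assumption, not
    the paper's exact formula.
    """
    frame_err = np.mean((x - x_hat) ** 2, axis=1)   # per-frame reconstruction error
    k = max(1, int(top_frac * len(frame_err)))
    worst = np.sort(frame_err)[-k:]                 # frames with the largest error
    return float(np.mean(worst))

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 64))
x_hat = x + rng.normal(scale=0.05, size=x.shape)
x_hat[90:95] += 1.5                                 # a brief, localised anomaly
print(weighted_anomaly_score(x, x_hat))             # score dominated by the short burst
```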
Citations: 0
Battling with the low-resource condition for snore sound recognition: introducing a meta-learning strategy
Computer Science (CAS Tier 3) | Pub Date: 2023-10-13 | DOI: 10.1186/s13636-023-00309-3
Jingtan Li, Mengkai Sun, Zhonghao Zhao, Xingcan Li, Gaigai Li, Chen Wu, Kun Qian, Bin Hu, Yoshiharu Yamamoto, Björn W. Schuller
Abstract Snoring affects 57 % of men, 40 % of women, and 27 % of children in the USA. Moreover, snoring is highly correlated with obstructive sleep apnoea (OSA), which is characterised by loud and frequent snoring. OSA is also closely associated with various life-threatening diseases such as sudden cardiac arrest and is regarded as a grave medical ailment. Preliminary studies have shown that in the USA, OSA affects over 34 % of men and 14 % of women. In recent years, polysomnography has increasingly been used to diagnose OSA. However, due to its drawbacks, such as being time-consuming and costly, intelligent audio analysis of snoring has emerged as an alternative method. Considering the higher demand for identifying the excitation location of snoring in clinical practice, we utilised the Munich-Passau Snore Sound Corpus (MPSSC) snoring database, which classifies the snoring excitation location into four categories. Nonetheless, the problem of small samples remains in the MPSSC database due to factors such as privacy concerns and difficulties in accurate labelling. In fact, accurately labelled medical data that can be used for machine learning is often scarce, especially for rare diseases. In view of this, Model-Agnostic Meta-Learning (MAML), a small-sample method based on meta-learning, is used in this work to classify snore signals with fewer resources. The experimental results indicate that even when using only the ESC-50 dataset (non-snoring sound signals) as the data for meta-training, we are able to achieve an unweighted average recall of 60.2 % on the test dataset after fine-tuning on just 36 instances of snoring from the development part of the MPSSC dataset. While our results exceed the baseline by only 4.4 %, they still demonstrate that even with fine-tuning on a few instances of snoring, our model can outperform the baseline. This implies that the MAML algorithm can effectively tackle the low-resource problem even with limited data resources.
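A compact sketch of the meta-learning strategy is given below using first-order MAML (FOMAML): adapt a copy of the model on each task's support set, compute the query loss with the adapted weights, and average those gradients into a meta-update. The first-order simplification, the tiny model, and the synthetic tasks are assumptions made for brevity; full MAML differentiates through the inner loop, and the paper meta-trains on ESC-50 before fine-tuning on MPSSC.

```python
import copy
import torch
import torch.nn as nn

def fomaml_step(model, tasks, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    """One first-order MAML update (FOMAML) over a batch of tasks.

    Each task is (support_x, support_y, query_x, query_y). First-order MAML is an
    assumption made for brevity; full MAML differentiates through the inner loop.
    """
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    loss_fn = nn.CrossEntropyLoss()
    for sx, sy, qx, qy in tasks:
        learner = copy.deepcopy(model)                 # task-specific copy
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # inner-loop adaptation
            opt.zero_grad()
            loss_fn(learner(sx), sy).backward()
            opt.step()
        learner.zero_grad()
        loss_fn(learner(qx), qy).backward()            # query loss on adapted weights
        for g, p in zip(meta_grads, learner.parameters()):
            g += p.grad / len(tasks)
    with torch.no_grad():                              # outer (meta) update
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))
tasks = [(torch.randn(8, 40), torch.randint(0, 4, (8,)),
          torch.randn(8, 40), torch.randint(0, 4, (8,))) for _ in range(4)]
fomaml_step(model, tasks)
```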
Citations: 0
Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments
Computer Science (CAS Tier 3) | Pub Date: 2023-10-12 | DOI: 10.1186/s13636-023-00307-5
Chunxi Wang, Maoshen Jia, Xinfeng Zhang
Abstract In recent years, the speaker-independent, single-channel speech separation problem has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each speaker of interest from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. To solve this problem, a speech separation method for noisy reverberation environments is proposed. Firstly, the time-domain end-to-end network structure of a deep encoder/decoder dual-path neural network is introduced for speech separation. Secondly, to prevent the model from falling into a local optimum during training, a loss function, the stretched optimal scale-invariant signal-to-noise ratio (SOSISNR), is proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). At the same time, to make the training better match the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of the time delay caused by reverberation on separation performance. Combining the above methods, the subjective and objective evaluation metrics show that this study achieves better separation performance in complex sound field environments compared to the baseline methods.
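The SOSISNR loss builds on the standard scale-invariant signal-to-noise ratio (SI-SNR), which is sketched below; since the abstract does not give the exact form of the "stretched" variant, only the base metric is shown, and the toy signals are assumptions.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (standard definition).

    est, ref: 1-D arrays of the estimated and reference (clean) signals.
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref   # projection onto ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) /
                           (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
ref = rng.normal(size=16000)                       # 1 s of "clean" speech at 16 kHz
est = 0.8 * ref + 0.05 * rng.normal(size=16000)    # scaled estimate with a little noise
print(round(si_snr(est, ref), 2))                  # scale-invariant: same for any gain
```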
Citations: 0
Speech emotion recognition based on Graph-LSTM neural network
Computer Science (CAS Tier 3) | Pub Date: 2023-10-11 | DOI: 10.1186/s13636-023-00303-9
Yan Li, Yapeng Wang, Xu Yang, Sio-Kei Im
Abstract Recently, Graph Neural Networks have been extended to the field of speech signal processing, as graphs provide a more compact and flexible way to represent speech sequences. However, the relationship structures used in recent studies tend to be relatively simple. Moreover, the graph convolution module exhibits limitations that impede its adaptability to intricate application scenarios. In this study, we construct the speech graph using feature similarity and introduce a novel graph neural network architecture that leverages an LSTM aggregator and weighted pooling. An unweighted accuracy of 65.39% and a weighted accuracy of 71.83% are obtained on the IEMOCAP dataset, achieving performance comparable to or better than existing graph baselines. This method can improve the interpretability of the model to some extent and identify speech emotion features effectively.
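The speech graph built from feature similarity can be illustrated with the following sketch, which connects frames whose cosine similarity exceeds a threshold. The cosine measure, the threshold, and the MFCC-like toy features are assumptions; the paper's construction may differ in detail.

```python
import numpy as np

def build_similarity_graph(frames, threshold=0.6):
    """Adjacency matrix from cosine similarity between frame-level features.

    frames: (n_frames, dim) array of acoustic features. The cosine measure and
    threshold are illustrative choices; the paper's construction may differ.
    """
    norm = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T                          # pairwise cosine similarity
    adj = (sim >= threshold).astype(float)       # connect sufficiently similar frames
    np.fill_diagonal(adj, 0.0)                   # no self-loops
    return adj

frames = np.random.default_rng(0).normal(size=(50, 39))   # e.g. 50 frames of MFCC features
adj = build_similarity_graph(frames)
print(adj.shape, int(adj.sum()) // 2, "edges")
```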
Citations: 0
An acoustic echo canceller optimized for hands-free speech telecommunication in large vehicle cabins
Computer Science (CAS Tier 3) | Pub Date: 2023-10-07 | DOI: 10.1186/s13636-023-00305-7
Amin Saremi, Balaji Ramkumar, Ghazaleh Ghaffari, Zonghua Gu
Abstract Acoustic echo cancelation (AEC) is a system identification problem that has been addressed by various techniques and most commonly by normalized least mean square (NLMS) adaptive algorithms. However, performing a successful AEC in large commercial vehicles has proved complicated due to the size and challenging variations in the acoustic characteristics of their cabins. Here, we present a wideband fully linear time domain NLMS algorithm for AEC that is enhanced by a statistical double-talk detector (DTD) and a voice activity detector (VAD). The proposed solution was tested in four main Volvo truck models, with various cabin geometries, using standard Swedish hearing-in-noise (HINT) sentences in the presence and absence of engine noise. The results show that the proposed solution achieves a high echo return loss enhancement (ERLE) of at least 25 dB with a fast convergence time, fulfilling ITU G.168 requirements. The presented solution was particularly developed to provide a practical compromise between accuracy and computational cost to allow its real-time implementation on commercial digital signal processors (DSPs). A real-time implementation of the solution was coded in C on an ARM Cortex M-7 DSP. The algorithmic latency was measured at less than 26 ms for processing each 50-ms buffer indicating the computational feasibility of the proposed solution for real-time implementation on common DSPs and embedded systems with limited computational and memory resources. MATLAB source codes and related audio files are made available online for reference and further development.
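A textbook time-domain NLMS update, the core of the proposed canceller before the DTD and VAD control logic are added, is sketched below. The tap count, step size, and synthetic echo path are assumptions for illustration; the paper's real-time C implementation on the ARM Cortex M-7 is considerably more involved.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, n_taps=128, mu=0.5, eps=1e-6):
    """Time-domain NLMS adaptive filter (textbook form, no DTD/VAD control logic).

    far_end: loudspeaker reference signal; mic: microphone signal (echo + near end).
    Returns the echo-cancelled error signal. Tap count and step size are illustrative.
    """
    w = np.zeros(n_taps)
    err = np.zeros(len(mic))
    for n in range(n_taps - 1, len(mic)):
        x = far_end[n - n_taps + 1:n + 1][::-1]         # current and past reference samples
        y_hat = np.dot(w, x)                            # estimated echo
        err[n] = mic[n] - y_hat
        w += (mu / (np.dot(x, x) + eps)) * err[n] * x   # normalised LMS update
    return err

rng = np.random.default_rng(0)
far_end = rng.normal(size=16000)
echo_path = rng.normal(scale=0.1, size=64)                  # unknown cabin impulse response
mic = np.convolve(far_end, echo_path, mode="full")[:16000]  # pure echo, no near-end talk
out = nlms_echo_canceller(far_end, mic)
print(round(float(np.mean(out[4000:] ** 2) / np.mean(mic[4000:] ** 2)), 4))  # residual ratio
```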
Citations: 0