
Eurasip Journal on Audio Speech and Music Processing: latest publications

Signal processing and machine learning for speech and audio in acoustic sensor networks
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-12-17, DOI: 10.1186/s13636-023-00322-6
Walter Kellermann, Rainer Martin, Nobutaka Ono
Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders, among others. However, conventional multichannel signal processing methods usually cannot be applied to jointly process the signals recorded by multiple distributed devices because synchronous recording is essential. Thus, commercially available microphone array processing is currently limited to a single device where all microphones are mounted. The full exploitation of the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, whose potential practical and commercial benefits prompted significant research efforts over the past decade.

Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing to overcome the limitations of individual devices. Along with wireless communications between microphone nodes and addressing new challenges in handling asynchronous channels, unknown microphone positions, and distributed computing, the WASN enables us to spatially distribute many recording devices. These may cover a wider area and utilize the nodes to form an extended microphone array. It promises to significantly improve the performance of various audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.

For this special issue, six papers were accepted which all address the above-mentioned fundamental challenges when using WASNs: First, the question of which sensors should be used for a specific signal processing task or extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for its synchronization on waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai and the task of removing specific interference from a single sensor signal is tackled by Kawamura et al.

The paper ‘Microphone utility estimation in acoustic sensor networks using single-channel signal features’ by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. Thereby, the need for transmitting waveforms for identifying suitable sensors for a synchronized cluster of sensors is avoided and the required amount of transmitted data can be reduced by several orders of magnitude. It is shown that both estimation-theoretic processing of single-channel features and deep learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed sensors.
The paper by Grinstein et al. combines multiple microphone signals from a distributed microphone array with information describing the acoustic properties of the scene, such as the microphone positions, room dimensions, and reverberation time, in order to improve sound source localization. Their proposed dual-input neural network (DI-NN) is a simple and efficient technique for building neural networks that can process two different types of data. They test it in different scenarios and compare it against other models such as a classical least-squares method and a convolutional recurrent neural network. Although the proposed DI-NN is not retrained for each new scenario, the authors' results demonstrate its superiority, achieving a substantial reduction of the localization error on both synthetic data and a dataset of real recordings.

In the paper by Hsu and Bai, the authors combine conventional and learning-based methods to strengthen speaker counting and source separation and to achieve robustness against unknown room impulse responses (RIRs) and array configurations. They propose a three-stage approach that requires the computation of a spatial coherence matrix (SCM) based on whitened relative transfer functions (wRTFs), which serve as spatial signatures of directional sound sources. The activity of the target speaker is detected by evaluating the SCM and a local coherence function. The eigenvalues of the SCM and the maximum similarity of the inter-frame global activity distributions between two speakers are then fed into a speaker counting network (SCnet). To extract each individual speaker signal, a global and local activity-driven network (GLADnet) is employed.

The final paper by Kawamura et al., titled ‘Acoustic object canceller: removing a known signal from monaural recording using blind synchronization’, addresses the removal of undesired interference from a single microphone signal when a reference signal of the interference is available. The authors' approach treats the interference as an acoustic object whose signal is linearly filtered before arriving at the receiving microphone. Assuming that the signals of the acoustic object and the microphone exhibit different sampling rates, the signals are first synchronized, and then the frequency response of the propagation path from the object to the microphone is determined by maximum likelihood estimation using a majorization-minimization algorithm; various statistical models of the desired signal to be preserved are investigated and evaluated.
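The core idea of the Guenther et al. paper, judging sensor utility from cheap single-channel features rather than transmitted waveforms, can be illustrated with a minimal sketch. The feature (a frame-wise log-energy envelope) and the correlation-based ranking below are illustrative assumptions, not the estimators used in the paper:

```python
import numpy as np

def log_energy_envelope(x, frame_len=1024, hop=512):
    """Frame-wise log energy: a cheap single-channel feature (a few values per
    frame instead of raw samples), used here as a stand-in for richer features."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def utility_ranking(node_signals, ref_index=0):
    """Rank sensor nodes by correlation of their feature trajectories with a
    reference node, as a proxy for waveform-level coherence."""
    feats = [log_energy_envelope(x) for x in node_signals]
    length = min(len(f) for f in feats)               # align feature lengths
    feats = np.stack([f[:length] for f in feats])
    ref = feats[ref_index]
    corr = np.array([np.corrcoef(f, ref)[0, 1] for f in feats])
    return np.argsort(corr)[::-1], corr

# Toy usage: three nodes observing the same source with different noise levels.
rng = np.random.default_rng(0)
src = rng.standard_normal(16000) * np.sin(np.linspace(0, 20 * np.pi, 16000))
nodes = [src + 0.1 * rng.standard_normal(16000),
         src + 0.5 * rng.standard_normal(16000),
         rng.standard_normal(16000)]                  # last node sees only noise
order, corr = utility_ranking(nodes)
print(order, np.round(corr, 2))
```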
{"title":"Signal processing and machine learning for speech and audio in acoustic sensor networks","authors":"Walter Kellermann, Rainer Martin, Nobutaka Ono","doi":"10.1186/s13636-023-00322-6","DOIUrl":"https://doi.org/10.1186/s13636-023-00322-6","url":null,"abstract":"&lt;p&gt;Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders, among others. However, conventional multichannel signal processing methods can usually not be applied to jointly process the signals recorded by multiple distributed devices because synchronous recording is essential. Thus, commercially available microphone array processing is currently limited to a single device where all microphones are mounted. The full exploitation of the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, whose potential practical and commercial benefits prompted significant research efforts over the past decade.&lt;/p&gt;&lt;p&gt;Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing to overcome the limitations of individual devices. Along with wireless communications between microphone nodes and addressing new challenges in handling asynchronous channels, unknown microphone positions, and distributed computing, the WASN enables us to spatially distribute many recording devices. These may cover a wider area and utilize the nodes to form an extended microphone array. It promises to significantly improve the performance of various audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.&lt;/p&gt;&lt;p&gt;For this special issue, six papers were accepted which all address the above-mentioned fundamental challenges when using WASNs: First, the question of which sensors should be used for a specific signal processing task or extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for its synchronization on waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai and the task of removing specific interference from a single sensor signal is tackled by Kawamura et al.&lt;/p&gt;&lt;p&gt;The paper ‘Microphone utility estimation in acoustic sensor networks using single-channel signal features’ by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. Thereby, the need for transmitting waveforms for identifying suitable sensors for a synchronized cluster of sensors is avoided and the required amount of transmitted data can be reduced by several orders of magnitude. 
It is shown that both estimation-theoretic processing of single-channel features and deep learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed se","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"55 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138717609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Lightweight target speaker separation network based on joint training
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-12-06, DOI: 10.1186/s13636-023-00317-3
Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi, Fang Liu
Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model’s performance. A lightweight target speaker separation network based on long-short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system’s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.
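As a rough illustration of the kind of architecture the abstract describes, the sketch below conditions a small LSTM mask estimator on a speaker embedding and combines a separation term with a speaker-registration term in one joint loss. Layer sizes, the conditioning mechanism, and the loss weighting are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class TinyTargetSeparator(nn.Module):
    """Illustrative LSTM mask estimator conditioned on a speaker embedding."""
    def __init__(self, n_freq=257, emb_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + emb_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, spk_emb):
        # mix_mag: (batch, time, freq), spk_emb: (batch, emb_dim)
        emb = spk_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.lstm(torch.cat([mix_mag, emb], dim=-1))
        return self.mask(h) * mix_mag                  # masked target magnitude

def joint_loss(est_mag, tgt_mag, emb_pred, emb_ref, alpha=0.1):
    """Joint objective: spectral MSE for separation plus a registration term
    pulling the predicted speaker embedding towards the enrolled one."""
    sep = nn.functional.mse_loss(est_mag, tgt_mag)
    reg = 1.0 - nn.functional.cosine_similarity(emb_pred, emb_ref, dim=-1).mean()
    return sep + alpha * reg

model = TinyTargetSeparator()
mix = torch.rand(2, 100, 257)
enrolled = torch.rand(2, 64)
est = model(mix, enrolled)
loss = joint_loss(est, torch.rand(2, 100, 257), torch.rand(2, 64), enrolled)
loss.backward()
```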
Citations: 0
Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-12-05, DOI: 10.1186/s13636-023-00315-5
Pierre-Amaury Grumiaux, Mathieu Lagrange
The task of bandwidth extension addresses the generation of missing high frequencies of audio signals based on knowledge of the low-frequency part of the sound. This task applies to various problems, such as audio coding or audio restoration. In this article, we focus on efficient bandwidth extension of monophonic and polyphonic musical signals using a differentiable digital signal processing (DDSP) model. Such a model is composed of a neural network part with relatively few parameters trained to infer the parameters of a differentiable digital signal processing model, which efficiently generates the output full-band audio signal. We first address bandwidth extension of monophonic signals, and then propose two methods to explicitly handle polyphonic signals. The benefits of the proposed models are first demonstrated on monophonic and polyphonic synthetic data against a baseline and a deep-learning-based ResNet model. The models are next evaluated on recorded monophonic and polyphonic data, for a wide variety of instruments and musical genres. We show that all proposed models surpass a higher complexity deep learning model for an objective metric computed in the frequency domain. A MUSHRA listening test confirms the superiority of the proposed approach in terms of perceptual quality.
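The harmonic-plus-noise synthesizer at the heart of such a DDSP model can be written down compactly: a bank of sinusoids at integer multiples of the fundamental plus broadband noise. The sketch below is a fixed, non-trainable numpy rendition; in the actual model a neural network predicts the per-frame harmonic amplitudes and noise spectrum, which is not shown here:

```python
import numpy as np

def harmonic_plus_noise(f0, harm_amps, noise_gain, sr=16000, dur=1.0):
    """Render a tone as sum_k a_k * sin(2*pi*k*f0*t) plus broadband noise.
    In a DDSP bandwidth-extension setting, a network would predict harm_amps
    and the noise filter per frame; here they are constants for illustration."""
    t = np.arange(int(sr * dur)) / sr
    harmonics = sum(a * np.sin(2 * np.pi * (k + 1) * f0 * t)
                    for k, a in enumerate(harm_amps))
    noise = noise_gain * np.random.default_rng(0).standard_normal(t.size)
    return harmonics + noise

# Toy usage: a 220 Hz tone whose upper harmonics (missing from a low-passed
# input) are re-synthesised explicitly.
y = harmonic_plus_noise(220.0, harm_amps=[1.0, 0.5, 0.25, 0.125], noise_gain=0.01)
print(y.shape)
```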
Citations: 0
Piano score rearrangement into multiple difficulty levels via notation-to-notation approach
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-12-05, DOI: 10.1186/s13636-023-00321-7
Masahiro Suzuki
Musical score rearrangement is an emerging area in symbolic music processing, which aims to transform a musical score into a different style. This study focuses on the task of changing the playing difficulty of piano scores, addressing two challenges in musical score rearrangement. First, we address the challenge of handling musical notation on scores. While symbolic music research often relies on note-level (MIDI-equivalent) information, musical scores contain notation that cannot be adequately represented at the note level. We propose an end-to-end framework that utilizes tokenized representations of notation to directly rearrange musical scores at the notation level. We also propose the ST+ representation, which includes a novel structure and token types for better score rearrangement. Second, we address the challenge of rearranging musical scores across multiple difficulty levels. We introduce a difficulty conditioning scheme to train a single sequence model capable of handling various difficulty levels, while leveraging scores from various levels in model training. We collect commercial-quality pop piano scores at four difficulty levels and train a MEGA model (with 0.3M parameters) to rearrange between these levels. Objective evaluation shows that our method successfully rearranges piano scores into other three difficulty levels, achieving comparable difficulty to human-made scores. Additionally, our method successfully generates musical notation including articulations. Subjective evaluation (by score experts and musicians) also reveals that our generated scores generally surpass the quality of previous rule-based or note-level methods on several criteria. Our framework enables novel notation-to-notation processing of scores and can be applied to various score rearrangement tasks.
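Two ideas from the abstract are easy to make concrete in code: representing notation as a token sequence and prepending a difficulty condition token so that one sequence model can target several levels. The token names below are hypothetical and do not reproduce the paper's ST+ vocabulary:

```python
# Minimal illustration of notation-level tokens with a difficulty condition token.
DIFFICULTY_TOKENS = {1: "<easy>", 2: "<intermediate>", 3: "<advanced>", 4: "<expert>"}

def encode_score(events, target_difficulty):
    """Turn a list of notation events into tokens, conditioned on difficulty."""
    tokens = [DIFFICULTY_TOKENS[target_difficulty], "<bos>"]
    for ev in events:
        tokens.append(f"bar_{ev['bar']}")
        tokens.append(f"note_{ev['pitch']}")
        tokens.append(f"dur_{ev['duration']}")
        if ev.get("articulation"):                 # notation beyond MIDI note level
            tokens.append(f"artic_{ev['articulation']}")
    tokens.append("<eos>")
    return tokens

score = [
    {"bar": 1, "pitch": "C4", "duration": "quarter", "articulation": "staccato"},
    {"bar": 1, "pitch": "E4", "duration": "quarter"},
]
print(encode_score(score, target_difficulty=2))
```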
Citations: 0
Effective acoustic parameters for automatic classification of performed and synthesized Guzheng music
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-12-01, DOI: 10.1186/s13636-023-00320-8
Huiwen Xue, Chenxin Sun, Mingcheng Tang, Chenrui Hu, Zhengqing Yuan, Min Huang, Zhongzhe Xiao
This study focuses on exploring the acoustic differences between synthesized Guzheng pieces and real Guzheng performances, with the aim of improving the quality of synthesized Guzheng music. A dataset with consideration of generalizability with multiple sources and genres is constructed as the basis of analysis. Classification accuracy up to 93.30% with a single feature put forward the fact that although the synthesized Guzheng pieces in subjective perception evaluation are recognized by human listeners, there is a very significant difference to the performed Guzheng music. With features compensating to each other, a combination of only three features can achieve a nearly perfect classification accuracy of 99.73%, with the essential two features related to spectral flux and an auxiliary feature related to MFCC. The conclusion of this work points out a potential future improvement direction in Guzheng synthesized algorithms with spectral flux properties.
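Spectral flux, the feature family the study found most discriminative, measures the frame-to-frame positive change of the magnitude spectrum. A common formulation is sketched below with numpy; the exact definition and normalization used in the paper may differ:

```python
import numpy as np

def spectral_flux(x, frame_len=2048, hop=512):
    """Spectral flux per frame: sum of positive magnitude increments between
    consecutive STFT frames (one common definition; variants normalise differently)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    mags = np.stack([np.abs(np.fft.rfft(window * x[i * hop:i * hop + frame_len]))
                     for i in range(n_frames)])
    diff = np.diff(mags, axis=0)
    return np.sum(np.maximum(diff, 0.0), axis=1)

# Toy usage: flux stays low for a steady tone and jumps when a new partial appears.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
tone[sr // 2:] += 0.5 * np.sin(2 * np.pi * 880 * t[sr // 2:])
print(spectral_flux(tone).round(1))
```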
Citations: 0
Predominant audio source separation in polyphonic music
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-11-24, DOI: 10.1186/s13636-023-00316-4
Lekshmi Chandrika Reghunath, Rajeev Rajan
Predominant source separation is the separation of one or more desired predominant signals, such as voice or leading instruments, from polyphonic music. The proposed work uses time-frequency filtering on predominant source separation and conditional adversarial networks to improve the perceived quality of isolated sounds. The pitch tracks corresponding to the prominent sound sources of the polyphonic music are estimated using a predominant pitch extraction algorithm and a binary mask corresponding to each pitch track and its harmonics are generated. Time-frequency filtering is performed on the spectrogram of the input signal using a binary mask that isolates the dominant sources based on pitch. The perceptual quality of source-separated music signal is enhanced using a CycleGAN-based conditional adversarial network operating on spectrogram images. The proposed work is systematically evaluated using the IRMAS and ADC 2004 datasets. Subjective and objective evaluations have been carried out. The reconstructed spectrogram is converted back to music signals by applying the inverse short-time Fourier transform. The intelligibility of separated audio is enhanced using an intelligibility enhancement module based on an audio style transfer scheme. The performance of the proposed method is compared with state-of-the-art Demucs and Wave-U-Net architectures and shows competing performance both objectively and subjectively.
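A minimal sketch of the masking step described above: given a pitch track for the predominant source, keep time-frequency bins near its harmonics, zero the rest, and invert. The pitch track is assumed known here, whereas the paper estimates it with a predominant pitch extraction algorithm, and the CycleGAN enhancement stage is omitted:

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_mask(f0_track, freqs, n_harmonics=10, width_hz=40.0):
    """Binary TF mask keeping bins within width_hz of each harmonic of f0."""
    mask = np.zeros((len(freqs), len(f0_track)), dtype=bool)
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                                # unvoiced frame: keep nothing
            continue
        for k in range(1, n_harmonics + 1):
            mask[:, t] |= np.abs(freqs - k * f0) <= width_hz
    return mask

sr = 16000
x = np.random.default_rng(0).standard_normal(sr)   # stand-in for polyphonic audio
f, t, X = stft(x, fs=sr, nperseg=1024)
f0_track = np.full(X.shape[1], 220.0)               # assumed known pitch track
X_sep = X * harmonic_mask(f0_track, f)              # time-frequency filtering
_, y = istft(X_sep, fs=sr, nperseg=1024)            # back to the time domain
print(y.shape)
```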
Citations: 0
MYRiAD: a multi-array room acoustic database.
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-01-01, DOI: 10.1186/s13636-023-00284-9
Thomas Dietzen, Randall Ali, Maja Taseska, Toon van Waterschoot

In the development of acoustic signal processing algorithms, their evaluation in various acoustic environments is of utmost importance. In order to advance evaluation in realistic and reproducible scenarios, several high-quality acoustic databases have been developed over the years. In this paper, we present another complementary database of acoustic recordings, referred to as the Multi-arraY Room Acoustic Database (MYRiAD). The MYRiAD database is unique in its diversity of microphone configurations suiting a wide range of enhancement and reproduction applications (such as assistive hearing, teleconferencing, or sound zoning), the acoustics of the two recording spaces, and the variety of contained signals including 1214 room impulse responses (RIRs), reproduced speech, music, and stationary noise, as well as recordings of live cocktail parties held in both rooms. The microphone configurations comprise a dummy head (DH) with in-ear omnidirectional microphones, two behind-the-ear (BTE) pieces equipped with 2 omnidirectional microphones each, 5 external omnidirectional microphones (XMs), and two concentric circular microphone arrays (CMAs) consisting of 12 omnidirectional microphones in total. The two recording spaces, namely the SONORA Audio Laboratory (SAL) and the Alamire Interactive Laboratory (AIL), have reverberation times of 2.1 s and 0.5 s, respectively. Audio signals were reproduced using 10 movable loudspeakers in the SAL and a built-in array of 24 loudspeakers in the AIL. MATLAB and Python scripts are included for accessing the signals as well as microphone and loudspeaker coordinates. The database is publicly available (https://zenodo.org/record/7389996).
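The abstract notes that MATLAB and Python scripts ship with the database for accessing signals and coordinates. A typical downstream use of such RIR data, simulating a microphone signal by convolving dry speech with one measured RIR, might look like the sketch below; the file names are placeholders, and the actual directory layout is documented with the database itself (https://zenodo.org/record/7389996):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

# Placeholder file names; both signals are assumed to be mono WAV files.
sr_rir, rir = wavfile.read("myriad_sal_rir_example.wav")   # one measured RIR
sr_dry, dry = wavfile.read("dry_speech_example.wav")        # anechoic speech
assert sr_rir == sr_dry, "resample one of the signals if the rates differ"

rir = rir.astype(np.float64) / np.max(np.abs(rir))
dry = dry.astype(np.float64) / np.max(np.abs(dry))

# Simulate the microphone signal for this source/microphone pair.
wet = fftconvolve(dry, rir)[: len(dry)]
wavfile.write("simulated_recording.wav", sr_dry,
              (wet / np.max(np.abs(wet))).astype(np.float32))
```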

Citations: 3
Explicit-memory multiresolution adaptive framework for speech and music separation.
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2023-01-01, Epub Date: 2023-05-09, DOI: 10.1186/s13636-023-00286-7
Ashwin Bellur, Karan Thakkar, Mounya Elhilali

The human auditory system employs a number of principles to facilitate the selection of perceptually separated streams from a complex sound mixture. The brain leverages multi-scale redundant representations of the input and uses memory (or priors) to guide the selection of a target sound from the input mixture. Moreover, feedback mechanisms refine the memory constructs resulting in further improvement of selectivity of a particular sound object amidst dynamic backgrounds. The present study proposes a unified end-to-end computational framework that mimics these principles for sound source separation applied to both speech and music mixtures. While the problems of speech enhancement and music separation have often been tackled separately due to constraints and specificities of each signal domain, the current work posits that common principles for sound source separation are domain-agnostic. In the proposed scheme, parallel and hierarchical convolutional paths map input mixtures onto redundant but distributed higher-dimensional subspaces and utilize the concept of temporal coherence to gate the selection of embeddings belonging to a target stream abstracted in memory. These explicit memories are further refined through self-feedback from incoming observations in order to improve the system's selectivity when faced with unknown backgrounds. The model yields stable outcomes of source separation for both speech and music mixtures and demonstrates benefits of explicit memory as a powerful representation of priors that guide information selection from complex inputs.
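A heavily simplified illustration of the temporal-coherence gating idea: frame embeddings are compared with an explicit memory vector for the target stream, and only coherent frames pass the gate. The real model learns multi-scale embeddings and refines the memory through feedback, neither of which is modeled in this numpy sketch:

```python
import numpy as np

def coherence_gate(embeddings, memory, win=5, threshold=0.6):
    """Gate frame embeddings by their smoothed cosine similarity with an
    explicit memory vector representing the target stream."""
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    mem = memory / (np.linalg.norm(memory) + 1e-9)
    sim = emb @ mem                                      # per-frame cosine similarity
    sim_smooth = np.convolve(sim, np.ones(win) / win, mode="same")  # temporal smoothing
    gate = (sim_smooth > threshold).astype(float)
    return embeddings * gate[:, None], gate

rng = np.random.default_rng(0)
memory = rng.standard_normal(32)                         # prior for the target stream
target = memory + 0.3 * rng.standard_normal((50, 32))    # frames dominated by target
noise = rng.standard_normal((50, 32))                    # frames dominated by background
_, gate = coherence_gate(np.concatenate([target, noise]), memory)
print(gate[:50].mean(), gate[50:].mean())                # high for target, low for noise
```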

Citations: 0
Time-frequency scattering accurately models auditory similarities between instrumental playing techniques.
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2021-01-01, Epub Date: 2021-01-11, DOI: 10.1186/s13636-020-00187-z
Vincent Lostanlen, Christian El-Hajj, Mathias Rossignol, Grégoire Lafay, Joakim Andén, Mathieu Lagrange

Instrumental playing techniques such as vibratos, glissandos, and trills often denote musical expressivity, both in classical and folk contexts. However, most existing approaches to music similarity retrieval fail to describe timbre beyond the so-called "ordinary" technique, use instrument identity as a proxy for timbre quality, and do not allow for customization to the perceptual idiosyncrasies of a new subject. In this article, we ask 31 human participants to organize 78 isolated notes into a set of timbre clusters. Analyzing their responses suggests that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone. In addition, we propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques. Our model relies on joint time-frequency scattering features to extract spectrotemporal modulations as acoustic features. Furthermore, it minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we report a state-of-the-art average precision at rank five (AP@5) of 99.0% ± 1. An ablation study demonstrates that removing either the joint time-frequency scattering transform or the metric learning algorithm noticeably degrades performance.
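The metric-learning component can be illustrated by the triplet hinge term that LMNN-style objectives minimize: an anchor is pulled towards a same-cluster neighbor while impostors are pushed at least a margin further away. The scattering features are assumed to be precomputed, and the linear map below is a toy stand-in for the learned metric:

```python
import numpy as np

def lmnn_style_triplet_loss(L, anchors, positives, negatives, margin=1.0):
    """Hinge loss over triplets under a linear map L (the learned metric):
    d(a, p) should be smaller than d(a, n) by at least the margin."""
    def d2(x, y):
        diff = (x - y) @ L.T
        return np.sum(diff ** 2, axis=1)
    pull = d2(anchors, positives)                              # target-neighbour term
    push = np.maximum(0.0, margin + pull - d2(anchors, negatives))
    return pull.mean() + push.mean()

rng = np.random.default_rng(0)
dim = 16
L = np.eye(dim)                                                # identity metric to start
a = rng.standard_normal((8, dim))                              # anchor features
p = a + 0.1 * rng.standard_normal((8, dim))                    # same timbre cluster
n = rng.standard_normal((8, dim))                              # different cluster
print(lmnn_style_triplet_loss(L, a, p, n))
```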

Citations: 10
End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network.
IF 2.4, CAS Tier 3 (Computer Science), Q2 ACOUSTICS, Pub Date: 2021-01-01, Epub Date: 2021-05-12, DOI: 10.1186/s13636-021-00208-5
Duowei Tang, Peter Kuppens, Luc Geurts, Toon van Waterschoot

Amongst the various characteristics of a speech signal, the expression of emotion is one of the characteristics that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model that is capable of learning sufficiently long temporal dependencies in the analysed speech signal. Therefore, in this work, we propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers and is hence suitable for parallel processing, while avoiding the inherent lack of parallelisability occurring with recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model is capable of exploiting long-term temporal dependencies hence providing an alternative to the use of RNN layers. We evaluate the proposed model in SER regression and classification tasks and provide a comparison with a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only 1/3 of the number of model parameters used in the state-of-the-art model, while also significantly improving SER performance. Further experiments are reported to understand the impact of using various types of input representations (i.e. raw audio samples vs log mel-spectrograms) and to illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings preserving speech emotion information.
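The receptive-field argument in the abstract can be made concrete with a toy stack of dilated causal 1-D convolutions: each layer is left-padded by (kernel_size - 1) * dilation so no output frame looks into the future, and dilations grow geometrically so the receptive field grows exponentially with depth. Channel counts and depth are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """Stack of dilated causal Conv1d layers with geometrically growing dilation."""
    def __init__(self, channels=32, kernel_size=2, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(n_layers):
            dilation = 2 ** i
            self.pads.append((kernel_size - 1) * dilation)   # left padding => causal
            self.layers.append(nn.Conv1d(channels, channels, kernel_size,
                                         dilation=dilation))
        # Receptive field of the stack: 1 + sum_i (kernel_size - 1) * 2**i frames.
        self.receptive_field = 1 + sum(self.pads)

    def forward(self, x):
        # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.layers):
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

net = DilatedCausalStack()
print(net.receptive_field)                  # 256 frames for kernel 2, 8 layers
y = net(torch.randn(1, 32, 400))
print(y.shape)                              # time dimension preserved: (1, 32, 400)
```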

Citations: 15