Pub Date: 2024-08-02 | DOI: 10.1109/TASLP.2024.3436702
Matteo Scerbo;Lauri Savioja;Enzo De Sena
Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more generally, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will to trade computational cost against early-reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show that the proposed extensions yield a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.
{"title":"Room Acoustic Rendering Networks With Control of Scattering and Early Reflections","authors":"Matteo Scerbo;Lauri Savioja;Enzo De Sena","doi":"10.1109/TASLP.2024.3436702","DOIUrl":"10.1109/TASLP.2024.3436702","url":null,"abstract":"Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more in general, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will between computational cost and early reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show how the proposed extensions result in a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3745-3758"},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet only a small number of studies have explored audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.
{"title":"A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism","authors":"Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng","doi":"10.1109/TASLP.2024.3426303","DOIUrl":"10.1109/TASLP.2024.3426303","url":null,"abstract":"Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet there is a small number of studies based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, demonstrating significant improvement compared with the single-modal transcription models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3618-3630"},"PeriodicalIF":4.1,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10614622","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-29 | DOI: 10.1109/TASLP.2024.3434497
Anurag Das;Ricardo Gutierrez-Osuna
Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text-to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate on phonetic information, and we examine whether a boost in MDD performance can be obtained by training the two tasks jointly. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed them to the MDD system along with the text input. The MDD system then predicts the target annotated phones and synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss and an MDD loss. Comparing the proposed system against an identical system without speech reconstruction and against another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the proposed system also achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones contain mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than speech generated from phones without mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.
{"title":"Improving Mispronunciation Detection Using Speech Reconstruction","authors":"Anurag Das;Ricardo Gutierrez-Osuna","doi":"10.1109/TASLP.2024.3434497","DOIUrl":"10.1109/TASLP.2024.3434497","url":null,"abstract":"Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text- to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate using phonetic information and we wanted to examine whether a boost in MDD performance can be by two tasks. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed it to the MDD system along with the text input. The MDD system then predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss as well as an MDD loss. Comparing the proposed systems against an identical system but without speech reconstruction and another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the and speaker verification simultaneously can lead to improve proposed system achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones have mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4420-4433"},"PeriodicalIF":4.1,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA