
Eurasip Journal on Audio Speech and Music Processing: Latest Publications

Generating chord progression from melody with flexible harmonic rhythm and controllable harmonic density
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2024-01-15, DOI: 10.1186/s13636-023-00314-6
Shangda Wu, Yue Yang, Zhaowen Wang, Xiaobing Li, Maosong Sun
Melody harmonization, which involves generating a chord progression that complements a user-provided melody, continues to pose a significant challenge. A chord progression must not only be in harmony with the melody but also be interdependent with its rhythmic pattern. While previous neural network-based systems have been successful in producing chord progressions for given melodies, they have not adequately addressed controllable melody harmonization, nor have they focused on generating harmonic rhythms with flexibility in the rates or patterns of chord changes. This paper presents AutoHarmonizer, a novel system for harmonic density-controllable melody harmonization with such a flexible harmonic rhythm. AutoHarmonizer is equipped with an extensive vocabulary of 1462 chord types and can generate chord progressions that vary in harmonic density for a given melody. Experimental results indicate that the AutoHarmonizer-generated chord progressions exhibit a diverse range of harmonic rhythms and that the system’s controllable harmonic density is effective.
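The abstract gives no implementation details; as a rough, hypothetical illustration of what harmonic-density control can mean in practice (this is not the authors' AutoHarmonizer), the Python sketch below uses a user-set density value to decide, beat by beat, whether a chord change is allowed, so that a higher density yields more frequent chord changes. The chord scores and names are invented placeholders.

# Toy illustration of density-controlled chord placement (not the authors' AutoHarmonizer).
# Assumes some model already produced, for every beat, a score per candidate chord;
# `density` in (0, 1] scales how often a chord change is allowed.
import numpy as np

def place_chords(chord_scores, chord_names, density=0.5):
    """chord_scores: (num_beats, num_chords) array of model scores.
    Returns one chord label per beat; higher density -> more changes."""
    num_beats = chord_scores.shape[0]
    progression = []
    current = None
    for t in range(num_beats):
        best = chord_names[int(np.argmax(chord_scores[t]))]
        # Allow a change only on a fraction of beats controlled by `density`.
        change_allowed = (t % max(1, round(1 / max(density, 1e-3))) == 0)
        if change_allowed or current is None:
            current = best
        progression.append(current)
    return progression

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    names = ["C", "F", "G", "Am"]
    scores = rng.random((16, len(names)))
    print(place_chords(scores, names, density=0.25))  # sparse harmonic rhythm
    print(place_chords(scores, names, density=1.0))   # chord change allowed on every beat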
Citations: 0
Correction: Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2024-01-15, DOI: 10.1186/s13636-023-00319-1
Stijn Kindt, Jenthe Thienpondt, Luca Becker, Nilesh Madhu

Correction: EURASIP Journal on Audio, Speech, and Music Processing 2023, 46 (2023)

https://doi.org/10.1186/s13636-023-00310-w

Following publication of the original article [1], we have been notified that, in Figure 14, each cluster subfigure contained an additional bottom row. These rows have been removed.

Originally published Figure 14: [figure not reproduced]

Corrected Figure 14: [figure not reproduced]

The original article has been corrected.

  1. Kindt et al., Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios. EURASIP J. Audio Speech Music Process. 2023, 46 (2023). https://doi.org/10.1186/s13636-023-00310-w



Authors and Affiliations

  1. IDLab, Department of Electronics and Information Systems, Ghent University - Imec, Ghent, Belgium

    Stijn Kindt, Jenthe Thienpondt & Nilesh Madhu

  2. Institute of Communication Acoustics, Ruhr-Universität Bochum, Bochum, Germany

    Luca Becker


Corresponding author

Correspondence to Stijn Kindt.


Citations: 0
Neural electric bass guitar synthesis framework enabling attack-sustain-representation-based technique control
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2024-01-11, DOI: 10.1186/s13636-024-00327-9
Junya Koguchi, Masanori Morise
Musical instrument sound synthesis (MISS) often utilizes a text-to-speech framework because of its similarity to speech in terms of generating sounds from symbols. Moreover, a plucked string instrument, such as electric bass guitar (EBG), shares acoustical similarities with speech. We propose an attack-sustain (AS) representation of the playing technique to take advantage of this similarity. The AS representation treats the attack segment as an unvoiced consonant and the sustain segment as a voiced vowel. In addition, we propose a MISS framework for an EBG that can control its playing techniques: (1) we constructed an EBG sound database containing a rich set of playing techniques, (2) we developed a dynamic time warping and timbre conversion procedure to align the sounds and AS labels, and (3) we extended an existing MISS framework to control playing techniques using the AS representation as control symbols. The experimental evaluation suggests that our AS representation effectively controls the playing techniques and improves the naturalness of the synthetic sound.
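As a purely hypothetical illustration of how an attack-sustain label sequence might be organized (the paper's actual label format, technique names, and timings are not given here), the sketch below splits each note event into an unvoiced-consonant-like attack segment and a voiced-vowel-like sustain segment, each tagged with a playing technique.

# Hypothetical sketch of an attack-sustain (AS) label sequence for plucked notes,
# mirroring the consonant/vowel split used in TTS front ends. Names and timings
# are illustrative only; the paper's actual label format may differ.
from dataclasses import dataclass

@dataclass
class ASSegment:
    kind: str        # "attack" (unvoiced-consonant-like) or "sustain" (voiced-vowel-like)
    technique: str   # playing technique, e.g., "finger", "slap", "mute"
    pitch: int       # MIDI note number (meaningful mainly for the sustain part)
    duration: float  # seconds

def note_to_as(pitch, duration, technique="finger", attack_len=0.03):
    """Split one note event into an attack segment followed by a sustain segment."""
    attack = ASSegment("attack", technique, pitch, min(attack_len, duration))
    sustain = ASSegment("sustain", technique, pitch, max(duration - attack_len, 0.0))
    return [attack, sustain]

if __name__ == "__main__":
    # A two-note bass phrase: E1 then G1, the second note slapped.
    phrase = note_to_as(28, 0.5) + note_to_as(31, 0.25, technique="slap")
    for seg in phrase:
        print(seg)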
Citations: 0
Significance of relative phase features for shouted and normal speech classification
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2024-01-06, DOI: 10.1186/s13636-023-00324-4
Khomdet Phapatanaburi, Longbiao Wang, Meng Liu, Seiichi Nakagawa, Talit Jumphoo, Peerapong Uthansakul
Shouted and normal speech classification plays an important role in many speech-related applications. The existing works are often based on magnitude-based features and ignore phase-based features, which are directly related to magnitude information. In this paper, the importance of phase-based features is explored for the detection of shouted speech. The novel contributions of this work are as follows. (1) Three phase-based features, namely, relative phase (RP), linear prediction analysis estimated speech-based RP (LPAES-RP) and linear prediction residual-based RP (LPR-RP) features, are explored for shouted and normal speech classification. (2) We propose a new RP feature, called the glottal source-based RP (GRP) feature. The main idea of the proposed GRP feature is to exploit the difference between RP and LPAES-RP features to detect shouted speech. (3) A score combination of phase- and magnitude-based features is also employed to further improve the classification performance. The proposed feature and combination are evaluated using the shouted normal electroglottograph speech (SNE-Speech) corpus. The experimental findings show that the RP, LPAES-RP, and LPR-RP features provide promising results for the detection of shouted speech. We also find that the proposed GRP feature can provide better results than those of the standard mel-frequency cepstral coefficient (MFCC) feature. Moreover, compared to using individual features, the score combination of the MFCC and RP/LPAES-RP/LPR-RP/GRP features yields an improved detection performance. Performance analysis under noisy environments shows that the score combination of the MFCC and the RP/LPAES-RP/LPR-RP features gives more robust classification. These outcomes show the importance of RP features in distinguishing shouted speech from normal speech.
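For readers unfamiliar with relative phase features, the sketch below computes one commonly used formulation: the STFT phase of every bin is re-referenced so that the phase at a chosen base-frequency bin stays constant across frames, which removes the dependence on frame position. The base bin, frame settings, and test signal are assumptions for illustration and are not taken from the paper.

# Sketch of a relative-phase (RP) feature: re-reference each frame's STFT phase so the
# phase at one base-frequency bin is constant, following a common formulation in the
# literature (base bin, frame length, etc. are assumptions, not the paper's settings).
import numpy as np

def relative_phase(signal, frame_len=512, hop=256, base_bin=8, base_phase=0.0):
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    bins = np.arange(frame_len // 2 + 1)
    rp = np.empty((num_frames, len(bins)))
    for i in range(num_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        phase = np.angle(np.fft.rfft(frame))
        # Shift all phases so the base bin's phase equals `base_phase`,
        # scaling the correction proportionally to frequency.
        correction = (bins / base_bin) * (base_phase - phase[base_bin])
        rp[i] = np.angle(np.exp(1j * (phase + correction)))  # wrap to (-pi, pi]
    return rp

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).standard_normal(fs)
    print(relative_phase(x).shape)  # (num_frames, num_bins)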
Citations: 0
Deep semantic learning for acoustic scene classification
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2024-01-03, DOI: 10.1186/s13636-023-00323-5
Yun-Fei Shao, Xin-Xin Ma, Yong Ma, Wei-Qiang Zhang
Acoustic scene classification (ASC) is the process of identifying the acoustic environment or scene from which an audio signal is recorded. In this work, we propose an encoder-decoder-based approach to ASC, which is borrowed from SegNet in image semantic segmentation tasks. We also propose a novel feature normalization method named Mixup Normalization, which combines channel-wise instance normalization and the Mixup method to learn information useful for the scene and discard device-specific information. In addition, we propose an event extraction block, which can extract the accurate semantic segmentation region from the segmentation network, to imitate the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% on different devices in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset. The result indicates a minimum margin of 17% against the DCASE 2020 challenge Task 1A baseline system. It has lower complexity and higher performance compared with other state-of-the-art CNN models, without using any supplementary data other than the official challenge dataset.
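The abstract only names the two ingredients of Mixup Normalization; the snippet below is one plausible reading, not the authors' implementation: spectrograms are first instance-normalized per channel, and pairs of normalized examples and their labels are then mixed with a Beta-distributed weight, as in standard Mixup.

# One plausible reading of "instance normalization + Mixup" for spectrogram batches
# (batch, channels, freq, time); an illustration, not the paper's exact method.
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each example and channel over its own (freq, time) statistics."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    return (x - mean) / (std + eps)

def mixup(x, y, alpha=0.2, rng=None):
    """Mix examples and one-hot labels with a Beta(alpha, alpha) weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(x.shape[0])
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    specs = rng.standard_normal((4, 1, 64, 100))       # 4 log-mel spectrograms
    labels = np.eye(10)[rng.integers(0, 10, size=4)]   # 10 acoustic scene classes
    mixed_x, mixed_y = mixup(instance_norm(specs), labels, rng=rng)
    print(mixed_x.shape, mixed_y.shape)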
Citations: 0
Online distributed waveform-synchronization for acoustic sensor networks with dynamic topology
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2023-12-18, DOI: 10.1186/s13636-023-00311-9
Aleksej Chinaev, Niklas Knaepper, Gerald Enzner
Acoustic sensing by multiple devices connected in a wireless acoustic sensor network (WASN) creates new opportunities for multichannel signal processing. However, the autonomy of agents in such a network still necessitates the alignment of sensor signals to a common sampling rate. It has been demonstrated that waveform-based estimation of sampling rate offset (SRO) between any node pair can be retrieved from asynchronous signals already exchanged in the network, but connected online operation for network-wide distributed sampling-time synchronization still presents an open research task. This is especially true if the WASN experiences topology changes due to failure or appearance of nodes or connections. In this work, we rely on an online waveform-based closed-loop SRO estimation and compensation unit for node pairs. For WASNs hierarchically organized as a directed minimum spanning tree (MST), it is then shown how local synchronization propagates network-wide from the root node to the leaves. Moreover, we propose a network protocol for sustaining an existing network-wide synchronization in case of local topology changes. In doing so, the dynamic WASN maintains the MST topology after reorganization to support continued operation with minimum node distances. Experimental evaluation in a simulated apartment with several rooms proves the ability of our methods to reach and sustain accurate SRO estimation and compensation in dynamic WASNs.
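As a structural illustration of tree-based synchronization (a sketch of the general idea, not the authors' protocol), the snippet below builds a minimum spanning tree over invented pairwise link costs with Prim's algorithm; each resulting parent-child edge corresponds to one pairwise closed-loop SRO estimation and compensation unit, and the edge order is a valid root-to-leaf propagation order.

# Illustration of MST-based synchronization order in a WASN (not the authors' protocol).
# Pairwise link costs (e.g., inverse coherence or distance) are invented for the example.
import heapq
from collections import defaultdict

def prim_mst(costs, root):
    """costs: dict[(a, b)] = cost of an undirected link. Returns directed MST edges from root."""
    adj = defaultdict(list)
    for (a, b), c in costs.items():
        adj[a].append((c, b))
        adj[b].append((c, a))
    visited = {root}
    heap = [(c, root, nbr) for c, nbr in adj[root]]
    heapq.heapify(heap)
    edges = []
    while heap:
        c, parent, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        edges.append((parent, node, c))
        for nc, nbr in adj[node]:
            if nbr not in visited:
                heapq.heappush(heap, (nc, node, nbr))
    return edges

if __name__ == "__main__":
    link_costs = {("A", "B"): 1.0, ("A", "C"): 2.5, ("B", "C"): 1.2,
                  ("B", "D"): 2.0, ("C", "D"): 0.8}
    # Each (parent, child) pair is one pairwise SRO estimation/compensation loop;
    # synchronization propagates from the root toward the leaves in this order.
    for parent, child, cost in prim_mst(link_costs, root="A"):
        print(f"sync {child} to {parent} (link cost {cost})")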
Citations: 0
Signal processing and machine learning for speech and audio in acoustic sensor networks
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2023-12-17, DOI: 10.1186/s13636-023-00322-6
Walter Kellermann, Rainer Martin, Nobutaka Ono

Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders, among others. However, conventional multichannel signal processing methods usually cannot be applied to jointly process the signals recorded by multiple distributed devices because synchronous recording is essential. Thus, commercially available microphone array processing is currently limited to a single device where all microphones are mounted. The full exploitation of the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, whose potential practical and commercial benefits prompted significant research efforts over the past decade.

Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing to overcome the limitations of individual devices. Along with wireless communications between microphone nodes and addressing new challenges in handling asynchronous channels, unknown microphone positions, and distributed computing, the WASN enables us to spatially distribute many recording devices. These may cover a wider area and utilize the nodes to form an extended microphone array. It promises to significantly improve the performance of various audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.

For this special issue, six papers were accepted which all address the above-mentioned fundamental challenges when using WASNs: First, the question of which sensors should be used for a specific signal processing task or extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for its synchronization on waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai and the task of removing specific interference from a single sensor signal is tackled by Kawamura et al.

The paper ‘Microphone utility estimation in acoustic sensor networks using single-channel signal features’ by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. Thereby, the need for transmitting waveforms for identifying suitable sensors for a synchronized cluster of sensors is avoided and the required amount of transmitted data can be reduced by several orders of magnitude. It is shown that both estimation-theoretic processing of single-channel features and deep learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed sensors.
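To make the idea of feature-space coherence concrete, the sketch below (an illustration under assumptions, not the paper's estimator) compares two channels through a single-channel proxy feature, per-frame log energy, and uses the Pearson correlation of the two feature tracks as a cheap stand-in for waveform-level coherence; in a WASN, only these compact feature tracks would need to be exchanged.

# Illustrative feature-space proxy for coherence between two sensors: correlate
# per-frame log-energy tracks instead of exchanging waveforms (not the paper's estimator).
import numpy as np

def log_energy_track(x, frame_len=1024, hop=512):
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([10 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames])

def feature_coherence(x1, x2):
    f1, f2 = log_energy_track(x1), log_energy_track(x2)
    n = min(len(f1), len(f2))
    return np.corrcoef(f1[:n], f2[:n])[0, 1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    envelope = np.repeat(rng.random(16), 1000)           # shared amplitude modulation
    src = envelope * rng.standard_normal(16000)          # common source signal
    node_a = src + 0.05 * rng.standard_normal(16000)     # two nodes that pick up the source well
    node_b = src + 0.05 * rng.standard_normal(16000)
    node_c = 0.1 * src + rng.standard_normal(16000)      # node dominated by local noise
    print("coherent pair:", feature_coherence(node_a, node_b))
    print("weak pair:", feature_coherence(node_a, node_c))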

The paper by Grinstein et al. combines the signals of multiple microphones from a distributed array with information describing the acoustic properties of the scene, such as the microphone positions, the room dimensions, and the reverberation time, to improve sound source localization. Their dual-input neural network (DI-NN) is a simple and efficient technique for constructing neural networks that can process two different data types. They test it in different scenarios and compare it against other models, such as a classical least-squares approach and a convolutional recurrent neural network. Although the proposed DI-NN is not retrained for each new scenario, the authors' results demonstrate its superiority, achieving a substantial reduction of the localization error on both synthetic data and a dataset of real recordings.

For robust speaker counting and source separation, Hsu and Bai combine classical and learning-based methods to strengthen these tasks and to achieve robustness against unknown room impulse responses (RIRs) and array configurations. They propose a three-stage approach built on a spatial coherence matrix (SCM) computed from whitened relative transfer functions (wRTFs), which serve as spatial features of directional sources. Target speaker activity is detected by evaluating the SCM and a local coherence function. The eigenvalues of the SCM and the maximum similarity of the inter-frame global activity distributions between two speakers are then fed into a speaker counting network (SCnet), and a global and local activity-driven network (GLADnet) is used to extract each individual speaker signal.

The final paper, ‘Acoustic object canceller: removing a known signal from monaural recording using blind synchronization’ by Kawamura et al., addresses the removal of undesired interference from a single microphone signal when a reference signal for the interference is available. The proposed method treats the interference as an acoustic object whose signal is linearly filtered before reaching the recording microphone. Assuming that the signals of the acoustic object and the microphone exhibit different sampling rates, the signals are first synchronized, and the frequency response of the propagation path from the object to the microphone is then determined via maximum likelihood estimation using a majorization-minimization algorithm. Various statistical models for the desired signal to be preserved are investigated and evaluated.
{"title":"Signal processing and machine learning for speech and audio in acoustic sensor networks","authors":"Walter Kellermann, Rainer Martin, Nobutaka Ono","doi":"10.1186/s13636-023-00322-6","DOIUrl":"https://doi.org/10.1186/s13636-023-00322-6","url":null,"abstract":"<p>Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders, among others. However, conventional multichannel signal processing methods can usually not be applied to jointly process the signals recorded by multiple distributed devices because synchronous recording is essential. Thus, commercially available microphone array processing is currently limited to a single device where all microphones are mounted. The full exploitation of the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, whose potential practical and commercial benefits prompted significant research efforts over the past decade.</p><p>Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing to overcome the limitations of individual devices. Along with wireless communications between microphone nodes and addressing new challenges in handling asynchronous channels, unknown microphone positions, and distributed computing, the WASN enables us to spatially distribute many recording devices. These may cover a wider area and utilize the nodes to form an extended microphone array. It promises to significantly improve the performance of various audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.</p><p>For this special issue, six papers were accepted which all address the above-mentioned fundamental challenges when using WASNs: First, the question of which sensors should be used for a specific signal processing task or extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for its synchronization on waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai and the task of removing specific interference from a single sensor signal is tackled by Kawamura et al.</p><p>The paper ‘Microphone utility estimation in acoustic sensor networks using single-channel signal features’ by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. Thereby, the need for transmitting waveforms for identifying suitable sensors for a synchronized cluster of sensors is avoided and the required amount of transmitted data can be reduced by several orders of magnitude. 
It is shown that both estimation-theoretic processing of single-channel features and deep learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed se","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2023-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138717609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Lightweight target speaker separation network based on joint training
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2023-12-06, DOI: 10.1186/s13636-023-00317-3
Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi, Fang Liu
Target speaker separation aims to separate the speech components of the target speaker from mixed speech and remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, these existing methods generally face problems with system latency and performance upper limits due to the large model size. To solve these problems, this paper proposes improvements in the network structure and training methods to enhance the model’s performance. A lightweight target speaker separation network based on long short-term memory (LSTM) is proposed, which can reduce the model size and computational delay while maintaining the separation performance. Based on this, a target speaker separation method based on joint training is proposed to achieve the overall training and optimization of the target speaker separation system. Joint loss functions based on speaker registration and speaker separation are proposed for joint training of the network to further improve the system’s performance. The experimental results show that the lightweight target speaker separation network proposed in this paper has better performance while being lightweight, and joint training of the target speaker separation network with our proposed loss function can further improve the separation performance of the original model.
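The abstract does not spell out the joint loss; the snippet below is a sketch, under assumptions, of how a separation term and a speaker-registration term can be combined: a negative SI-SNR term for the separated waveform plus a weighted cosine distance between a speaker embedding of the output and the enrolled speaker embedding. The weighting and the embeddings are placeholders, not the paper's actual formulation.

# Sketch of a joint loss combining separation quality (negative SI-SNR) with speaker
# consistency (cosine distance between embeddings). The weighting and the embedding
# vectors are assumptions for illustration, not the paper's actual loss.
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

def cosine_distance(a, b, eps=1e-8):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def joint_loss(est_wave, ref_wave, est_embedding, enrolled_embedding, weight=0.5):
    """Lower is better: poor separation and embedding mismatch both raise the loss."""
    return -si_snr(est_wave, ref_wave) + weight * cosine_distance(est_embedding, enrolled_embedding)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.standard_normal(16000)
    est = ref + 0.1 * rng.standard_normal(16000)                 # reasonably clean separation
    emb_est, emb_enrolled = rng.standard_normal(192), rng.standard_normal(192)
    print(joint_loss(est, ref, emb_est, emb_enrolled))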
Citations: 0
Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2023-12-05, DOI: 10.1186/s13636-023-00315-5
Pierre-Amaury Grumiaux, Mathieu Lagrange
The task of bandwidth extension addresses the generation of missing high frequencies of audio signals based on knowledge of the low-frequency part of the sound. This task applies to various problems, such as audio coding or audio restoration. In this article, we focus on efficient bandwidth extension of monophonic and polyphonic musical signals using a differentiable digital signal processing (DDSP) model. Such a model is composed of a neural network part with relatively few parameters trained to infer the parameters of a differentiable digital signal processing model, which efficiently generates the output full-band audio signal. We first address bandwidth extension of monophonic signals, and then propose two methods to explicitly handle polyphonic signals. The benefits of the proposed models are first demonstrated on monophonic and polyphonic synthetic data against a baseline and a deep-learning-based ResNet model. The models are next evaluated on recorded monophonic and polyphonic data, for a wide variety of instruments and musical genres. We show that all proposed models surpass a higher complexity deep learning model for an objective metric computed in the frequency domain. A MUSHRA listening test confirms the superiority of the proposed approach in terms of perceptual quality.
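As background on the harmonic-plus-noise signal model that DDSP-style synthesis builds on, the sketch below generates a sum of harmonics of a time-varying fundamental plus filtered noise. All parameter values are invented for the demo; in a DDSP model the per-frame amplitudes, harmonic distribution, and noise filter would be predicted by the neural network part.

# Minimal harmonic-plus-noise synthesizer in the spirit of DDSP: a sum of harmonics of f0
# plus noise shaped by a crude low-pass filter. Parameter values are invented for the demo.
import numpy as np

def harmonic_plus_noise(f0, harmonic_amps, noise_gain, sr=16000):
    """f0: per-sample fundamental (Hz); harmonic_amps: (num_harmonics,) global amplitudes."""
    n = len(f0)
    phase = 2 * np.pi * np.cumsum(f0) / sr
    harmonic = np.zeros(n)
    for k, amp in enumerate(harmonic_amps, start=1):
        partial = amp * np.sin(k * phase)
        partial[k * f0 > sr / 2] = 0.0          # drop partials above Nyquist
        harmonic += partial
    noise = noise_gain * np.convolve(np.random.default_rng(0).standard_normal(n),
                                     np.ones(8) / 8, mode="same")
    return harmonic + noise

if __name__ == "__main__":
    sr = 16000
    f0 = np.linspace(110.0, 220.0, sr)          # one-second upward glide
    amps = np.array([1.0, 0.5, 0.25, 0.125])    # decaying harmonic amplitudes
    y = harmonic_plus_noise(f0, amps, noise_gain=0.05, sr=sr)
    print(y.shape, float(np.max(np.abs(y))))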
Citations: 0
Piano score rearrangement into multiple difficulty levels via notation-to-notation approach
IF 2.4, CAS Tier 3 (Computer Science), Q2 Physics and Astronomy, Pub Date: 2023-12-05, DOI: 10.1186/s13636-023-00321-7
Masahiro Suzuki
Musical score rearrangement is an emerging area in symbolic music processing, which aims to transform a musical score into a different style. This study focuses on the task of changing the playing difficulty of piano scores, addressing two challenges in musical score rearrangement. First, we address the challenge of handling musical notation on scores. While symbolic music research often relies on note-level (MIDI-equivalent) information, musical scores contain notation that cannot be adequately represented at the note level. We propose an end-to-end framework that utilizes tokenized representations of notation to directly rearrange musical scores at the notation level. We also propose the ST+ representation, which includes a novel structure and token types for better score rearrangement. Second, we address the challenge of rearranging musical scores across multiple difficulty levels. We introduce a difficulty conditioning scheme to train a single sequence model capable of handling various difficulty levels, while leveraging scores from various levels in model training. We collect commercial-quality pop piano scores at four difficulty levels and train a MEGA model (with 0.3M parameters) to rearrange between these levels. Objective evaluation shows that our method successfully rearranges piano scores into other three difficulty levels, achieving comparable difficulty to human-made scores. Additionally, our method successfully generates musical notation including articulations. Subjective evaluation (by score experts and musicians) also reveals that our generated scores generally surpass the quality of previous rule-based or note-level methods on several criteria. Our framework enables novel notation-to-notation processing of scores and can be applied to various score rearrangement tasks.
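The tokenization and conditioning scheme are only summarized above; the snippet below is a hypothetical sketch of difficulty conditioning, prepending a difficulty token to a tokenized score so that a single sequence model can be trained across several levels and steered between them at inference time. All token names are invented and do not reflect the paper's ST+ vocabulary.

# Hypothetical sketch of difficulty conditioning for a score-to-score sequence model:
# a difficulty token is prepended to the source token sequence so one model can be
# trained on, and steered between, several levels. All token names are invented.
DIFFICULTY_TOKENS = {"beginner": "<LV1>", "easy": "<LV2>", "intermediate": "<LV3>", "advanced": "<LV4>"}

def build_conditioned_input(score_tokens, target_difficulty):
    """Prepend the target-difficulty token to a tokenized score."""
    return [DIFFICULTY_TOKENS[target_difficulty]] + list(score_tokens)

if __name__ == "__main__":
    # A toy notation-level token sequence (invented vocabulary, not the paper's ST+ format).
    source = ["<BAR>", "note_C5_quarter", "note_E5_quarter", "chord_CEG_half", "<BAR>"]
    print(build_conditioned_input(source, "easy"))
    # At inference, changing the difficulty token requests a different rearrangement:
    print(build_conditioned_input(source, "advanced"))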
Citations: 0