Latest articles from Eurasip Journal on Audio Speech and Music Processing

Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices.
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2021-01-01 | Epub Date: 2021-02-03 | DOI: 10.1186/s13636-020-00194-0
Rajat Hebbar, Pavlos Papadopoulos, Ramon Reyes, Alexander F Danvers, Angelina J Polsinelli, Suzanne A Moseley, David A Sbarra, Matthias R Mehl, Shrikanth Narayanan

In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to the availability of large open-source datasets and greater computational resources. However, these methods often fail to generalize well to real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection challenging. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models from annotations available at a lower time resolution (coarse labels). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground detection.
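As a rough illustration of the MIL setup described in this abstract, where instance-level (frame) scores are pooled into a bag-level prediction so that only coarse segment labels are needed for training, the sketch below uses attention pooling; the pooling choice, feature dimensions, and frame counts are illustrative assumptions, not the authors' configuration.

```python
# Minimal MIL sketch for coarsely labeled audio. Dimensions and the
# attention-pooling choice are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class MILPooling(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)  # per-frame (instance) logit
        self.attention = nn.Linear(feat_dim, 1)        # attention weights for pooling

    def forward(self, bag):
        # bag: (num_instances, feat_dim), e.g. frame-level embeddings from one
        # coarsely labeled audio segment.
        logits = self.instance_scorer(bag).squeeze(-1)            # instance-level scores
        weights = torch.softmax(self.attention(bag).squeeze(-1), dim=0)
        bag_logit = (weights * logits).sum()                      # pooled bag-level score
        return bag_logit, torch.sigmoid(logits)                   # bag score + frame-level localization

# Usage: train with bag labels only; read the instance probabilities to
# localize foreground speech inside a coarsely labeled segment.
model = MILPooling()
bag = torch.randn(100, 128)          # 100 frames of hypothetical 128-dim features
bag_logit, frame_probs = model(bag)
loss = nn.functional.binary_cross_entropy_with_logits(bag_logit, torch.tensor(1.0))
```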

Citations: 9
Articulation constrained learning with application to speech emotion recognition.
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2019-01-01 | Epub Date: 2019-08-20 | DOI: 10.1186/s13636-019-0157-9
Mohit Shah, Ming Tu, Visar Berisha, Chaitali Chakrabarti, Andreas Spanias

Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. Collecting articulatory data on a large scale may not be feasible in many scenarios, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that require the model to reconstruct articulatory data. This leads to sparse, interpretable representations that are jointly optimized for both tasks. Furthermore, the model requires articulatory features only during training; only speech features are needed for inference on out-of-sample data. Experiments evaluate emotion recognition performance over the vowels /AA/, /AE/, /IY/, /UW/ and over complete utterances. Incorporating articulatory information significantly improves performance for valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
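A minimal sketch of the kind of joint objective described here, an ℓ1-regularized logistic-regression loss plus a term that reconstructs articulatory trajectories from acoustic features, is given below; the penalty weights, dimensions, and squared-error reconstruction term are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a joint objective: l1-regularized logistic regression for emotion
# classification plus a reconstruction constraint mapping acoustic features to
# articulatory features. lam_l1 / lam_art and all dimensions are illustrative.
import torch

def joint_loss(W_cls, W_art, X_acoustic, y_emotion, X_articulatory,
               lam_l1=0.01, lam_art=0.1):
    # Classification term: logistic regression on acoustic features.
    logits = X_acoustic @ W_cls
    cls_loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y_emotion)
    # Reconstruction term: articulatory data is only needed during training.
    recon = X_acoustic @ W_art                  # predicted articulatory trajectories
    art_loss = torch.mean((recon - X_articulatory) ** 2)
    # Sparsity-inducing l1 penalty keeps the acoustic weights interpretable.
    l1_pen = W_cls.abs().sum()
    return cls_loss + lam_art * art_loss + lam_l1 * l1_pen

# At inference only X_acoustic and W_cls are used; W_art can be discarded.
n, d_ac, d_art = 32, 60, 12
W_cls = torch.randn(d_ac, requires_grad=True)
W_art = torch.randn(d_ac, d_art, requires_grad=True)
loss = joint_loss(W_cls, W_art, torch.randn(n, d_ac),
                  torch.randint(0, 2, (n,)).float(), torch.randn(n, d_art))
loss.backward()
```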

Citations: 9
From raw audio to a seamless mix: creating an automated DJ system for Drum and Bass
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2018-09-24 | DOI: 10.1186/s13636-018-0134-8
Len Vande Veire, Tijl De Bie
{"title":"From raw audio to a seamless mix: creating an automated DJ system for Drum and Bass","authors":"Len Vande Veire, Tijl De Bie","doi":"10.1186/s13636-018-0134-8","DOIUrl":"https://doi.org/10.1186/s13636-018-0134-8","url":null,"abstract":"","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"98 1","pages":""},"PeriodicalIF":2.4,"publicationDate":"2018-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73628910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Biomimetic spectro-temporal features for music instrument recognition in isolated notes and solo phrases.
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2015-01-01 | DOI: 10.1186/s13636-015-0070-9
Kailash Patil, Mounya Elhilali

The identity of musical instruments is reflected in the acoustic attributes of the notes played on them. Recently, it has been argued that these characteristics of musical identity (or timbre) are best captured through an analysis that spans both time and frequency, with a focus on the modulations or changes of the signal in the spectrotemporal space. This representation mimics the spectrotemporal receptive field (STRF) analysis believed to underlie processing in the central mammalian auditory system, particularly at the level of primary auditory cortex. How well this STRF representation captures the timbral identity of musical instruments in continuous solo recordings remains unclear. The current work investigates the applicability of the STRF feature space for instrument recognition in solo musical phrases and explores how best to leverage knowledge from isolated musical notes for instrument recognition in solo recordings. The study presents an approach for parsing solo performances into their individual note constituents and adapting back-end support vector machine classifiers to generalize instrument recognition to off-the-shelf, commercially available solo music.
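A crude stand-in for the spectro-temporal modulation analysis with an SVM back end sketched above appears below: the 2-D Fourier magnitude of a log-mel spectrogram patch serves as a simplified proxy for an STRF-style modulation representation (the actual STRF analysis uses a bank of modulation-tuned filters), and all dimensions and the commented usage are illustrative assumptions.

```python
# Simplified spectro-temporal modulation features + SVM back end.
# Not the paper's STRF front end; a 2-D FFT of a log-mel patch is used as a
# rough proxy for joint spectral ("scale") and temporal ("rate") modulations.
import numpy as np
import librosa
from sklearn.svm import SVC

def modulation_features(y, sr, n_mels=64, n_frames=128):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_S = librosa.power_to_db(S)
    # Crop or zero-pad to a fixed number of frames so every note yields the
    # same feature dimensionality (an illustrative simplification).
    if log_S.shape[1] < n_frames:
        log_S = np.pad(log_S, ((0, 0), (0, n_frames - log_S.shape[1])))
    log_S = log_S[:, :n_frames]
    mod = np.abs(np.fft.fft2(log_S))    # modulation magnitude spectrum
    return mod.flatten()

# Hypothetical usage with pre-segmented notes and instrument labels:
# X = np.stack([modulation_features(note, sr) for note in notes])
# clf = SVC(kernel="rbf").fit(X, instrument_labels)
```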

Citations: 18
Biomimetic multi-resolution analysis for robust speaker recognition.
IF 2.4 | CAS Tier 3, Computer Science | Q2 ACOUSTICS | Pub Date: 2012-01-01 | Epub Date: 2012-09-07 | DOI: 10.1186/1687-4722-2012-22
Sridhar Krishna Nemala, Dmitry N Zotkin, Ramani Duraiswami, Mounya Elhilali

Humans exhibit a remarkable ability to reliably classify sound sources in the environment, even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted by channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically motivated multi-resolution speaker information representation obtained by an intricate yet computationally efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.
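One way to picture a multi-resolution spectro-temporal analysis of this kind is to band-pass filter each mel channel's envelope at several temporal modulation rates and summarize the resulting energies; the rates, filter order, sample-rate assumption, and pooling below are illustrative choices rather than the paper's actual front end.

```python
# Sketch of multi-resolution temporal-modulation energies per utterance.
# Assumes roughly 16 kHz audio so the chosen rates stay below the envelope
# Nyquist frequency; all parameters are illustrative.
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def multires_modulation_energies(y, sr, rates_hz=(2, 4, 8), n_mels=40, hop_length=512):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    env = librosa.power_to_db(S)                 # (n_mels, n_frames) channel envelopes
    frame_rate = sr / hop_length                 # envelope sampling rate in Hz
    feats = []
    for rate in rates_hz:
        # Band around each modulation rate (an illustrative choice of width).
        lo, hi = 0.5 * rate, 1.5 * rate
        b, a = butter(2, [lo / (frame_rate / 2), hi / (frame_rate / 2)], btype="band")
        filtered = filtfilt(b, a, env, axis=1)   # zero-phase band-pass per channel
        feats.append(np.log(np.mean(filtered ** 2, axis=1) + 1e-8))
    return np.concatenate(feats)                 # one fixed-length vector per utterance
```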

Citations: 5