Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models
Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan
Automatic speech recognition (ASR) systems are known to be vulnerable to adversarial attacks. This paper addresses detection and defence against targeted white-box attacks on speech signals for ASR systems. While existing work has utilised diffusion models (DMs) to purify adversarial examples, achieving state-of-the-art results in keyword spotting tasks, their effectiveness for more complex tasks such as sentence-level ASR remains unexplored. Additionally, the impact of the number of forward diffusion steps on performance is not well understood. In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. Through comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate that two forward diffusion steps can completely defend against adversarial attacks on sentences. Moreover, we introduce a novel, training-free approach for detecting adversarial attacks by leveraging a pre-trained DM. Our experimental results show that this method can detect adversarial attacks with high accuracy.
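A minimal sketch of the purification defence described above is given below: run a small number of forward diffusion steps on the incoming waveform, then denoise it with the reverse chain before handing the audio to the ASR model. The `model(x, t)` noise predictor and the linear beta schedule are illustrative assumptions, not the authors' exact setup.

    import torch

    def purify(x, model, n_forward=2, betas=None):
        """Adversarial purification sketch: add n_forward diffusion steps of
        Gaussian noise, then run the reverse (denoising) chain back to t=0."""
        if betas is None:
            betas = torch.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)

        # Forward diffusion: jump directly to step n_forward using the closed form
        t = n_forward - 1
        noise = torch.randn_like(x)
        x_t = alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * noise

        # Reverse diffusion: standard ancestral sampling back to step 0
        for s in reversed(range(n_forward)):
            eps = model(x_t, torch.tensor([s]))           # predicted noise (assumed interface)
            coef = betas[s] / (1 - alpha_bar[s]).sqrt()
            mean = (x_t - coef * eps) / alphas[s].sqrt()
            if s > 0:
                x_t = mean + betas[s].sqrt() * torch.randn_like(x_t)
            else:
                x_t = mean
        return x_t  # purified waveform, passed on to the ASR system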
{"title":"Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models","authors":"Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan","doi":"arxiv-2409.07936","DOIUrl":"https://doi.org/arxiv-2409.07936","url":null,"abstract":"Automatic speech recognition (ASR) systems are known to be vulnerable to\u0000adversarial attacks. This paper addresses detection and defence against\u0000targeted white-box attacks on speech signals for ASR systems. While existing\u0000work has utilised diffusion models (DMs) to purify adversarial examples,\u0000achieving state-of-the-art results in keyword spotting tasks, their\u0000effectiveness for more complex tasks such as sentence-level ASR remains\u0000unexplored. Additionally, the impact of the number of forward diffusion steps\u0000on performance is not well understood. In this paper, we systematically\u0000investigate the use of DMs for defending against adversarial attacks on\u0000sentences and examine the effect of varying forward diffusion steps. Through\u0000comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate\u0000that two forward diffusion steps can completely defend against adversarial\u0000attacks on sentences. Moreover, we introduce a novel, training-free approach\u0000for detecting adversarial attacks by leveraging a pre-trained DM. Our\u0000experimental results show that this method can detect adversarial attacks with\u0000high accuracy.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
In Mandarin, the tonal contours of monosyllabic words produced in isolation or in careful speech are characterized by four lexical tones: a high-level tone (T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However, in spontaneous speech, the actual tonal realization of monosyllabic words can deviate significantly from these canonical tones due to intra-syllabic co-articulation and inter-syllabic co-articulation with adjacent tones. In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on their research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and on the way in which words' meanings co-determine pitch contours on the other. We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus, using generalized additive (mixed) models to decompose a given observed pitch contour into a set of component pitch contours. We show that the tonal context substantially modifies a word's canonical tone. Once the effect of tonal context is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a high tone and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is realized based on the preceding tone, emerges as a low tone in its own right, modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. We also show that word, and even more so word sense, co-determine words' F0 contours. Analyses of variable importance using random forests further support the substantial effect of tonal context and an effect of word sense.
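The decomposition of an observed F0 contour into additive component contours can be illustrated with a deliberately simplified least-squares sketch. It stands in for the generalized additive model the authors actually fit (which uses far richer smooths, tensor products, and random effects); the Gaussian bump basis and all variable names here are assumptions for illustration.

    import numpy as np

    def bump_basis(t, n_bumps=8, width=0.08):
        """Gaussian bump basis over normalised time t in [0, 1]."""
        t = np.asarray(t, dtype=float)
        centres = np.linspace(0, 1, n_bumps)
        return np.exp(-0.5 * ((t[:, None] - centres[None, :]) / width) ** 2)

    def fit_additive_contours(t, f0, tone_id, context_id, n_tones, n_contexts):
        """Fit f0(t) ~ smooth(tone) + smooth(tonal context) by least squares,
        so each categorical predictor contributes its own component contour."""
        B = bump_basis(t)                                   # (n_obs, n_bumps)
        n_b = B.shape[1]
        X = np.zeros((len(f0), n_b * (n_tones + n_contexts)))
        for i in range(len(f0)):
            X[i, tone_id[i] * n_b:(tone_id[i] + 1) * n_b] = B[i]
            off = n_b * n_tones
            X[i, off + context_id[i] * n_b:off + (context_id[i] + 1) * n_b] = B[i]
        coef, *_ = np.linalg.lstsq(X, f0, rcond=None)
        tone_curves = coef[:n_b * n_tones].reshape(n_tones, n_b)
        context_curves = coef[n_b * n_tones:].reshape(n_contexts, n_b)
        grid = np.linspace(0, 1, 50)
        G = bump_basis(grid)
        return G @ tone_curves.T, G @ context_curves.T      # component contours on a grid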
{"title":"A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin","authors":"Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen","doi":"arxiv-2409.07891","DOIUrl":"https://doi.org/arxiv-2409.07891","url":null,"abstract":"In Mandarin, the tonal contours of monosyllabic words produced in isolation\u0000or in careful speech are characterized by four lexical tones: a high-level tone\u0000(T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However,\u0000in spontaneous speech, the actual tonal realization of monosyllabic words can\u0000deviate significantly from these canonical tones due to intra-syllabic\u0000co-articulation and inter-syllabic co-articulation with adjacent tones. In\u0000addition, Chuang et al. (2024) recently reported that the tonal contours of\u0000disyllabic Mandarin words with T2-T4 tone pattern are co-determined by their\u0000meanings. Following up on their research, we present a corpus-based\u0000investigation of how the pitch contours of monosyllabic words are realized in\u0000spontaneous conversational Mandarin, focusing on the effects of contextual\u0000predictors on the one hand, and the way in words' meanings co-determine pitch\u0000contours on the other hand. We analyze the F0 contours of 3824 tokens of 63\u0000different word types in a spontaneous Taiwan Mandarin corpus, using the\u0000generalized additive (mixed) model to decompose a given observed pitch contour\u0000into a set of component pitch contours. We show that the tonal context\u0000substantially modify a word's canonical tone. Once the effect of tonal context\u0000is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a\u0000high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0),\u0000which in standard descriptions, is realized based on the preceding tone,\u0000emerges as a low tone in its own right, modified by the other predictors in the\u0000same way as the standard tones T1, T2, T3, and T4. We also show that word, and\u0000even more so, word sense, co-determine words' F0 contours. Analyses of variable\u0000importance using random forests further supported the substantial effect of\u0000tonal context and an effect of word sense.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
Beilong Tang, Bang Zeng, Ming Li
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
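Framing token prediction as classification, as described above, reduces the training objective to per-frame cross-entropy over a discrete vocabulary. The sketch below shows that loss for a single discretized layer; the tensor shapes and vocabulary size are assumptions for illustration rather than TSELM's actual configuration.

    import torch
    import torch.nn.functional as F

    def token_prediction_loss(logits, target_tokens):
        """logits: (batch, time, vocab) scores predicted by the language model.
        target_tokens: (batch, time) indices of the discretized WavLM units.
        Cross-entropy turns audio generation into per-frame classification."""
        batch, time, vocab = logits.shape
        return F.cross_entropy(logits.reshape(batch * time, vocab),
                               target_tokens.reshape(batch * time))

    # Example with random data standing in for model outputs
    logits = torch.randn(2, 100, 1024)            # assumed vocabulary of 1024 units
    targets = torch.randint(0, 1024, (2, 100))
    loss = token_prediction_loss(logits, targets)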
{"title":"TSELM: Target Speaker Extraction using Discrete Tokens and Language Models","authors":"Beilong Tang, Bang Zeng, Ming Li","doi":"arxiv-2409.07841","DOIUrl":"https://doi.org/arxiv-2409.07841","url":null,"abstract":"We propose TSELM, a novel target speaker extraction network that leverages\u0000discrete tokens and language models. TSELM utilizes multiple discretized layers\u0000from WavLM as input tokens and incorporates cross-attention mechanisms to\u0000integrate target speaker information. Language models are employed to capture\u0000the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the\u0000audio from the tokens. By applying a cross-entropy loss, TSELM models the\u0000probability distribution of output tokens, thus converting the complex\u0000regression problem of audio generation into a classification task. Experimental\u0000results show that TSELM achieves excellent results in speech quality and\u0000comparable results in speech intelligibility.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations
Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu
This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on over 10,000 hours of singing and user feedback revealed our model significantly improves sound quality and timbre accuracy, aligning with our objectives and advancing voice conversion technology. Furthermore, this research advances zero-shot SVC and sets the stage for future work on discrete speech representation, emphasizing the preservation of rhyme.
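A clustering-based phoneme-like representation of the kind mentioned above is commonly obtained by running k-means over frame-level features from a speech front-end and replacing each frame with its cluster index, which discards much of the timbre. The sketch below illustrates the idea with MFCC features and scikit-learn; the feature choice and number of clusters are assumptions, not the paper's recipe.

    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    def clustered_units(wav_paths, n_units=128, sr=16000):
        """Fit k-means over frame-level features pooled across recordings and
        return a function mapping a waveform to a sequence of unit indices."""
        frames = []
        for path in wav_paths:
            y, _ = librosa.load(path, sr=sr)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # (frames, 20)
            frames.append(mfcc)
        km = KMeans(n_clusters=n_units, n_init=10).fit(np.concatenate(frames))

        def encode(y):
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T
            return km.predict(mfcc)                                # content-like unit ids
        return encode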
{"title":"Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations","authors":"Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu","doi":"arxiv-2409.08039","DOIUrl":"https://doi.org/arxiv-2409.08039","url":null,"abstract":"This study presents an innovative Zero-Shot any-to-any Singing Voice\u0000Conversion (SVC) method, leveraging a novel clustering-based phoneme\u0000representation to effectively separate content, timbre, and singing style. This\u0000approach enables precise voice characteristic manipulation. We discovered that\u0000datasets with fewer recordings per artist are more susceptible to timbre\u0000leakage. Extensive testing on over 10,000 hours of singing and user feedback\u0000revealed our model significantly improves sound quality and timbre accuracy,\u0000aligning with our objectives and advancing voice conversion technology.\u0000Furthermore, this research advances zero-shot SVC and sets the stage for future\u0000work on discrete speech representation, emphasizing the preservation of rhyme.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Symbolic Pop Music Generation with Graph Neural Networks
Wen Qing Lim, Jinhua Liang, Huan Zhang
Music is inherently made up of complex structures, and representing them as graphs helps to capture multiple levels of relationships. While music generation has been explored using various deep generative techniques, research on graph-based music generation is sparse. Earlier graph-based work addressed only melody generation, and recent works that generate polyphonic music do not account for longer-term structure. In this paper, we explore a multi-graph approach to represent both the rhythmic patterns and the phrase structure of Chinese pop music. Consequently, we propose a two-step approach that aims to generate polyphonic music with coherent rhythm and long-term structure. We train two Variational Auto-Encoder networks - one on a MIDI dataset to generate 4-bar phrases, and another on song structure labels to generate full song structure. Our work shows that the models are able to learn most of the structural nuances in the training dataset, including chord and pitch frequency distributions, and phrase attributes.
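Both stages described above rest on the standard VAE mechanics of encoding to a latent Gaussian, sampling with the reparameterisation trick, and decoding. The sketch below shows that core computation for a generic phrase autoencoder; the layer sizes and the flattened input representation are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class PhraseVAE(nn.Module):
        """Minimal VAE core: encode a flattened 4-bar phrase, sample a latent,
        decode it back. Reconstruction + KL terms form the training loss."""
        def __init__(self, input_dim=512, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            self.to_mu = nn.Linear(256, latent_dim)
            self.to_logvar = nn.Linear(256, latent_dim)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, input_dim))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
            recon = self.decoder(z)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return recon, kl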
{"title":"Hierarchical Symbolic Pop Music Generation with Graph Neural Networks","authors":"Wen Qing Lim, Jinhua Liang, Huan Zhang","doi":"arxiv-2409.08155","DOIUrl":"https://doi.org/arxiv-2409.08155","url":null,"abstract":"Music is inherently made up of complex structures, and representing them as\u0000graphs helps to capture multiple levels of relationships. While music\u0000generation has been explored using various deep generation techniques, research\u0000on graph-related music generation is sparse. Earlier graph-based music\u0000generation worked only on generating melodies, and recent works to generate\u0000polyphonic music do not account for longer-term structure. In this paper, we\u0000explore a multi-graph approach to represent both the rhythmic patterns and\u0000phrase structure of Chinese pop music. Consequently, we propose a two-step\u0000approach that aims to generate polyphonic music with coherent rhythm and\u0000long-term structure. We train two Variational Auto-Encoder networks - one on a\u0000MIDI dataset to generate 4-bar phrases, and another on song structure labels to\u0000generate full song structure. Our work shows that the models are able to learn\u0000most of the structural nuances in the training dataset, including chord and\u0000pitch frequency distributions, and phrase attributes.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings
Tanisha Hisariya, Huan Zhang, Jinhua Liang
Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual arts, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, pairing paintings with corresponding music for effective training and evaluation. Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data. Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, with audio-emotion text similarity confirmed by the pre-trained CLAP model to demonstrate high alignment between generated music and text. This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.
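Fréchet Audio Distance, one of the metrics listed above, compares Gaussian fits to the embedding distributions of reference and generated audio. A minimal sketch of the computation follows, assuming the embeddings have already been extracted; the embedding model itself is not assumed here.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_audio_distance(ref_emb, gen_emb):
        """ref_emb, gen_emb: (n_clips, dim) arrays of audio embeddings.
        FAD is the Frechet distance between Gaussians fitted to each set."""
        mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
        cov_r = np.cov(ref_emb, rowvar=False)
        cov_g = np.cov(gen_emb, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g).real        # discard tiny imaginary parts
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))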
{"title":"Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings","authors":"Tanisha Hisariya, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07827","DOIUrl":"https://doi.org/arxiv-2409.07827","url":null,"abstract":"Rapid advancements in artificial intelligence have significantly enhanced\u0000generative tasks involving music and images, employing both unimodal and\u0000multimodal approaches. This research develops a model capable of generating\u0000music that resonates with the emotions depicted in visual arts, integrating\u0000emotion labeling, image captioning, and language models to transform visual\u0000inputs into musical compositions. Addressing the scarcity of aligned art and\u0000music data, we curated the Emotion Painting Music Dataset, pairing paintings\u0000with corresponding music for effective training and evaluation. Our dual-stage\u0000framework converts images to text descriptions of emotional content and then\u0000transforms these descriptions into music, facilitating efficient learning with\u0000minimal data. Performance is evaluated using metrics such as Fr'echet Audio\u0000Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL\u0000divergence, with audio-emotion text similarity confirmed by the pre-trained\u0000CLAP model to demonstrate high alignment between generated music and text. This\u0000synthesis tool bridges visual art and music, enhancing accessibility for the\u0000visually impaired and opening avenues in educational and therapeutic\u0000applications by providing enriched multi-sensory experiences.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification
Soufiyan Bahadi, Eric Plourde, Jean Rouat
Researchers are exploring novel computational paradigms such as sparse coding and neuromorphic computing to bridge the efficiency gap between the human brain and conventional computers in complex tasks. A key area of focus is neuromorphic audio processing. While the Locally Competitive Algorithm has emerged as a promising solution for sparse coding, offering potential for real-time and low-power processing on neuromorphic hardware, its applications in neuromorphic speech classification have not been thoroughly studied. The Adaptive Locally Competitive Algorithm builds upon the Locally Competitive Algorithm by dynamically adjusting the modulation parameters of the filter bank to fine-tune the filters' sensitivity. This adaptability enhances lateral inhibition, improving reconstruction quality, sparsity, and convergence time, which is crucial for real-time applications. This paper demonstrates the potential of the Locally Competitive Algorithm and its adaptive variant as robust feature extractors for neuromorphic speech classification. Results show that the Locally Competitive Algorithm achieves better speech classification accuracy at the expense of higher power consumption compared to the LAUSCHER cochlea model used for benchmarking. On the other hand, the Adaptive Locally Competitive Algorithm mitigates this power consumption issue without compromising the accuracy. The dynamic power consumption is reduced to a range of 0.004 to 13 milliwatts on neuromorphic hardware, three orders of magnitude less than setups using Graphics Processing Units. These findings position the Adaptive Locally Competitive Algorithm as a compelling solution for efficient speech classification systems, promising substantial advancements in balancing speech classification accuracy and power efficiency.
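The Locally Competitive Algorithm referenced above performs sparse coding by letting neuron potentials evolve under lateral inhibition and reading out thresholded activations. The sketch below shows the standard LCA update with a soft threshold on a fixed dictionary; the step size and threshold are illustrative assumptions, and the paper's adaptive filter-bank modulation is not reproduced here.

    import numpy as np

    def lca_sparse_code(signal, dictionary, lam=0.1, step=0.01, n_steps=200):
        """signal: (d,) input frame; dictionary: (d, n_neurons), unit-norm columns.
        Returns a sparse code approximately minimising ||signal - D a||^2 + lam*||a||_1."""
        b = dictionary.T @ signal                                         # driving input
        gram = dictionary.T @ dictionary - np.eye(dictionary.shape[1])    # lateral inhibition
        u = np.zeros_like(b)                                              # membrane potentials
        for _ in range(n_steps):
            a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)             # soft threshold
            u += step * (b - u - gram @ a)                                # LCA dynamics
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)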
{"title":"Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification","authors":"Soufiyan Bahadi, Eric Plourde, Jean Rouat","doi":"arxiv-2409.08188","DOIUrl":"https://doi.org/arxiv-2409.08188","url":null,"abstract":"Researchers are exploring novel computational paradigms such as sparse coding\u0000and neuromorphic computing to bridge the efficiency gap between the human brain\u0000and conventional computers in complex tasks. A key area of focus is\u0000neuromorphic audio processing. While the Locally Competitive Algorithm has\u0000emerged as a promising solution for sparse coding, offering potential for\u0000real-time and low-power processing on neuromorphic hardware, its applications\u0000in neuromorphic speech classification have not been thoroughly studied. The\u0000Adaptive Locally Competitive Algorithm builds upon the Locally Competitive\u0000Algorithm by dynamically adjusting the modulation parameters of the filter bank\u0000to fine-tune the filters' sensitivity. This adaptability enhances lateral\u0000inhibition, improving reconstruction quality, sparsity, and convergence time,\u0000which is crucial for real-time applications. This paper demonstrates the\u0000potential of the Locally Competitive Algorithm and its adaptive variant as\u0000robust feature extractors for neuromorphic speech classification. Results show\u0000that the Locally Competitive Algorithm achieves better speech classification\u0000accuracy at the expense of higher power consumption compared to the LAUSCHER\u0000cochlea model used for benchmarking. On the other hand, the Adaptive Locally\u0000Competitive Algorithm mitigates this power consumption issue without\u0000compromising the accuracy. The dynamic power consumption is reduced to a range\u0000of 0.004 to 13 milliwatts on neuromorphic hardware, three orders of magnitude\u0000less than setups using Graphics Processing Units. These findings position the\u0000Adaptive Locally Competitive Algorithm as a compelling solution for efficient\u0000speech classification systems, promising substantial advancements in balancing\u0000speech classification accuracy and power efficiency.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dark Experience for Incremental Keyword Spotting
Tianyi Peng, Yang Xiao
Spoken keyword spotting (KWS) is crucial for identifying keywords within audio inputs and is widely used in applications like Apple Siri and Google Home, particularly on edge devices. Current deep learning-based KWS systems, which are typically trained on a limited set of keywords, can suffer from performance degradation when encountering new domains, a challenge often addressed through few-shot fine-tuning. However, this adaptation frequently leads to catastrophic forgetting, where the model's performance on original data deteriorates. Progressive continual learning (CL) strategies have been proposed to overcome this, but they face limitations such as the need for task-ID information and increased storage, making them less practical for lightweight devices. To address these challenges, we introduce Dark Experience for Keyword Spotting (DE-KWS), a novel CL approach that leverages dark knowledge to distill past experiences throughout the training process. DE-KWS combines rehearsal and distillation, using both ground truth labels and logits stored in a memory buffer to maintain model performance across tasks. Evaluations on the Google Speech Command dataset show that DE-KWS outperforms existing CL baselines in average accuracy without increasing model size, offering an effective solution for resource-constrained edge devices. The scripts are available on GitHub for future research.
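Dark experience replay of the kind described above stores past inputs together with both their labels and the logits the model produced at the time, then distils against them while learning the new task. The sketch below combines the three loss terms in the usual way; the weighting coefficients and the buffer interface are assumptions for illustration, not the exact DE-KWS recipe.

    import torch
    import torch.nn.functional as F

    def de_kws_step(model, new_x, new_y, buffer, alpha=0.5, beta=0.5):
        """One training step mixing the current task with dark-knowledge replay.
        buffer.sample() is assumed to return (inputs, stored_logits, stored_labels)."""
        loss = F.cross_entropy(model(new_x), new_y)             # current keywords

        buf_x, buf_logits, buf_y = buffer.sample()
        out = model(buf_x)
        loss = loss + alpha * F.mse_loss(out, buf_logits)       # match stored logits (dark knowledge)
        loss = loss + beta * F.cross_entropy(out, buf_y)        # rehearse stored labels
        return loss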
{"title":"Dark Experience for Incremental Keyword Spotting","authors":"Tianyi Peng, Yang Xiao","doi":"arxiv-2409.08153","DOIUrl":"https://doi.org/arxiv-2409.08153","url":null,"abstract":"Spoken keyword spotting (KWS) is crucial for identifying keywords within\u0000audio inputs and is widely used in applications like Apple Siri and Google\u0000Home, particularly on edge devices. Current deep learning-based KWS systems,\u0000which are typically trained on a limited set of keywords, can suffer from\u0000performance degradation when encountering new domains, a challenge often\u0000addressed through few-shot fine-tuning. However, this adaptation frequently\u0000leads to catastrophic forgetting, where the model's performance on original\u0000data deteriorates. Progressive continual learning (CL) strategies have been\u0000proposed to overcome this, but they face limitations such as the need for\u0000task-ID information and increased storage, making them less practical for\u0000lightweight devices. To address these challenges, we introduce Dark Experience\u0000for Keyword Spotting (DE-KWS), a novel CL approach that leverages dark\u0000knowledge to distill past experiences throughout the training process. DE-KWS\u0000combines rehearsal and distillation, using both ground truth labels and logits\u0000stored in a memory buffer to maintain model performance across tasks.\u0000Evaluations on the Google Speech Command dataset show that DE-KWS outperforms\u0000existing CL baselines in average accuracy without increasing model size,\u0000offering an effective solution for resource-constrained edge devices. The\u0000scripts are available on GitHub for the future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"94 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio Decoding by Inverse Problem Solving
Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin
We consider audio decoding as an inverse problem and solve it through diffusion posterior sampling. Explicit conditioning functions are developed for input signal measurements provided by an example of a transform domain perceptual audio codec. Viability is demonstrated by evaluating arbitrary pairings of a set of bitrates and task-agnostic prior models. For instance, we observe significant improvements on piano while maintaining speech performance when a speech model is replaced by a joint model trained on both speech and piano. With a more general music model, improved decoding compared to legacy methods is obtained for a broad range of content types and bitrates. The noisy mean model, underlying the proposed derivation of conditioning, enables a significant reduction of gradient evaluations for diffusion posterior sampling, compared to methods based on Tweedie's mean. Combining Tweedie's mean with our conditioning functions improves the objective performance. An audio demo is available at https://dpscodec-demo.github.io/.
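Diffusion posterior sampling, as used above, steers each reverse-diffusion step with the gradient of a measurement-consistency term evaluated through an estimate of the clean signal. The sketch below computes that guidance gradient in the style of generic DPS, with `A(x)` standing in for the codec's measurement operator; the caller would subtract a scaled version of this gradient from the sampler's unguided proposal. All interfaces shown are assumptions, not the paper's conditioning functions.

    import torch

    def dps_guidance(x_t, t_index, model, alpha_bar, measurement, A):
        """Gradient of the measurement-consistency term, evaluated through a
        Tweedie-style estimate of the clean signal from the current iterate."""
        x_t = x_t.detach().requires_grad_(True)
        eps = model(x_t, t_index)                        # predicted noise (assumed interface)
        x0_hat = (x_t - (1 - alpha_bar[t_index]).sqrt() * eps) / alpha_bar[t_index].sqrt()
        residual = torch.linalg.vector_norm(measurement - A(x0_hat))
        return torch.autograd.grad(residual, x_t)[0]     # guidance direction for this step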
{"title":"Audio Decoding by Inverse Problem Solving","authors":"Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin","doi":"arxiv-2409.07858","DOIUrl":"https://doi.org/arxiv-2409.07858","url":null,"abstract":"We consider audio decoding as an inverse problem and solve it through\u0000diffusion posterior sampling. Explicit conditioning functions are developed for\u0000input signal measurements provided by an example of a transform domain\u0000perceptual audio codec. Viability is demonstrated by evaluating arbitrary\u0000pairings of a set of bitrates and task-agnostic prior models. For instance, we\u0000observe significant improvements on piano while maintaining speech performance\u0000when a speech model is replaced by a joint model trained on both speech and\u0000piano. With a more general music model, improved decoding compared to legacy\u0000methods is obtained for a broad range of content types and bitrates. The noisy\u0000mean model, underlying the proposed derivation of conditioning, enables a\u0000significant reduction of gradient evaluations for diffusion posterior sampling,\u0000compared to methods based on Tweedie's mean. Combining Tweedie's mean with our\u0000conditioning functions improves the objective performance. An audio demo is\u0000available at https://dpscodec-demo.github.io/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning
Elizabeth Wilson, György Fazekas, Geraint Wiggins
This paper presents Tidal-MerzA, a novel system designed for collaborative performances between humans and a machine agent in the context of live coding, specifically focusing on the generation of musical patterns. Tidal-MerzA fuses two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and Tidal Fuzz, a computational framework. By integrating affective modelling with computational generation, this system leverages reinforcement learning techniques to dynamically adapt music composition parameters within the TidalCycles framework, ensuring both the affective quality of the patterns and their syntactic correctness. The development of Tidal-MerzA introduces two distinct agents: one focusing on the generation of mini-notation strings for musical expression, and another on the alignment of music with targeted affective states through reinforcement learning. This approach enhances the adaptability and creative potential of live coding practices and allows exploration of human-machine creative interactions. Tidal-MerzA advances the field of computational music generation, presenting a novel methodology for incorporating artificial intelligence into artistic practices.
{"title":"Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning","authors":"Elizabeth Wilson, György Fazekas, Geraint Wiggins","doi":"arxiv-2409.07918","DOIUrl":"https://doi.org/arxiv-2409.07918","url":null,"abstract":"This paper presents Tidal-MerzA, a novel system designed for collaborative\u0000performances between humans and a machine agent in the context of live coding,\u0000specifically focusing on the generation of musical patterns. Tidal-MerzA fuses\u0000two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and\u0000Tidal Fuzz, a computational framework. By integrating affective modelling with\u0000computational generation, this system leverages reinforcement learning\u0000techniques to dynamically adapt music composition parameters within the\u0000TidalCycles framework, ensuring both affective qualities to the patterns and\u0000syntactical correctness. The development of Tidal-MerzA introduces two distinct\u0000agents: one focusing on the generation of mini-notation strings for musical\u0000expression, and another on the alignment of music with targeted affective\u0000states through reinforcement learning. This approach enhances the adaptability\u0000and creative potential of live coding practices and allows exploration of\u0000human-machine creative interactions. Tidal-MerzA advances the field of\u0000computational music generation, presenting a novel methodology for\u0000incorporating artificial intelligence into artistic practices.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}