Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models
Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan
Automatic speech recognition (ASR) systems are known to be vulnerable to adversarial attacks. This paper addresses detection and defence against targeted white-box attacks on speech signals for ASR systems. While existing work has utilised diffusion models (DMs) to purify adversarial examples, achieving state-of-the-art results in keyword spotting tasks, their effectiveness for more complex tasks such as sentence-level ASR remains unexplored. Additionally, the impact of the number of forward diffusion steps on performance is not well understood. In this paper, we systematically investigate the use of DMs for defending against adversarial attacks on sentences and examine the effect of varying forward diffusion steps. Through comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate that two forward diffusion steps can completely defend against adversarial attacks on sentences. Moreover, we introduce a novel, training-free approach for detecting adversarial attacks by leveraging a pre-trained DM. Our experimental results show that this method can detect adversarial attacks with high accuracy.
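A minimal sketch of the purification defence described above is given below: run a small number of forward diffusion steps on the incoming waveform, then denoise it with the reverse chain before handing the audio to the ASR model. The `model(x, t)` noise predictor and the linear beta schedule are illustrative assumptions, not the authors' exact setup.

    import torch

    def purify(x, model, n_forward=2, betas=None):
        """Adversarial purification sketch: add n_forward diffusion steps of
        Gaussian noise, then run the reverse (denoising) chain back to t=0."""
        if betas is None:
            betas = torch.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)

        # Forward diffusion: jump directly to step n_forward using the closed form
        t = n_forward - 1
        noise = torch.randn_like(x)
        x_t = alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * noise

        # Reverse diffusion: standard ancestral sampling back to step 0
        for s in reversed(range(n_forward)):
            eps = model(x_t, torch.tensor([s]))           # predicted noise (assumed interface)
            coef = betas[s] / (1 - alpha_bar[s]).sqrt()
            mean = (x_t - coef * eps) / alphas[s].sqrt()
            if s > 0:
                x_t = mean + betas[s].sqrt() * torch.randn_like(x_t)
            else:
                x_t = mean
        return x_t  # purified waveform, passed on to the ASR system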
{"title":"Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models","authors":"Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan","doi":"arxiv-2409.07936","DOIUrl":"https://doi.org/arxiv-2409.07936","url":null,"abstract":"Automatic speech recognition (ASR) systems are known to be vulnerable to\u0000adversarial attacks. This paper addresses detection and defence against\u0000targeted white-box attacks on speech signals for ASR systems. While existing\u0000work has utilised diffusion models (DMs) to purify adversarial examples,\u0000achieving state-of-the-art results in keyword spotting tasks, their\u0000effectiveness for more complex tasks such as sentence-level ASR remains\u0000unexplored. Additionally, the impact of the number of forward diffusion steps\u0000on performance is not well understood. In this paper, we systematically\u0000investigate the use of DMs for defending against adversarial attacks on\u0000sentences and examine the effect of varying forward diffusion steps. Through\u0000comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate\u0000that two forward diffusion steps can completely defend against adversarial\u0000attacks on sentences. Moreover, we introduce a novel, training-free approach\u0000for detecting adversarial attacks by leveraging a pre-trained DM. Our\u0000experimental results show that this method can detect adversarial attacks with\u0000high accuracy.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
In Mandarin, the tonal contours of monosyllabic words produced in isolation or in careful speech are characterized by four lexical tones: a high-level tone (T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However, in spontaneous speech, the actual tonal realization of monosyllabic words can deviate significantly from these canonical tones due to intra-syllabic co-articulation and inter-syllabic co-articulation with adjacent tones. In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on their research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and on the way in which words' meanings co-determine pitch contours on the other. We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus, using generalized additive (mixed) models to decompose a given observed pitch contour into a set of component pitch contours. We show that the tonal context substantially modifies a word's canonical tone. Once the effect of tonal context is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a high tone and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is realized based on the preceding tone, emerges as a low tone in its own right, modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. We also show that word, and even more so word sense, co-determine words' F0 contours. Analyses of variable importance using random forests further support the substantial effect of tonal context and an effect of word sense.
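The decomposition of an observed F0 contour into additive component contours can be illustrated with a deliberately simplified least-squares sketch. It stands in for the generalized additive model the authors actually fit (which uses far richer smooths, tensor products, and random effects); the Gaussian bump basis and all variable names here are assumptions for illustration.

    import numpy as np

    def bump_basis(t, n_bumps=8, width=0.08):
        """Gaussian bump basis over normalised time t in [0, 1]."""
        t = np.asarray(t, dtype=float)
        centres = np.linspace(0, 1, n_bumps)
        return np.exp(-0.5 * ((t[:, None] - centres[None, :]) / width) ** 2)

    def fit_additive_contours(t, f0, tone_id, context_id, n_tones, n_contexts):
        """Fit f0(t) ~ smooth(tone) + smooth(tonal context) by least squares,
        so each categorical predictor contributes its own component contour."""
        B = bump_basis(t)                                   # (n_obs, n_bumps)
        n_b = B.shape[1]
        X = np.zeros((len(f0), n_b * (n_tones + n_contexts)))
        for i in range(len(f0)):
            X[i, tone_id[i] * n_b:(tone_id[i] + 1) * n_b] = B[i]
            off = n_b * n_tones
            X[i, off + context_id[i] * n_b:off + (context_id[i] + 1) * n_b] = B[i]
        coef, *_ = np.linalg.lstsq(X, f0, rcond=None)
        tone_curves = coef[:n_b * n_tones].reshape(n_tones, n_b)
        context_curves = coef[n_b * n_tones:].reshape(n_contexts, n_b)
        grid = np.linspace(0, 1, 50)
        G = bump_basis(grid)
        return G @ tone_curves.T, G @ context_curves.T      # component contours on a grid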
{"title":"A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin","authors":"Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen","doi":"arxiv-2409.07891","DOIUrl":"https://doi.org/arxiv-2409.07891","url":null,"abstract":"In Mandarin, the tonal contours of monosyllabic words produced in isolation\u0000or in careful speech are characterized by four lexical tones: a high-level tone\u0000(T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However,\u0000in spontaneous speech, the actual tonal realization of monosyllabic words can\u0000deviate significantly from these canonical tones due to intra-syllabic\u0000co-articulation and inter-syllabic co-articulation with adjacent tones. In\u0000addition, Chuang et al. (2024) recently reported that the tonal contours of\u0000disyllabic Mandarin words with T2-T4 tone pattern are co-determined by their\u0000meanings. Following up on their research, we present a corpus-based\u0000investigation of how the pitch contours of monosyllabic words are realized in\u0000spontaneous conversational Mandarin, focusing on the effects of contextual\u0000predictors on the one hand, and the way in words' meanings co-determine pitch\u0000contours on the other hand. We analyze the F0 contours of 3824 tokens of 63\u0000different word types in a spontaneous Taiwan Mandarin corpus, using the\u0000generalized additive (mixed) model to decompose a given observed pitch contour\u0000into a set of component pitch contours. We show that the tonal context\u0000substantially modify a word's canonical tone. Once the effect of tonal context\u0000is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a\u0000high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0),\u0000which in standard descriptions, is realized based on the preceding tone,\u0000emerges as a low tone in its own right, modified by the other predictors in the\u0000same way as the standard tones T1, T2, T3, and T4. We also show that word, and\u0000even more so, word sense, co-determine words' F0 contours. Analyses of variable\u0000importance using random forests further supported the substantial effect of\u0000tonal context and an effect of word sense.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TSELM: Target Speaker Extraction using Discrete Tokens and Language Models
Beilong Tang, Bang Zeng, Ming Li
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
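Framing token prediction as classification, as described above, reduces the training objective to per-frame cross-entropy over a discrete vocabulary. The sketch below shows that loss for a single discretized layer; the tensor shapes and vocabulary size are assumptions for illustration rather than TSELM's actual configuration.

    import torch
    import torch.nn.functional as F

    def token_prediction_loss(logits, target_tokens):
        """logits: (batch, time, vocab) scores predicted by the language model.
        target_tokens: (batch, time) indices of the discretized WavLM units.
        Cross-entropy turns audio generation into per-frame classification."""
        batch, time, vocab = logits.shape
        return F.cross_entropy(logits.reshape(batch * time, vocab),
                               target_tokens.reshape(batch * time))

    # Example with random data standing in for model outputs
    logits = torch.randn(2, 100, 1024)            # assumed vocabulary of 1024 units
    targets = torch.randint(0, 1024, (2, 100))
    loss = token_prediction_loss(logits, targets)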
{"title":"TSELM: Target Speaker Extraction using Discrete Tokens and Language Models","authors":"Beilong Tang, Bang Zeng, Ming Li","doi":"arxiv-2409.07841","DOIUrl":"https://doi.org/arxiv-2409.07841","url":null,"abstract":"We propose TSELM, a novel target speaker extraction network that leverages\u0000discrete tokens and language models. TSELM utilizes multiple discretized layers\u0000from WavLM as input tokens and incorporates cross-attention mechanisms to\u0000integrate target speaker information. Language models are employed to capture\u0000the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the\u0000audio from the tokens. By applying a cross-entropy loss, TSELM models the\u0000probability distribution of output tokens, thus converting the complex\u0000regression problem of audio generation into a classification task. Experimental\u0000results show that TSELM achieves excellent results in speech quality and\u0000comparable results in speech intelligibility.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations
Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu
This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on over 10,000 hours of singing and user feedback revealed our model significantly improves sound quality and timbre accuracy, aligning with our objectives and advancing voice conversion technology. Furthermore, this research advances zero-shot SVC and sets the stage for future work on discrete speech representation, emphasizing the preservation of rhyme.
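A clustering-based phoneme-like representation of the kind mentioned above is commonly obtained by running k-means over frame-level features from a speech front-end and replacing each frame with its cluster index, which discards much of the timbre. The sketch below illustrates the idea with MFCC features and scikit-learn; the feature choice and number of clusters are assumptions, not the paper's recipe.

    import numpy as np
    import librosa
    from sklearn.cluster import KMeans

    def clustered_units(wav_paths, n_units=128, sr=16000):
        """Fit k-means over frame-level features pooled across recordings and
        return a function mapping a waveform to a sequence of unit indices."""
        frames = []
        for path in wav_paths:
            y, _ = librosa.load(path, sr=sr)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T   # (frames, 20)
            frames.append(mfcc)
        km = KMeans(n_clusters=n_units, n_init=10).fit(np.concatenate(frames))

        def encode(y):
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T
            return km.predict(mfcc)                                # content-like unit ids
        return encode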
{"title":"Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations","authors":"Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu","doi":"arxiv-2409.08039","DOIUrl":"https://doi.org/arxiv-2409.08039","url":null,"abstract":"This study presents an innovative Zero-Shot any-to-any Singing Voice\u0000Conversion (SVC) method, leveraging a novel clustering-based phoneme\u0000representation to effectively separate content, timbre, and singing style. This\u0000approach enables precise voice characteristic manipulation. We discovered that\u0000datasets with fewer recordings per artist are more susceptible to timbre\u0000leakage. Extensive testing on over 10,000 hours of singing and user feedback\u0000revealed our model significantly improves sound quality and timbre accuracy,\u0000aligning with our objectives and advancing voice conversion technology.\u0000Furthermore, this research advances zero-shot SVC and sets the stage for future\u0000work on discrete speech representation, emphasizing the preservation of rhyme.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Symbolic Pop Music Generation with Graph Neural Networks
Wen Qing Lim, Jinhua Liang, Huan Zhang
Music is inherently made up of complex structures, and representing them as graphs helps to capture multiple levels of relationships. While music generation has been explored using various deep generative techniques, research on graph-based music generation is sparse. Earlier graph-based work addressed only melody generation, and recent works that generate polyphonic music do not account for longer-term structure. In this paper, we explore a multi-graph approach to represent both the rhythmic patterns and the phrase structure of Chinese pop music. Consequently, we propose a two-step approach that aims to generate polyphonic music with coherent rhythm and long-term structure. We train two Variational Auto-Encoder networks - one on a MIDI dataset to generate 4-bar phrases, and another on song structure labels to generate full song structure. Our work shows that the models are able to learn most of the structural nuances in the training dataset, including chord and pitch frequency distributions, and phrase attributes.
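Both stages described above rest on the standard VAE mechanics of encoding to a latent Gaussian, sampling with the reparameterisation trick, and decoding. The sketch below shows that core computation for a generic phrase autoencoder; the layer sizes and the flattened input representation are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class PhraseVAE(nn.Module):
        """Minimal VAE core: encode a flattened 4-bar phrase, sample a latent,
        decode it back. Reconstruction + KL terms form the training loss."""
        def __init__(self, input_dim=512, latent_dim=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            self.to_mu = nn.Linear(256, latent_dim)
            self.to_logvar = nn.Linear(256, latent_dim)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, input_dim))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
            recon = self.decoder(z)
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return recon, kl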
{"title":"Hierarchical Symbolic Pop Music Generation with Graph Neural Networks","authors":"Wen Qing Lim, Jinhua Liang, Huan Zhang","doi":"arxiv-2409.08155","DOIUrl":"https://doi.org/arxiv-2409.08155","url":null,"abstract":"Music is inherently made up of complex structures, and representing them as\u0000graphs helps to capture multiple levels of relationships. While music\u0000generation has been explored using various deep generation techniques, research\u0000on graph-related music generation is sparse. Earlier graph-based music\u0000generation worked only on generating melodies, and recent works to generate\u0000polyphonic music do not account for longer-term structure. In this paper, we\u0000explore a multi-graph approach to represent both the rhythmic patterns and\u0000phrase structure of Chinese pop music. Consequently, we propose a two-step\u0000approach that aims to generate polyphonic music with coherent rhythm and\u0000long-term structure. We train two Variational Auto-Encoder networks - one on a\u0000MIDI dataset to generate 4-bar phrases, and another on song structure labels to\u0000generate full song structure. Our work shows that the models are able to learn\u0000most of the structural nuances in the training dataset, including chord and\u0000pitch frequency distributions, and phrase attributes.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings
Tanisha Hisariya, Huan Zhang, Jinhua Liang
Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual arts, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, pairing paintings with corresponding music for effective training and evaluation. Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data. Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, with audio-emotion text similarity confirmed by the pre-trained CLAP model to demonstrate high alignment between generated music and text. This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.
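Fréchet Audio Distance, one of the metrics listed above, compares Gaussian fits to the embedding distributions of reference and generated audio. A minimal sketch of the computation follows, assuming the embeddings have already been extracted; the embedding model itself is not assumed here.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_audio_distance(ref_emb, gen_emb):
        """ref_emb, gen_emb: (n_clips, dim) arrays of audio embeddings.
        FAD is the Frechet distance between Gaussians fitted to each set."""
        mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
        cov_r = np.cov(ref_emb, rowvar=False)
        cov_g = np.cov(gen_emb, rowvar=False)
        covmean = sqrtm(cov_r @ cov_g).real        # discard tiny imaginary parts
        diff = mu_r - mu_g
        return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))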
{"title":"Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings","authors":"Tanisha Hisariya, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07827","DOIUrl":"https://doi.org/arxiv-2409.07827","url":null,"abstract":"Rapid advancements in artificial intelligence have significantly enhanced\u0000generative tasks involving music and images, employing both unimodal and\u0000multimodal approaches. This research develops a model capable of generating\u0000music that resonates with the emotions depicted in visual arts, integrating\u0000emotion labeling, image captioning, and language models to transform visual\u0000inputs into musical compositions. Addressing the scarcity of aligned art and\u0000music data, we curated the Emotion Painting Music Dataset, pairing paintings\u0000with corresponding music for effective training and evaluation. Our dual-stage\u0000framework converts images to text descriptions of emotional content and then\u0000transforms these descriptions into music, facilitating efficient learning with\u0000minimal data. Performance is evaluated using metrics such as Fr'echet Audio\u0000Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL\u0000divergence, with audio-emotion text similarity confirmed by the pre-trained\u0000CLAP model to demonstrate high alignment between generated music and text. This\u0000synthesis tool bridges visual art and music, enhancing accessibility for the\u0000visually impaired and opening avenues in educational and therapeutic\u0000applications by providing enriched multi-sensory experiences.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification
Soufiyan Bahadi, Eric Plourde, Jean Rouat
Researchers are exploring novel computational paradigms such as sparse coding and neuromorphic computing to bridge the efficiency gap between the human brain and conventional computers in complex tasks. A key area of focus is neuromorphic audio processing. While the Locally Competitive Algorithm has emerged as a promising solution for sparse coding, offering potential for real-time and low-power processing on neuromorphic hardware, its applications in neuromorphic speech classification have not been thoroughly studied. The Adaptive Locally Competitive Algorithm builds upon the Locally Competitive Algorithm by dynamically adjusting the modulation parameters of the filter bank to fine-tune the filters' sensitivity. This adaptability enhances lateral inhibition, improving reconstruction quality, sparsity, and convergence time, which is crucial for real-time applications. This paper demonstrates the potential of the Locally Competitive Algorithm and its adaptive variant as robust feature extractors for neuromorphic speech classification. Results show that the Locally Competitive Algorithm achieves better speech classification accuracy at the expense of higher power consumption compared to the LAUSCHER cochlea model used for benchmarking. On the other hand, the Adaptive Locally Competitive Algorithm mitigates this power consumption issue without compromising the accuracy. The dynamic power consumption is reduced to a range of 0.004 to 13 milliwatts on neuromorphic hardware, three orders of magnitude less than setups using Graphics Processing Units. These findings position the Adaptive Locally Competitive Algorithm as a compelling solution for efficient speech classification systems, promising substantial advancements in balancing speech classification accuracy and power efficiency.
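The Locally Competitive Algorithm referenced above performs sparse coding by letting neuron potentials evolve under lateral inhibition and reading out thresholded activations. The sketch below shows the standard LCA update with a soft threshold on a fixed dictionary; the step size and threshold are illustrative assumptions, and the paper's adaptive filter-bank modulation is not reproduced here.

    import numpy as np

    def lca_sparse_code(signal, dictionary, lam=0.1, step=0.01, n_steps=200):
        """signal: (d,) input frame; dictionary: (d, n_neurons), unit-norm columns.
        Returns a sparse code approximately minimising ||signal - D a||^2 + lam*||a||_1."""
        b = dictionary.T @ signal                                         # driving input
        gram = dictionary.T @ dictionary - np.eye(dictionary.shape[1])    # lateral inhibition
        u = np.zeros_like(b)                                              # membrane potentials
        for _ in range(n_steps):
            a = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)             # soft threshold
            u += step * (b - u - gram @ a)                                # LCA dynamics
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)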
{"title":"Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification","authors":"Soufiyan Bahadi, Eric Plourde, Jean Rouat","doi":"arxiv-2409.08188","DOIUrl":"https://doi.org/arxiv-2409.08188","url":null,"abstract":"Researchers are exploring novel computational paradigms such as sparse coding\u0000and neuromorphic computing to bridge the efficiency gap between the human brain\u0000and conventional computers in complex tasks. A key area of focus is\u0000neuromorphic audio processing. While the Locally Competitive Algorithm has\u0000emerged as a promising solution for sparse coding, offering potential for\u0000real-time and low-power processing on neuromorphic hardware, its applications\u0000in neuromorphic speech classification have not been thoroughly studied. The\u0000Adaptive Locally Competitive Algorithm builds upon the Locally Competitive\u0000Algorithm by dynamically adjusting the modulation parameters of the filter bank\u0000to fine-tune the filters' sensitivity. This adaptability enhances lateral\u0000inhibition, improving reconstruction quality, sparsity, and convergence time,\u0000which is crucial for real-time applications. This paper demonstrates the\u0000potential of the Locally Competitive Algorithm and its adaptive variant as\u0000robust feature extractors for neuromorphic speech classification. Results show\u0000that the Locally Competitive Algorithm achieves better speech classification\u0000accuracy at the expense of higher power consumption compared to the LAUSCHER\u0000cochlea model used for benchmarking. On the other hand, the Adaptive Locally\u0000Competitive Algorithm mitigates this power consumption issue without\u0000compromising the accuracy. The dynamic power consumption is reduced to a range\u0000of 0.004 to 13 milliwatts on neuromorphic hardware, three orders of magnitude\u0000less than setups using Graphics Processing Units. These findings position the\u0000Adaptive Locally Competitive Algorithm as a compelling solution for efficient\u0000speech classification systems, promising substantial advancements in balancing\u0000speech classification accuracy and power efficiency.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dark Experience for Incremental Keyword Spotting
Tianyi Peng, Yang Xiao
Spoken keyword spotting (KWS) is crucial for identifying keywords within audio inputs and is widely used in applications like Apple Siri and Google Home, particularly on edge devices. Current deep learning-based KWS systems, which are typically trained on a limited set of keywords, can suffer from performance degradation when encountering new domains, a challenge often addressed through few-shot fine-tuning. However, this adaptation frequently leads to catastrophic forgetting, where the model's performance on original data deteriorates. Progressive continual learning (CL) strategies have been proposed to overcome this, but they face limitations such as the need for task-ID information and increased storage, making them less practical for lightweight devices. To address these challenges, we introduce Dark Experience for Keyword Spotting (DE-KWS), a novel CL approach that leverages dark knowledge to distill past experiences throughout the training process. DE-KWS combines rehearsal and distillation, using both ground truth labels and logits stored in a memory buffer to maintain model performance across tasks. Evaluations on the Google Speech Command dataset show that DE-KWS outperforms existing CL baselines in average accuracy without increasing model size, offering an effective solution for resource-constrained edge devices. The scripts are available on GitHub for future research.
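Dark experience replay of the kind described above stores past inputs together with both their labels and the logits the model produced at the time, then distils against them while learning the new task. The sketch below combines the three loss terms in the usual way; the weighting coefficients and the buffer interface are assumptions for illustration, not the exact DE-KWS recipe.

    import torch
    import torch.nn.functional as F

    def de_kws_step(model, new_x, new_y, buffer, alpha=0.5, beta=0.5):
        """One training step mixing the current task with dark-knowledge replay.
        buffer.sample() is assumed to return (inputs, stored_logits, stored_labels)."""
        loss = F.cross_entropy(model(new_x), new_y)             # current keywords

        buf_x, buf_logits, buf_y = buffer.sample()
        out = model(buf_x)
        loss = loss + alpha * F.mse_loss(out, buf_logits)       # match stored logits (dark knowledge)
        loss = loss + beta * F.cross_entropy(out, buf_y)        # rehearse stored labels
        return loss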
{"title":"Dark Experience for Incremental Keyword Spotting","authors":"Tianyi Peng, Yang Xiao","doi":"arxiv-2409.08153","DOIUrl":"https://doi.org/arxiv-2409.08153","url":null,"abstract":"Spoken keyword spotting (KWS) is crucial for identifying keywords within\u0000audio inputs and is widely used in applications like Apple Siri and Google\u0000Home, particularly on edge devices. Current deep learning-based KWS systems,\u0000which are typically trained on a limited set of keywords, can suffer from\u0000performance degradation when encountering new domains, a challenge often\u0000addressed through few-shot fine-tuning. However, this adaptation frequently\u0000leads to catastrophic forgetting, where the model's performance on original\u0000data deteriorates. Progressive continual learning (CL) strategies have been\u0000proposed to overcome this, but they face limitations such as the need for\u0000task-ID information and increased storage, making them less practical for\u0000lightweight devices. To address these challenges, we introduce Dark Experience\u0000for Keyword Spotting (DE-KWS), a novel CL approach that leverages dark\u0000knowledge to distill past experiences throughout the training process. DE-KWS\u0000combines rehearsal and distillation, using both ground truth labels and logits\u0000stored in a memory buffer to maintain model performance across tasks.\u0000Evaluations on the Google Speech Command dataset show that DE-KWS outperforms\u0000existing CL baselines in average accuracy without increasing model size,\u0000offering an effective solution for resource-constrained edge devices. The\u0000scripts are available on GitHub for the future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"94 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio Decoding by Inverse Problem Solving
Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin
We consider audio decoding as an inverse problem and solve it through diffusion posterior sampling. Explicit conditioning functions are developed for input signal measurements provided by an example of a transform domain perceptual audio codec. Viability is demonstrated by evaluating arbitrary pairings of a set of bitrates and task-agnostic prior models. For instance, we observe significant improvements on piano while maintaining speech performance when a speech model is replaced by a joint model trained on both speech and piano. With a more general music model, improved decoding compared to legacy methods is obtained for a broad range of content types and bitrates. The noisy mean model, underlying the proposed derivation of conditioning, enables a significant reduction of gradient evaluations for diffusion posterior sampling, compared to methods based on Tweedie's mean. Combining Tweedie's mean with our conditioning functions improves the objective performance. An audio demo is available at https://dpscodec-demo.github.io/.
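Diffusion posterior sampling, as used above, steers each reverse-diffusion step with the gradient of a measurement-consistency term evaluated through an estimate of the clean signal. The sketch below computes that guidance gradient in the style of generic DPS, with `A(x)` standing in for the codec's measurement operator; the caller would subtract a scaled version of this gradient from the sampler's unguided proposal. All interfaces shown are assumptions, not the paper's conditioning functions.

    import torch

    def dps_guidance(x_t, t_index, model, alpha_bar, measurement, A):
        """Gradient of the measurement-consistency term, evaluated through a
        Tweedie-style estimate of the clean signal from the current iterate."""
        x_t = x_t.detach().requires_grad_(True)
        eps = model(x_t, t_index)                        # predicted noise (assumed interface)
        x0_hat = (x_t - (1 - alpha_bar[t_index]).sqrt() * eps) / alpha_bar[t_index].sqrt()
        residual = torch.linalg.vector_norm(measurement - A(x0_hat))
        return torch.autograd.grad(residual, x_t)[0]     # guidance direction for this step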
{"title":"Audio Decoding by Inverse Problem Solving","authors":"Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin","doi":"arxiv-2409.07858","DOIUrl":"https://doi.org/arxiv-2409.07858","url":null,"abstract":"We consider audio decoding as an inverse problem and solve it through\u0000diffusion posterior sampling. Explicit conditioning functions are developed for\u0000input signal measurements provided by an example of a transform domain\u0000perceptual audio codec. Viability is demonstrated by evaluating arbitrary\u0000pairings of a set of bitrates and task-agnostic prior models. For instance, we\u0000observe significant improvements on piano while maintaining speech performance\u0000when a speech model is replaced by a joint model trained on both speech and\u0000piano. With a more general music model, improved decoding compared to legacy\u0000methods is obtained for a broad range of content types and bitrates. The noisy\u0000mean model, underlying the proposed derivation of conditioning, enables a\u0000significant reduction of gradient evaluations for diffusion posterior sampling,\u0000compared to methods based on Tweedie's mean. Combining Tweedie's mean with our\u0000conditioning functions improves the objective performance. An audio demo is\u0000available at https://dpscodec-demo.github.io/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning
Elizabeth Wilson, György Fazekas, Geraint Wiggins
This paper presents Tidal-MerzA, a novel system designed for collaborative performances between humans and a machine agent in the context of live coding, specifically focusing on the generation of musical patterns. Tidal-MerzA fuses two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and Tidal Fuzz, a computational framework. By integrating affective modelling with computational generation, this system leverages reinforcement learning techniques to dynamically adapt music composition parameters within the TidalCycles framework, ensuring both the affective quality of the patterns and their syntactic correctness. The development of Tidal-MerzA introduces two distinct agents: one focusing on the generation of mini-notation strings for musical expression, and another on the alignment of music with targeted affective states through reinforcement learning. This approach enhances the adaptability and creative potential of live coding practices and allows exploration of human-machine creative interactions. Tidal-MerzA advances the field of computational music generation, presenting a novel methodology for incorporating artificial intelligence into artistic practices.
{"title":"Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning","authors":"Elizabeth Wilson, György Fazekas, Geraint Wiggins","doi":"arxiv-2409.07918","DOIUrl":"https://doi.org/arxiv-2409.07918","url":null,"abstract":"This paper presents Tidal-MerzA, a novel system designed for collaborative\u0000performances between humans and a machine agent in the context of live coding,\u0000specifically focusing on the generation of musical patterns. Tidal-MerzA fuses\u0000two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and\u0000Tidal Fuzz, a computational framework. By integrating affective modelling with\u0000computational generation, this system leverages reinforcement learning\u0000techniques to dynamically adapt music composition parameters within the\u0000TidalCycles framework, ensuring both affective qualities to the patterns and\u0000syntactical correctness. The development of Tidal-MerzA introduces two distinct\u0000agents: one focusing on the generation of mini-notation strings for musical\u0000expression, and another on the alignment of music with targeted affective\u0000states through reinforcement learning. This approach enhances the adaptability\u0000and creative potential of live coding practices and allows exploration of\u0000human-machine creative interactions. Tidal-MerzA advances the field of\u0000computational music generation, presenting a novel methodology for\u0000incorporating artificial intelligence into artistic practices.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}