Sample-Efficient Diffusion for Text-To-Speech Synthesis
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
arXiv:2409.03717 (2024-09-01)
This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that efficiently scales to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art autoregressive model VALL-E while using less than 2% of its training data.
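As a rough illustration of the conditioning pattern described above (not the authors' U-AT implementation), the sketch below shows a transformer-style denoiser over audio-autoencoder latents that cross-attends to character-aware language-model states; all dimensions, depths, and module names are assumptions.

```python
# Minimal sketch: diffusion denoiser over audio latents, conditioned on text
# encoder states via cross-attention. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM, TEXT_DIM, D_MODEL = 64, 768, 512  # assumed sizes

class DenoiserBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(D_MODEL, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(D_MODEL, 8, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D_MODEL, 4 * D_MODEL), nn.GELU(),
                                nn.Linear(4 * D_MODEL, D_MODEL))
        self.norms = nn.ModuleList(nn.LayerNorm(D_MODEL) for _ in range(3))

    def forward(self, x, text):
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norms[1](x), text, text)[0]  # text conditioning
        return x + self.ff(self.norms[2](x))

class LatentDenoiser(nn.Module):
    """Predicts clean audio latents from noisy latents, a timestep, and text states."""
    def __init__(self, depth=4):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, D_MODEL)
        self.text_proj = nn.Linear(TEXT_DIM, D_MODEL)
        self.t_embed = nn.Sequential(nn.Linear(1, D_MODEL), nn.SiLU(),
                                     nn.Linear(D_MODEL, D_MODEL))
        self.blocks = nn.ModuleList(DenoiserBlock() for _ in range(depth))
        self.out_proj = nn.Linear(D_MODEL, LATENT_DIM)

    def forward(self, noisy_latents, t, text_states):
        x = self.in_proj(noisy_latents) + self.t_embed(t[:, None, None].float())
        text = self.text_proj(text_states)
        for blk in self.blocks:
            x = blk(x, text)
        return self.out_proj(x)

x = torch.randn(2, 256, LATENT_DIM)     # noisy autoencoder latents
text = torch.randn(2, 120, TEXT_DIM)    # character-aware LM hidden states
pred = LatentDenoiser()(x, torch.randint(0, 1000, (2,)), text)
```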
{"title":"Sample-Efficient Diffusion for Text-To-Speech Synthesis","authors":"Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu","doi":"arxiv-2409.03717","DOIUrl":"https://doi.org/arxiv-2409.03717","url":null,"abstract":"This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm\u0000for effective speech synthesis in modest data regimes through latent diffusion.\u0000It is based on a novel diffusion architecture, that we call U-Audio Transformer\u0000(U-AT), that efficiently scales to long sequences and operates in the latent\u0000space of a pre-trained audio autoencoder. Conditioned on character-aware\u0000language model representations, SESD achieves impressive results despite\u0000training on less than 1k hours of speech - far less than current\u0000state-of-the-art systems. In fact, it synthesizes more intelligible speech than\u0000the state-of-the-art auto-regressive model, VALL-E, while using less than 2%\u0000the training data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation
Elona Shatri, George Fazekas
arXiv:2408.15002 (2024-08-27)
Optical Music Recognition (OMR) automates the transcription of musical notation from images into machine-readable formats like MusicXML, MEI, or MIDI, significantly reducing the cost and time of manual transcription. This study explores knowledge discovery in OMR by applying instance segmentation using Mask R-CNN to enhance the detection and delineation of musical symbols in sheet music. Unlike Optical Character Recognition (OCR), OMR must handle the intricate semantics of Common Western Music Notation (CWMN), where symbol meanings depend on shape, position, and context. Our approach leverages instance segmentation to manage the density and overlap of musical symbols, facilitating more precise information retrieval from music scores. Evaluations on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with our method achieving a mean Average Precision (mAP) of up to 59.70% in dense symbol environments, results comparable to object detection. Furthermore, using traditional computer vision techniques, we add a parallel step for staff detection to infer the pitch of the recognised symbols. This study emphasises the role of pixel-wise segmentation in advancing accurate music symbol recognition, contributing to knowledge discovery in OMR. Our findings indicate that instance segmentation provides more precise representations of musical symbols, particularly in densely populated scores, advancing OMR technology. We make our implementation, pre-processing scripts, trained models, and evaluation results publicly available to support further research and development.
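The Mask R-CNN fine-tuning step can be sketched with torchvision's detection API; the symbol-class count, mask-head width, and toy inputs below are illustrative assumptions rather than the authors' released configuration.

```python
# Hedged sketch: fine-tuning torchvision's Mask R-CNN for musical-symbol
# instance segmentation. Class count and inputs are placeholders.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_symbol_segmenter(num_classes: int):
    """num_classes includes the background class."""
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Swap the box head for one sized to the musical-symbol vocabulary.
    in_feats = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)
    # Swap the mask head likewise, so per-class pixel masks are predicted.
    in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
    return model

model = build_symbol_segmenter(num_classes=72)   # assumed symbol vocabulary size
model.train()
images = [torch.rand(3, 1200, 900)]              # one score page, values in [0, 1]
targets = [{
    "boxes": torch.tensor([[100., 150., 130., 210.]]),   # one notehead, say
    "labels": torch.tensor([5]),
    "masks": torch.zeros(1, 1200, 900, dtype=torch.uint8),
}]
losses = model(images, targets)                  # dict of classification/box/mask losses
```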
{"title":"Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation","authors":"Elona Shatri, George Fazekas","doi":"arxiv-2408.15002","DOIUrl":"https://doi.org/arxiv-2408.15002","url":null,"abstract":"Optical Music Recognition (OMR) automates the transcription of musical\u0000notation from images into machine-readable formats like MusicXML, MEI, or MIDI,\u0000significantly reducing the costs and time of manual transcription. This study\u0000explores knowledge discovery in OMR by applying instance segmentation using\u0000Mask R-CNN to enhance the detection and delineation of musical symbols in sheet\u0000music. Unlike Optical Character Recognition (OCR), OMR must handle the\u0000intricate semantics of Common Western Music Notation (CWMN), where symbol\u0000meanings depend on shape, position, and context. Our approach leverages\u0000instance segmentation to manage the density and overlap of musical symbols,\u0000facilitating more precise information retrieval from music scores. Evaluations\u0000on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with\u0000our method achieving a mean Average Precision (mAP) of up to 59.70% in dense\u0000symbol environments, achieving comparable results to object detection.\u0000Furthermore, using traditional computer vision techniques, we add a parallel\u0000step for staff detection to infer the pitch for the recognised symbols. This\u0000study emphasises the role of pixel-wise segmentation in advancing accurate\u0000music symbol recognition, contributing to knowledge discovery in OMR. Our\u0000findings indicate that instance segmentation provides more precise\u0000representations of musical symbols, particularly in densely populated scores,\u0000advancing OMR technology. We make our implementation, pre-processing scripts,\u0000trained models, and evaluation results publicly available to support further\u0000research and development.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition
Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li
arXiv:2408.09438 (2024-08-18)
To address the performance limitations of multimodal emotion recognition (MER) arising from inter-modal information fusion, we propose a novel MER framework based on multitask learning in which fusion occurs after alignment, called Foal-Net. The framework is designed to enhance the effectiveness of modality fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL) and cross-modal emotion label matching (MEM). First, AVEL aligns the emotional information in audio and video representations through contrastive learning. Then, a modal fusion network integrates the aligned features. Meanwhile, MEM assesses whether the emotions of the current sample pair are the same, assisting modal information fusion and guiding the model to focus more on emotional information. Experimental results on the IEMOCAP corpus show that Foal-Net outperforms state-of-the-art methods and that emotion alignment is necessary before modal fusion.
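A minimal sketch of the two auxiliary objectives, under our own assumptions about shapes and fusion (this is not the released Foal-Net code): a symmetric InfoNCE loss aligns audio and video emotion embeddings, and a small matching head predicts whether a pair shares an emotion label.

```python
# Sketch of contrastive audio-video alignment (AVEL-style) and a binary
# "same emotion?" matching task (MEM-style). Shapes and the additive fusion
# stand-in are assumptions for illustration.
import torch
import torch.nn.functional as F

def alignment_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE: matched audio/video clips are positives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def label_matching_loss(match_head, fused_i, fused_j, labels_i, labels_j):
    """Auxiliary task: does the pair share an emotion label?"""
    same = (labels_i == labels_j).float()
    logits = match_head(torch.cat([fused_i, fused_j], dim=-1)).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, same)

B, D = 8, 256                                  # assumed batch size / embedding dim
match_head = torch.nn.Linear(2 * D, 1)
audio, video = torch.randn(B, D), torch.randn(B, D)
labels = torch.randint(0, 4, (B,))             # e.g. 4 emotion classes
fused = audio + video                          # naive stand-in for the fusion network
perm = torch.randperm(B)
loss = alignment_loss(audio, video) + label_matching_loss(
    match_head, fused, fused[perm], labels, labels[perm])
```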
{"title":"Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition","authors":"Qifei Li, Yingming Gao, Yuhua Wen, Cong Wang, Ya Li","doi":"arxiv-2408.09438","DOIUrl":"https://doi.org/arxiv-2408.09438","url":null,"abstract":"To address the limitation in multimodal emotion recognition (MER) performance\u0000arising from inter-modal information fusion, we propose a novel MER framework\u0000based on multitask learning where fusion occurs after alignment, called\u0000Foal-Net. The framework is designed to enhance the effectiveness of modality\u0000fusion and includes two auxiliary tasks: audio-video emotion alignment (AVEL)\u0000and cross-modal emotion label matching (MEM). First, AVEL achieves alignment of\u0000emotional information in audio-video representations through contrastive\u0000learning. Then, a modal fusion network integrates the aligned features.\u0000Meanwhile, MEM assesses whether the emotions of the current sample pair are the\u0000same, providing assistance for modal information fusion and guiding the model\u0000to focus more on emotional information. The experimental results conducted on\u0000IEMOCAP corpus show that Foal-Net outperforms the state-of-the-art methods and\u0000emotion alignment is necessary before modal fusion.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Dataset, Notation Software, and Representation for Computational Schenkerian Analysis
Stephen Ni-Hahn, Weihan Xu, Jerry Yin, Rico Zhu, Simon Mak, Yue Jiang, Cynthia Rudin
arXiv:2408.07184 (2024-08-13)
Schenkerian Analysis (SchA) is a uniquely expressive method of music analysis, combining elements of melody, harmony, counterpoint, and form to describe the hierarchical structure supporting a work of music. However, despite its powerful analytical utility and potential to improve music understanding and generation, SchA has rarely been utilized by the computer music community. This is in large part due to the paucity of available high-quality data in a computer-readable format. With a larger corpus of Schenkerian data, it may be possible to infuse machine learning models with a deeper understanding of musical structure, thus leading to more "human" results. To encourage further research in Schenkerian analysis and its potential benefits for music informatics and generation, this paper presents three main contributions: 1) a new and growing dataset of SchAs, the largest in human- and computer-readable formats to date (>140 excerpts), 2) novel software for the visualization and collection of SchA data, and 3) a novel, flexible representation of SchA as a heterogeneous-edge graph data structure.
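To make the data structure concrete, here is one possible encoding (our own reading, not necessarily the paper's schema) of a small Schenkerian fragment as a graph whose edges carry heterogeneous types, using networkx; node attributes and edge kinds are illustrative.

```python
# Illustrative heterogeneous-edge graph: surface voice-leading edges versus
# deeper prolongation edges over the same note nodes.
import networkx as nx

sch_graph = nx.MultiDiGraph()

# Nodes are score events; attributes here are illustrative.
for note_id, (pitch, onset) in enumerate([("E5", 0.0), ("D5", 1.0), ("C5", 2.0)]):
    sch_graph.add_node(note_id, pitch=pitch, onset=onset)

# The edge *kind* distinguishes surface adjacency from structural claims
# (e.g. a 3-2-1 descent prolonged at a deeper level).
sch_graph.add_edge(0, 1, kind="voice_leading")
sch_graph.add_edge(1, 2, kind="voice_leading")
sch_graph.add_edge(0, 2, kind="prolongation", level="background")

structural = [(u, v, d) for u, v, d in sch_graph.edges(data=True)
              if d["kind"] == "prolongation"]
print(structural)   # [(0, 2, {'kind': 'prolongation', 'level': 'background'})]
```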
{"title":"A New Dataset, Notation Software, and Representation for Computational Schenkerian Analysis","authors":"Stephen Ni-Hahn, Weihan Xu, Jerry Yin, Rico Zhu, Simon Mak, Yue Jiang, Cynthia Rudin","doi":"arxiv-2408.07184","DOIUrl":"https://doi.org/arxiv-2408.07184","url":null,"abstract":"Schenkerian Analysis (SchA) is a uniquely expressive method of music\u0000analysis, combining elements of melody, harmony, counterpoint, and form to\u0000describe the hierarchical structure supporting a work of music. However,\u0000despite its powerful analytical utility and potential to improve music\u0000understanding and generation, SchA has rarely been utilized by the computer\u0000music community. This is in large part due to the paucity of available\u0000high-quality data in a computer-readable format. With a larger corpus of\u0000Schenkerian data, it may be possible to infuse machine learning models with a\u0000deeper understanding of musical structure, thus leading to more \"human\"\u0000results. To encourage further research in Schenkerian analysis and its\u0000potential benefits for music informatics and generation, this paper presents\u0000three main contributions: 1) a new and growing dataset of SchAs, the largest in\u0000human- and computer-readable formats to date (>140 excerpts), 2) a novel\u0000software for visualization and collection of SchA data, and 3) a novel,\u0000flexible representation of SchA as a heterogeneous-edge graph data structure.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling
Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon
arXiv:2408.05024 (2024-08-09)
Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for most pitches, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming to minimize some cost function (e.g. hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.
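The string-assignment idea can be sketched as follows; the toy random "model", the tokenisation, and the playability check are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: each note gets a string prediction (here random logits stand in for
# the masked-LM output), restricted to strings where the pitch is playable.
import torch

STRINGS = ["e", "B", "G", "D", "A", "E"]            # high to low, standard tuning
OPEN_MIDI = [64, 59, 55, 50, 45, 40]
MAX_FRET = 24

def playable_strings(midi_pitch: int):
    """Strings on which this pitch exists within the fret range."""
    return [s for s, open_p in enumerate(OPEN_MIDI) if 0 <= midi_pitch - open_p <= MAX_FRET]

def assign_strings(midi_notes, string_logits):
    """Pick, per note, the most probable string among the physically playable ones."""
    out = []
    for i, pitch in enumerate(midi_notes):
        allowed = playable_strings(pitch)
        best = max(allowed, key=lambda s: string_logits[i, s].item())
        out.append((pitch, STRINGS[best], pitch - OPEN_MIDI[best]))  # (pitch, string, fret)
    return out

notes = [64, 67, 72, 60]                            # an excerpt's MIDI pitches
logits = torch.randn(len(notes), len(STRINGS))      # stand-in for model predictions
print(assign_strings(notes, logits))
```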
{"title":"MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling","authors":"Drew Edwards, Xavier Riley, Pedro Sarmento, Simon Dixon","doi":"arxiv-2408.05024","DOIUrl":"https://doi.org/arxiv-2408.05024","url":null,"abstract":"Guitar tablatures enrich the structure of traditional music notation by\u0000assigning each note to a string and fret of a guitar in a particular tuning,\u0000indicating precisely where to play the note on the instrument. The problem of\u0000generating tablature from a symbolic music representation involves inferring\u0000this string and fret assignment per note across an entire composition or\u0000performance. On the guitar, multiple string-fret assignments are possible for\u0000most pitches, which leads to a large combinatorial space that prevents\u0000exhaustive search approaches. Most modern methods use constraint-based dynamic\u0000programming to minimize some cost function (e.g. hand position movement). In\u0000this work, we introduce a novel deep learning solution to symbolic guitar\u0000tablature estimation. We train an encoder-decoder Transformer model in a masked\u0000language modeling paradigm to assign notes to strings. The model is first\u0000pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on\u0000a curated set of professionally transcribed guitar performances. Given the\u0000subjective nature of assessing tablature quality, we conduct a user study\u0000amongst guitarists, wherein we ask participants to rate the playability of\u0000multiple versions of tablature for the same four-bar excerpt. The results\u0000indicate our system significantly outperforms competing algorithms.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma
arXiv:2408.00370 (2024-08-01)
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a singular latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resultant gesture sequence. This innovative approach guarantees high fidelity in gesture-speech synchronization while maintaining the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments substantiate our model's enhanced performance relative to contemporary state-of-the-art methods, demonstrating competitive outcomes with the DiTs architecture (Persona-Gestors) while optimizing memory usage and accelerating inference speed.
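A minimal sketch of the AdaLN conditioning pattern described above, with a placeholder recurrent mixer standing in for Mamba-2 and assumed dimensions: a single conditioning vector produces a scale, shift, and gate applied uniformly to every gesture token.

```python
# Sketch: adaptive layer normalization — one conditioning vector modulates all
# tokens identically; the GRU is only a stand-in sequence mixer, not Mamba-2.
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)       # scale, shift, gate
        self.mixer = nn.GRU(dim, dim, batch_first=True)  # placeholder sequence mixer

    def forward(self, x, cond):
        scale, shift, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift           # same modulation for all tokens
        mixed, _ = self.mixer(h)
        return x + gate.sigmoid() * mixed

tokens = torch.randn(2, 120, 256)    # noisy gesture tokens (batch, frames, dim)
cond = torch.randn(2, 512)           # fused speech / diffusion-step conditioning (assumed)
out = AdaLNBlock(256, 512)(tokens, cond)
```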
{"title":"DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework","authors":"Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma","doi":"arxiv-2408.00370","DOIUrl":"https://doi.org/arxiv-2408.00370","url":null,"abstract":"Speech-driven gesture generation is an emerging domain within virtual human\u0000creation, where current methods predominantly utilize Transformer-based\u0000architectures that necessitate extensive memory and are characterized by slow\u0000inference speeds. In response to these limitations, we propose\u0000textit{DiM-Gestures}, a novel end-to-end generative model crafted to create\u0000highly personalized 3D full-body gestures solely from raw speech audio,\u0000employing Mamba-based architectures. This model integrates a Mamba-based fuzzy\u0000feature extractor with a non-autoregressive Adaptive Layer Normalization\u0000(AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba\u0000framework and a WavLM pre-trained model, autonomously derives implicit,\u0000continuous fuzzy features, which are then unified into a singular latent\u0000feature. This feature is processed by the AdaLN Mamba-2, which implements a\u0000uniform conditional mechanism across all tokens to robustly model the interplay\u0000between the fuzzy features and the resultant gesture sequence. This innovative\u0000approach guarantees high fidelity in gesture-speech synchronization while\u0000maintaining the naturalness of the gestures. Employing a diffusion model for\u0000training and inference, our framework has undergone extensive subjective and\u0000objective evaluations on the ZEGGS and BEAT datasets. These assessments\u0000substantiate our model's enhanced performance relative to contemporary\u0000state-of-the-art methods, demonstrating competitive outcomes with the DiTs\u0000architecture (Persona-Gestors) while optimizing memory usage and accelerating\u0000inference speed.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation
Riyansha Singh (IIT Kanpur, India), Parinita Nema (IISER Bhopal, India), Vinod K Kurmi (IISER Bhopal, India)
arXiv:2407.19265 (2024-07-27)
In machine learning applications, gradual data ingress is common, especially in audio processing, where incremental learning is vital for real-time analytics. Few-shot class-incremental learning addresses challenges arising from limited incoming data. Existing methods often integrate additional trainable components or rely on a fixed embedding extractor after training on base sessions to mitigate catastrophic forgetting and the danger of model overfitting. However, using cross-entropy loss alone during base session training is suboptimal for audio data. To address this, we propose incorporating supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization, since it facilitates seamless integration of incremental classes upon their arrival. Experimental results on the NSynth and LibriSpeech datasets with 100 classes, as well as the ESC dataset with 50 and 10 classes, demonstrate state-of-the-art performance.
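The proposed base-session objective can be illustrated with a generic supervised contrastive (SupCon) loss; the temperature, embedding size, and batch handling below are assumptions, not the authors' exact recipe.

```python
# Sketch of a supervised contrastive loss: embeddings sharing a class label are
# pulled together, all other samples in the batch are pushed apart.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings from the audio encoder; labels: (N,) class ids."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.t() / temperature
    n = f.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=f.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_counts
    return per_anchor[pos_mask.any(1)].mean()                # skip anchors with no positive

emb = torch.randn(16, 128)                                   # assumed embedding size
lbl = torch.randint(0, 4, (16,))
loss = supervised_contrastive_loss(emb, lbl)
```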
{"title":"Towards Robust Few-shot Class Incremental Learning in Audio Classification using Contrastive Representation","authors":"Riyansha SinghIIT Kanpur, India, Parinita NemaIISER Bhopal, India, Vinod K KurmiIISER Bhopal, India","doi":"arxiv-2407.19265","DOIUrl":"https://doi.org/arxiv-2407.19265","url":null,"abstract":"In machine learning applications, gradual data ingress is common, especially\u0000in audio processing where incremental learning is vital for real-time\u0000analytics. Few-shot class-incremental learning addresses challenges arising\u0000from limited incoming data. Existing methods often integrate additional\u0000trainable components or rely on a fixed embedding extractor post-training on\u0000base sessions to mitigate concerns related to catastrophic forgetting and the\u0000dangers of model overfitting. However, using cross-entropy loss alone during\u0000base session training is suboptimal for audio data. To address this, we propose\u0000incorporating supervised contrastive learning to refine the representation\u0000space, enhancing discriminative power and leading to better generalization\u0000since it facilitates seamless integration of incremental classes, upon arrival.\u0000Experimental results on NSynth and LibriSpeech datasets with 100 classes, as\u0000well as ESC dataset with 50 and 10 classes, demonstrate state-of-the-art\u0000performance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implementation and Applications of WakeWords Integrated with Speaker Recognition: A Case Study
Alexandre Costa Ferro Filho, Elisa Ayumi Masasi de Oliveira, Iago Alves Brito, Pedro Martins Bittencourt
arXiv:2407.18985 (2024-07-25)
This paper explores the application of artificial intelligence techniques in audio and voice processing, focusing on the integration of wake words and speaker recognition for secure access in embedded systems. With the growing prevalence of voice-activated devices such as Amazon Alexa, ensuring secure and user-specific interactions has become paramount. Our study aims to enhance the security framework of these systems by leveraging wake words for initial activation and speaker recognition to validate user permissions. By incorporating these AI-driven methodologies, we propose a robust solution that restricts system usage to authorized individuals, thereby mitigating unauthorized access risks. This research delves into the algorithms and technologies underpinning wake word detection and speaker recognition, evaluates their effectiveness in real-world applications, and discusses the potential for their implementation in various embedded systems, emphasizing security and user convenience. The findings underscore the feasibility and advantages of employing these AI techniques to create secure, user-friendly voice-activated systems.
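The two-stage access-control flow can be sketched schematically; the wake-word detector, speaker embedder, and acceptance threshold below are placeholders you would supply, not components described in the paper.

```python
# Schematic pipeline: stay idle until the wake word fires, then gate the command
# on a speaker-verification score against enrolled embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def handle_audio(frame, wake_detector, embed_fn, enrolled, threshold=0.7):
    """Returns 'idle', 'rejected', or 'accepted:<user>' for one audio frame."""
    if not wake_detector(frame):                 # stage 1: wake-word spotting
        return "idle"
    emb = embed_fn(frame)                        # stage 2: speaker embedding
    user, score = max(((u, cosine(emb, e)) for u, e in enrolled.items()),
                      key=lambda t: t[1])
    return f"accepted:{user}" if score >= threshold else "rejected"

# Toy usage with random stand-ins for the detector, embedder, and enrolment set.
rng = np.random.default_rng(0)
enrolled = {"alice": rng.normal(size=192), "bob": rng.normal(size=192)}
print(handle_audio(rng.normal(size=16000),
                   wake_detector=lambda f: True,
                   embed_fn=lambda f: enrolled["alice"] + 0.1 * rng.normal(size=192),
                   enrolled=enrolled))
```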
{"title":"Implementation and Applications of WakeWords Integrated with Speaker Recognition: A Case Study","authors":"Alexandre Costa Ferro Filho, Elisa Ayumi Masasi de Oliveira, Iago Alves Brito, Pedro Martins Bittencourt","doi":"arxiv-2407.18985","DOIUrl":"https://doi.org/arxiv-2407.18985","url":null,"abstract":"This paper explores the application of artificial intelligence techniques in\u0000audio and voice processing, focusing on the integration of wake words and\u0000speaker recognition for secure access in embedded systems. With the growing\u0000prevalence of voice-activated devices such as Amazon Alexa, ensuring secure and\u0000user-specific interactions has become paramount. Our study aims to enhance the\u0000security framework of these systems by leveraging wake words for initial\u0000activation and speaker recognition to validate user permissions. By\u0000incorporating these AI-driven methodologies, we propose a robust solution that\u0000restricts system usage to authorized individuals, thereby mitigating\u0000unauthorized access risks. This research delves into the algorithms and\u0000technologies underpinning wake word detection and speaker recognition,\u0000evaluates their effectiveness in real-world applications, and discusses the\u0000potential for their implementation in various embedded systems, emphasizing\u0000security and user convenience. The findings underscore the feasibility and\u0000advantages of employing these AI techniques to create secure, user-friendly\u0000voice-activated systems.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141864588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Enhanced Classification of Abnormal Lung sound in Multi-breath: A Light Weight Multi-label and Multi-head Attention Classification Method
Yi-Wei Chua, Yun-Chien Cheng
arXiv:2407.10828 (2024-07-15)
This study aims to develop an auxiliary diagnostic system for classifying abnormal lung respiratory sounds, enhancing the accuracy of automatic abnormal breath sound classification through an innovative multi-label learning approach and a multi-head attention mechanism. Addressing the class imbalance and lack of diversity in existing respiratory sound datasets, our study employs a lightweight and highly accurate model, using a two-dimensional label set to represent multiple respiratory sound characteristics. Our method achieved a 59.2% ICBHI score in the four-category task on the ICBHI2017 dataset, demonstrating its advantages in terms of light weight and high accuracy. This study not only improves the accuracy of automatic diagnosis of lung respiratory sound abnormalities but also opens new possibilities for clinical applications.
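One way to read the two-dimensional label set is as independent crackle/wheeze flags, which together recover the four ICBHI categories; the sketch below (assumed sizes, not the authors' model) pairs a small CNN with multi-head attention pooling and a multi-label BCE loss.

```python
# Sketch: lightweight multi-label lung-sound classifier with attention pooling.
# The crackle/wheeze reading of the label set and all sizes are assumptions.
import torch
import torch.nn as nn

class LungSoundNet(nn.Module):
    def __init__(self, dim=128, n_labels=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_labels)

    def forward(self, mel):                       # mel: (batch, 1, n_mels, frames)
        h = self.cnn(mel)                         # (batch, dim, n_mels/4, frames/4)
        h = h.flatten(2).transpose(1, 2)          # (batch, tokens, dim)
        h, _ = self.attn(h, h, h)                 # attention over time-frequency tokens
        return self.head(h.mean(dim=1))           # multi-label logits

model = LungSoundNet()
mel = torch.randn(4, 1, 64, 256)                  # a batch of breath-cycle spectrograms
target = torch.tensor([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])  # crackle / wheeze flags
loss = nn.BCEWithLogitsLoss()(model(mel), target)
```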
{"title":"Towards Enhanced Classification of Abnormal Lung sound in Multi-breath: A Light Weight Multi-label and Multi-head Attention Classification Method","authors":"Yi-Wei Chua, Yun-Chien Cheng","doi":"arxiv-2407.10828","DOIUrl":"https://doi.org/arxiv-2407.10828","url":null,"abstract":"This study aims to develop an auxiliary diagnostic system for classifying\u0000abnormal lung respiratory sounds, enhancing the accuracy of automatic abnormal\u0000breath sound classification through an innovative multi-label learning approach\u0000and multi-head attention mechanism. Addressing the issue of class imbalance and\u0000lack of diversity in existing respiratory sound datasets, our study employs a\u0000lightweight and highly accurate model, using a two-dimensional label set to\u0000represent multiple respiratory sound characteristics. Our method achieved a\u000059.2% ICBHI score in the four-category task on the ICBHI2017 dataset,\u0000demonstrating its advantages in terms of lightweight and high accuracy. This\u0000study not only improves the accuracy of automatic diagnosis of lung respiratory\u0000sound abnormalities but also opens new possibilities for clinical applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control
Yu-Hua Chen, Yen-Tung Yeh, Yuan-Chiao Cheng, Jui-Te Wu, Yu-Hsiang Ho, Jyh-Shing Roger Jang, Yi-Hsuan Yang
arXiv:2407.10646 (2024-07-15)
Replicating analog device circuits through neural audio effect modeling has garnered increasing interest in recent years. Existing work has predominantly focused on a one-to-one emulation strategy, modeling specific devices individually. In this paper, we tackle the less-explored scenario of one-to-many emulation, utilizing conditioning mechanisms to emulate multiple guitar amplifiers through a single neural model. For condition representation, we use contrastive learning to build a tone embedding encoder that extracts style-related features of various amplifiers, leveraging a dataset of comprehensive amplifier settings. Targeting zero-shot application scenarios, we also examine various strategies for tone embedding representation, evaluating a referenced tone embedding against two retrieval-based embedding methods for amplifiers unseen during training. Our findings showcase the efficacy and potential of the proposed methods in achieving versatile one-to-many amplifier modeling, contributing a foundational step towards zero-shot audio modeling applications.
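A hedged sketch of the one-to-many conditioning idea: a tone embedding from a (stand-in) reference encoder modulates a small waveform-to-waveform network via feature-wise scale and shift. The FiLM-style mechanism and all sizes are our assumptions, not the paper's exact architecture.

```python
# Sketch: one network emulates many amps by conditioning each block on a tone
# embedding; the embedding here is random, standing in for the contrastive encoder.
import torch
import torch.nn as nn

class ConditionedAmpBlock(nn.Module):
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=9, padding=4)
        self.film = nn.Linear(cond_dim, 2 * channels)     # per-amp scale and shift

    def forward(self, x, tone_emb):
        scale, shift = self.film(tone_emb).unsqueeze(-1).chunk(2, dim=1)
        return torch.tanh(self.conv(x) * (1 + scale) + shift)

class OneToManyAmp(nn.Module):
    def __init__(self, channels=16, cond_dim=128, depth=3):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, 1)
        self.blocks = nn.ModuleList(ConditionedAmpBlock(channels, cond_dim)
                                    for _ in range(depth))
        self.out = nn.Conv1d(channels, 1, 1)

    def forward(self, dry, tone_emb):                      # dry: (batch, 1, samples)
        x = self.inp(dry)
        for blk in self.blocks:
            x = blk(x, tone_emb)
        return self.out(x)

dry = torch.randn(2, 1, 44100)          # one second of dry guitar at 44.1 kHz
tone = torch.randn(2, 128)              # embedding from the tone encoder (stand-in)
wet = OneToManyAmp()(dry, tone)         # emulated amplifier output
```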
{"title":"Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control","authors":"Yu-Hua Chen, Yen-Tung Yeh, Yuan-Chiao Cheng, Jui-Te Wu, Yu-Hsiang Ho, Jyh-Shing Roger Jang, Yi-Hsuan Yang","doi":"arxiv-2407.10646","DOIUrl":"https://doi.org/arxiv-2407.10646","url":null,"abstract":"Replicating analog device circuits through neural audio effect modeling has\u0000garnered increasing interest in recent years. Existing work has predominantly\u0000focused on a one-to-one emulation strategy, modeling specific devices\u0000individually. In this paper, we tackle the less-explored scenario of\u0000one-to-many emulation, utilizing conditioning mechanisms to emulate multiple\u0000guitar amplifiers through a single neural model. For condition representation,\u0000we use contrastive learning to build a tone embedding encoder that extracts\u0000style-related features of various amplifiers, leveraging a dataset of\u0000comprehensive amplifier settings. Targeting zero-shot application scenarios, we\u0000also examine various strategies for tone embedding representation, evaluating\u0000referenced tone embedding against two retrieval-based embedding methods for\u0000amplifiers unseen in the training time. Our findings showcase the efficacy and\u0000potential of the proposed methods in achieving versatile one-to-many amplifier\u0000modeling, contributing a foundational step towards zero-shot audio modeling\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}