Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in TTS for estimating unknown alignments between text and speech. Because the algorithm searches for the most probable alignment with dynamic programming, caching all paths, its time complexity is $O(T \times S)$. The authors of Glow-TTS run this algorithm on the CPU and note that it is difficult to parallelize; however, we found that MAS can be parallelized along the text-length dimension and that CPU execution consumes an inordinate amount of time on inter-device copies. We therefore implemented a Triton kernel and a PyTorch JIT script that accelerate MAS on the GPU without inter-device copies. As a result, the Super-MAS Triton kernel is up to 72 times faster in the extreme-length case. The code is available at https://github.com/supertone-inc/super-monotonic-align.
{"title":"Super Monotonic Alignment Search","authors":"Junhyeok Lee, Hyeongju Kim","doi":"arxiv-2409.07704","DOIUrl":"https://doi.org/arxiv-2409.07704","url":null,"abstract":"Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most\u0000popular algorithm in TTS to estimate unknown alignments between text and\u0000speech. Since this algorithm needs to search for the most probable alignment\u0000with dynamic programming by caching all paths, the time complexity of the\u0000algorithm is $O(T times S)$. The authors of Glow-TTS run this algorithm on\u0000CPU, and while they mentioned it is difficult to parallelize, we found that MAS\u0000can be parallelized in text-length dimension and CPU execution consumes an\u0000inordinate amount of time for inter-device copy. Therefore, we implemented a\u0000Triton kernel and PyTorch JIT script to accelerate MAS on GPU without\u0000inter-device copy. As a result, Super-MAS Triton kernel is up to 72 times\u0000faster in the extreme-length case. The code is available at\u0000url{https://github.com/supertone-inc/super-monotonic-align}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent studies have shown that language models pretrained on text-only datasets often lack elementary visual knowledge, e.g., the colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists for auditory knowledge. To answer this question, we construct a new dataset, AuditoryBench, consisting of two novel tasks for evaluating auditory knowledge. Our analysis on this benchmark shows that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method that augments the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts so that our retrieval model can be queried efficiently. Then, we inject the retrieved audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is highly effective, achieving superior performance on AuditoryBench. The dataset and code are available at https://github.com/HJ-Ok/AudioBERT.
{"title":"AudioBERT: Audio Knowledge Augmented Language Model","authors":"Hyunjong Ok, Suho Yoo, Jaeho Lee","doi":"arxiv-2409.08199","DOIUrl":"https://doi.org/arxiv-2409.08199","url":null,"abstract":"Recent studies have identified that language models, pretrained on text-only\u0000datasets, often lack elementary visual knowledge, textit{e.g.,} colors of\u0000everyday objects. Motivated by this observation, we ask whether a similar\u0000shortcoming exists in terms of the textit{auditory} knowledge. To answer this\u0000question, we construct a new dataset called AuditoryBench, which consists of\u0000two novel tasks for evaluating auditory knowledge. Based on our analysis using\u0000the benchmark, we find that language models also suffer from a severe lack of\u0000auditory knowledge. To address this limitation, we propose AudioBERT, a novel\u0000method to augment the auditory knowledge of BERT through a retrieval-based\u0000approach. First, we detect auditory knowledge spans in prompts to query our\u0000retrieval model efficiently. Then, we inject audio knowledge into BERT and\u0000switch on low-rank adaptation for effective adaptation when audio knowledge is\u0000required. Our experiments demonstrate that AudioBERT is quite effective,\u0000achieving superior performance on the AuditoryBench. The dataset and code are\u0000available at bulurl{https://github.com/HJ-Ok/AudioBERT}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction of full texts generated by ASR systems from longer recordings, such as transcripts of podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. The dataset lets us correct errors in both full-text and segment contexts and address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset with a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts for both full-text and segment inputs, with output formats including directly corrected text and JSON-based error-correction pairs. Across various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting under different prompts, each with its own strengths and weaknesses, establishing a promising baseline for further research. The dataset is available on the website.
{"title":"Full-text Error Correction for Chinese Speech Recognition with Large Language Model","authors":"Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang","doi":"arxiv-2409.07790","DOIUrl":"https://doi.org/arxiv-2409.07790","url":null,"abstract":"Large Language Models (LLMs) have demonstrated substantial potential for\u0000error correction in Automatic Speech Recognition (ASR). However, most research\u0000focuses on utterances from short-duration speech recordings, which are the\u0000predominant form of speech data for supervised ASR training. This paper\u0000investigates the effectiveness of LLMs for error correction in full-text\u0000generated by ASR systems from longer speech recordings, such as transcripts\u0000from podcasts, news broadcasts, and meetings. First, we develop a Chinese\u0000dataset for full-text error correction, named ChFT, utilizing a pipeline that\u0000involves text-to-speech synthesis, ASR, and error-correction pair extractor.\u0000This dataset enables us to correct errors across contexts, including both\u0000full-text and segment, and to address a broader range of error types, such as\u0000punctuation restoration and inverse text normalization, thus making the\u0000correction process comprehensive. Second, we fine-tune a pre-trained LLM on the\u0000constructed dataset using a diverse set of prompts and target formats, and\u0000evaluate its performance on full-text error correction. Specifically, we design\u0000prompts based on full-text and segment, considering various output formats,\u0000such as directly corrected text and JSON-based error-correction pairs. Through\u0000various test settings, including homogeneous, up-to-date, and hard test sets,\u0000we find that the fine-tuned LLMs perform well in the full-text setting with\u0000different prompts, each presenting its own strengths and weaknesses. This\u0000establishes a promising baseline for further research. The dataset is available\u0000on the website.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfactory accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable way to expand beyond this small set of predefined tags by letting models learn the meaning of a tag from only a few human-provided examples and then apply it autonomously. We propose to integrate few-shot learning into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate several popular pre-trained features as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer-learning-based few-shot approach can effectively address the problem of automatically assigning long-tail tags with only limited labeled data.
{"title":"Music auto-tagging in the long tail: A few-shot approach","authors":"T. Aleksandra Ma, Alexander Lerch","doi":"arxiv-2409.07730","DOIUrl":"https://doi.org/arxiv-2409.07730","url":null,"abstract":"In the realm of digital music, using tags to efficiently organize and\u0000retrieve music from extensive databases is crucial for music catalog owners.\u0000Human tagging by experts is labor-intensive but mostly accurate, whereas\u0000automatic tagging through supervised learning has approached satisfying\u0000accuracy but is restricted to a predefined set of training tags. Few-shot\u0000learning offers a viable solution to expand beyond this small set of predefined\u0000tags by enabling models to learn from only a few human-provided examples to\u0000understand tag meanings and subsequently apply these tags autonomously. We\u0000propose to integrate few-shot learning methodology into multi-label music\u0000auto-tagging by using features from pre-trained models as inputs to a\u0000lightweight linear classifier, also known as a linear probe. We investigate\u0000different popular pre-trained features, as well as different few-shot\u0000parametrizations with varying numbers of classes and samples per class. Our\u0000experiments demonstrate that a simple model with pre-trained features can\u0000achieve performance close to state-of-the-art models while using significantly\u0000less training data, such as 20 samples per tag. Additionally, our linear probe\u0000performs competitively with leading models when trained on the entire training\u0000dataset. The results show that this transfer learning-based few-shot approach\u0000could effectively address the issue of automatically assigning long-tail tags\u0000with only limited labeled data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the speech signal, acoustic landmarks identify the times at which the acoustic manifestations of linguistically motivated distinctive features are most salient. Acoustic landmarks have been widely applied in various domains, including speech recognition, speech-based depression detection, clinical analysis of speech abnormalities, and the detection of disordered speech. However, no available dataset provides precise timing information for landmarks, which has proven crucial for downstream applications involving landmarks. In this paper, we selected the most useful acoustic landmarks based on previous research and annotated the TIMIT dataset with them, using a combination of phoneme boundary information and manual inspection. Moreover, previous landmark extraction tools were neither open source nor benchmarked; to address this, we developed an open-source Python-based landmark extraction tool and established a series of landmark detection baselines. The first of their kind, the dataset with precise landmark timing information, the landmark extraction tool, and the baselines are designed to support a wide variety of future research.
{"title":"Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction","authors":"Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tuende Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps","doi":"arxiv-2409.07969","DOIUrl":"https://doi.org/arxiv-2409.07969","url":null,"abstract":"In the speech signal, acoustic landmarks identify times when the acoustic\u0000manifestations of the linguistically motivated distinctive features are most\u0000salient. Acoustic landmarks have been widely applied in various domains,\u0000including speech recognition, speech depression detection, clinical analysis of\u0000speech abnormalities, and the detection of disordered speech. However, there is\u0000currently no dataset available that provides precise timing information for\u0000landmarks, which has been proven to be crucial for downstream applications\u0000involving landmarks. In this paper, we selected the most useful acoustic\u0000landmarks based on previous research and annotated the TIMIT dataset with them,\u0000based on a combination of phoneme boundary information and manual inspection.\u0000Moreover, previous landmark extraction tools were not open source or\u0000benchmarked, so to address this, we developed an open source Python-based\u0000landmark extraction tool and established a series of landmark detection\u0000baselines. The first of their kinds, the dataset with landmark precise timing\u0000information, landmark extraction tool and baselines are designed to support a\u0000wide variety of future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models incur relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by roughly 3.2x while maintaining or improving WER.
{"title":"Faster Speech-LLaMA Inference with Multi-token Prediction","authors":"Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli","doi":"arxiv-2409.08148","DOIUrl":"https://doi.org/arxiv-2409.08148","url":null,"abstract":"Large language models (LLMs) have become proficient at solving a wide variety\u0000of tasks, including those involving multi-modal inputs. In particular,\u0000instantiating an LLM (such as LLaMA) with a speech encoder and training it on\u0000paired data imparts speech recognition (ASR) abilities to the decoder-only\u0000model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of\u0000auto-regressive inference and the relatively large decoder, Speech-LLaMA models\u0000require relatively high inference time. In this work, we propose to speed up\u0000Speech-LLaMA inference by predicting multiple tokens in the same decoding step.\u0000We explore several model architectures that enable this, and investigate their\u0000performance using threshold-based and verification-based inference strategies.\u0000We also propose a prefix-based beam search decoding method that allows\u0000efficient minimum word error rate (MWER) training for such models. We evaluate\u0000our models on a variety of public benchmarks, where they reduce the number of\u0000decoder calls by ~3.2x while maintaining or improving WER performance.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no textual or speech resources beyond what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, of which only 5 hours have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hours of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models, with a best phone error rate of 30.4%, using a pipeline that continues pre-training the foundation model on the unlabelled set.
{"title":"The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language","authors":"Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar","doi":"arxiv-2409.08103","DOIUrl":"https://doi.org/arxiv-2409.08103","url":null,"abstract":"We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark\u0000corpus designed to push the limits of current approaches to low-resource speech\u0000recognition. Faetar, a Franco-Provenc{c}al variety spoken primarily in Italy,\u0000has no standard orthography, has virtually no existing textual or speech\u0000resources other than what is included in the benchmark, and is quite different\u0000from other forms of Franco-Provenc{c}al. The corpus comes from field\u0000recordings, most of which are noisy, for which only 5 hrs have matching\u0000transcriptions, and for which forced alignment is of variable quality. The\u0000corpus contains an additional 20 hrs of unlabelled speech. We report baseline\u0000results from state-of-the-art multilingual speech foundation models with a best\u0000phone error rate of 30.4%, using a pipeline that continues pre-training on the\u0000foundation model using the unlabelled set.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in automatic speaker verification (ASV) have been achieved by leveraging large-scale pretrained networks. In this study, we analyze the approaches taken toward this paradigm and highlight the importance of interlayer information processing. Accordingly, we present a novel approach for exploiting the multilayered nature of pretrained models for ASV, comprising a layer/frame-level network and two pooling stages, one along the layer axis and one along the frame axis. Specifically, we let a convolutional architecture directly process a stack of layer outputs. We then apply a channel attention-based scheme to gauge layer significance and squeeze the layer axis down to its most representative value. Finally, attentive statistics over the frame-level representations yield a single-vector speaker embedding. Comparative experiments using versatile data environments and diverse pretraining models validate the proposed approach. The experimental results demonstrate the stability of using multi-layer outputs when leveraging pretrained architectures. We then verify the superiority of the proposed ASV backend, which involves layer-wise operations, in terms of both performance and cost efficiency compared to the conventional method. An ablation study shows how the proposed interlayer processing helps maximize the advantage of utilizing pretrained models.
{"title":"Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification","authors":"Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han","doi":"arxiv-2409.07770","DOIUrl":"https://doi.org/arxiv-2409.07770","url":null,"abstract":"Recent advancements in automatic speaker verification (ASV) studies have been\u0000achieved by leveraging large-scale pretrained networks. In this study, we\u0000analyze the approaches toward such a paradigm and underline the significance of\u0000interlayer information processing as a result. Accordingly, we present a novel\u0000approach for exploiting the multilayered nature of pretrained models for ASV,\u0000which comprises a layer/frame-level network and two steps of pooling\u0000architectures for each layer and frame axis. Specifically, we let convolutional\u0000architecture directly processes a stack of layer outputs.Then, we present a\u0000channel attention-based scheme of gauging layer significance and squeeze the\u0000layer level with the most representative value. Finally, attentive statistics\u0000over frame-level representations yield a single vector speaker embedding.\u0000Comparative experiments are designed using versatile data environments and\u0000diverse pretraining models to validate the proposed approach. The experimental\u0000results demonstrate the stability of the approach using multi-layer outputs in\u0000leveraging pretrained architectures. Then, we verify the superiority of the\u0000proposed ASV backend structure, which involves layer-wise operations, in terms\u0000of performance improvement along with cost efficiency compared to the\u0000conventional method. The ablation study shows how the proposed interlayer\u0000processing aids in maximizing the advantage of utilizing pretrained models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the promising performance of state-of-the-art approaches to Parkinson's Disease (PD) detection, these approaches often analyze individual speech segments in isolation, which can lead to suboptimal results. The dysarthric cues that characterize speech impairments in PD patients are expected to be related across segments from different speakers, and isolated segment analysis fails to exploit these inter-segment relationships. Additionally, not all speech segments from PD patients exhibit clear dysarthric symptoms, introducing label noise that can harm the performance and generalizability of current approaches. To address these challenges, we propose a novel PD detection framework based on Graph Convolutional Networks (GCNs). By representing speech segments as nodes and capturing the similarity between segments through edges, our GCN model facilitates the aggregation of dysarthric cues across the graph, effectively exploiting segment relationships and mitigating the impact of label noise. Experimental results demonstrate the advantages of the proposed GCN model for PD detection and provide insights into its underlying mechanisms.
{"title":"Graph Neural Networks for Parkinsons Disease Detection","authors":"Shakeel A. Sheikh, Yacouba Kaloga, Ina Kodrasi","doi":"arxiv-2409.07884","DOIUrl":"https://doi.org/arxiv-2409.07884","url":null,"abstract":"Despite the promising performance of state of the art approaches for\u0000Parkinsons Disease (PD) detection, these approaches often analyze individual\u0000speech segments in isolation, which can lead to suboptimal results. Dysarthric\u0000cues that characterize speech impairments from PD patients are expected to be\u0000related across segments from different speakers. Isolated segment analysis\u0000fails to exploit these inter segment relationships. Additionally, not all\u0000speech segments from PD patients exhibit clear dysarthric symptoms, introducing\u0000label noise that can negatively affect the performance and generalizability of\u0000current approaches. To address these challenges, we propose a novel PD\u0000detection framework utilizing Graph Convolutional Networks (GCNs). By\u0000representing speech segments as nodes and capturing the similarity between\u0000segments through edges, our GCN model facilitates the aggregation of dysarthric\u0000cues across the graph, effectively exploiting segment relationships and\u0000mitigating the impact of label noise. Experimental results demonstrate\u0000theadvantages of the proposed GCN model for PD detection and provide insights\u0000into its underlying mechanisms","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomalous Sound Detection (ASD) has gained significant interest through the application of various Artificial Intelligence (AI) technologies in industrial settings. Despite its great potential, ASD can hardly be deployed readily at real production sites because of the generalization problem, which stems mainly from the difficulty of data collection and the complexity of environmental factors. This paper introduces a robust ASD model that leverages audio pre-trained models. Specifically, we fine-tune these models on machine operation data, employing SpecAug as a data augmentation strategy. Additionally, we investigate the impact of Low-Rank Adaptation (LoRA) tuning instead of full fine-tuning to address the problem of limited fine-tuning data. Our experiments on the DCASE2023 Task 2 dataset establish a new benchmark of 77.75% on the evaluation set, a significant improvement of 6.48% over previous state-of-the-art (SOTA) models, including top-tier traditional convolutional networks and speech pre-trained models, demonstrating the effectiveness of audio pre-trained models with LoRA tuning. Ablation studies are also conducted to showcase the efficacy of the proposed scheme.
{"title":"Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models","authors":"Xinhu Zheng, Anbai Jiang, Bing Han, Yanmin Qian, Pingyi Fan, Jia Liu, Wei-Qiang Zhang","doi":"arxiv-2409.07016","DOIUrl":"https://doi.org/arxiv-2409.07016","url":null,"abstract":"Anomalous Sound Detection (ASD) has gained significant interest through the\u0000application of various Artificial Intelligence (AI) technologies in industrial\u0000settings. Though possessing great potential, ASD systems can hardly be readily\u0000deployed in real production sites due to the generalization problem, which is\u0000primarily caused by the difficulty of data collection and the complexity of\u0000environmental factors. This paper introduces a robust ASD model that leverages\u0000audio pre-trained models. Specifically, we fine-tune these models using machine\u0000operation data, employing SpecAug as a data augmentation strategy.\u0000Additionally, we investigate the impact of utilizing Low-Rank Adaptation (LoRA)\u0000tuning instead of full fine-tuning to address the problem of limited data for\u0000fine-tuning. Our experiments on the DCASE2023 Task 2 dataset establish a new\u0000benchmark of 77.75% on the evaluation set, with a significant improvement of\u00006.48% compared with previous state-of-the-art (SOTA) models, including top-tier\u0000traditional convolutional networks and speech pre-trained models, which\u0000demonstrates the effectiveness of audio pre-trained models with LoRA tuning.\u0000Ablation studies are also conducted to showcase the efficacy of the proposed\u0000scheme.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}