
arXiv - EE - Audio and Speech Processing: Latest Publications

Super Monotonic Alignment Search
Pub Date : 2024-09-12 DOI: arxiv-2409.07704
Junhyeok Lee, Hyeongju Kim
Monotonic alignment search (MAS), introduced by Glow-TTS, is one of the most popular algorithms in TTS for estimating unknown alignments between text and speech. Since this algorithm needs to search for the most probable alignment with dynamic programming by caching all paths, its time complexity is $O(T \times S)$. The authors of Glow-TTS run this algorithm on CPU, and while they mentioned it is difficult to parallelize, we found that MAS can be parallelized in the text-length dimension and that CPU execution consumes an inordinate amount of time for inter-device copy. Therefore, we implemented a Triton kernel and a PyTorch JIT script to accelerate MAS on GPU without inter-device copy. As a result, the Super-MAS Triton kernel is up to 72 times faster in the extreme-length case. The code is available at https://github.com/supertone-inc/super-monotonic-align.
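To make the quadratic dynamic program concrete, here is a minimal NumPy sketch of the standard MAS recursion (maximize the cumulative log-likelihood along a monotonic path, then backtrack). It is an illustrative reference implementation assuming the number of text tokens does not exceed the number of frames; it is not the authors' Triton kernel or JIT script.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Illustrative O(T*S) MAS dynamic program (sketch, assumes T <= S).

    log_p: [T, S] log-likelihoods of text token t emitting mel frame s.
    Returns a 0/1 matrix assigning every frame to exactly one token such
    that the assigned token index never decreases across frames.
    """
    T, S = log_p.shape
    Q = np.full((T, S), -np.inf)
    # Forward pass: Q[t, s] = log_p[t, s] + max(Q[t, s-1], Q[t-1, s-1]).
    for s in range(S):
        for t in range(min(T, s + 1)):
            if s == 0:
                Q[t, s] = log_p[t, s] if t == 0 else -np.inf
                continue
            stay = Q[t, s - 1]
            advance = Q[t - 1, s - 1] if t > 0 else -np.inf
            Q[t, s] = log_p[t, s] + max(stay, advance)
    # Backtracking: recover the most probable monotonic path.
    path = np.zeros((T, S), dtype=np.int64)
    t = T - 1
    for s in range(S - 1, -1, -1):
        path[t, s] = 1
        if s > 0 and t > 0 and (t == s or Q[t - 1, s - 1] >= Q[t, s - 1]):
            t -= 1
    return path

# Example: align 3 text tokens to 6 mel frames.
rng = np.random.default_rng(0)
print(monotonic_alignment_search(rng.standard_normal((3, 6))))
```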
Citations: 0
AudioBERT: Audio Knowledge Augmented Language Model
Pub Date : 2024-09-12 DOI: arxiv-2409.08199
Hyunjong Ok, Suho Yoo, Jaeho Lee
Recent studies have identified that language models, pretrained on text-only datasets, often lack elementary visual knowledge, e.g., colors of everyday objects. Motivated by this observation, we ask whether a similar shortcoming exists in terms of auditory knowledge. To answer this question, we construct a new dataset called AuditoryBench, which consists of two novel tasks for evaluating auditory knowledge. Based on our analysis using the benchmark, we find that language models also suffer from a severe lack of auditory knowledge. To address this limitation, we propose AudioBERT, a novel method to augment the auditory knowledge of BERT through a retrieval-based approach. First, we detect auditory knowledge spans in prompts to query our retrieval model efficiently. Then, we inject audio knowledge into BERT and switch on low-rank adaptation for effective adaptation when audio knowledge is required. Our experiments demonstrate that AudioBERT is quite effective, achieving superior performance on the AuditoryBench. The dataset and code are available at https://github.com/HJ-Ok/AudioBERT.
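The abstract names two mechanisms: injecting retrieved audio knowledge and switching on low-rank adaptation when it is needed. The toy PyTorch sketch below illustrates those two ideas only in spirit; the `AudioInjector` class, the dimensions, and the prepend-one-audio-token scheme are illustrative assumptions, not the released AudioBERT code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.02)    # B stays zero, so the initial update is zero
        self.scale = alpha / rank
        self.enabled = True                  # "switch on" adaptation when audio knowledge is needed

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * (x @ self.A.T) @ self.B.T
        return out

class AudioInjector(nn.Module):
    """Illustrative: project a retrieved audio embedding to the text model's
    hidden size and prepend it to the token embeddings."""
    def __init__(self, audio_dim: int = 512, hidden: int = 768):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden)

    def forward(self, token_embeds, audio_embed):
        audio_token = self.proj(audio_embed).unsqueeze(1)   # [B, 1, H]
        return torch.cat([audio_token, token_embeds], dim=1)

# Toy usage with random tensors (a real system would use BERT token embeddings
# and a trained audio retrieval model).
tokens = torch.randn(2, 10, 768)
audio = torch.randn(2, 512)
injected = AudioInjector()(tokens, audio)
lora_layer = LoRALinear(nn.Linear(768, 768))
print(lora_layer(injected).shape)  # torch.Size([2, 11, 768])
```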
Citations: 0
Full-text Error Correction for Chinese Speech Recognition with Large Language Model
Pub Date : 2024-09-12 DOI: arxiv-2409.07790
Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, including both full-text and segment, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
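As a rough illustration of the prompt/target design space the abstract describes (full-text versus segment scope, directly corrected text versus JSON error-correction pairs), here is a minimal prompt-builder sketch. The wording, function names, and JSON schema are assumptions for illustration, not the paper's actual prompts.

```python
import json

def build_prompt(hypothesis: str, scope: str = "full-text",
                 output_format: str = "json") -> str:
    """Illustrative prompt builder for ASR full-text error correction.

    scope:         "full-text" corrects the whole transcript at once,
                   "segment" corrects a single segment in isolation.
    output_format: "text" asks for the corrected transcript directly,
                   "json" asks for error-correction pairs.
    """
    task = ("Correct recognition errors, restore punctuation, and apply "
            "inverse text normalization in the following ASR transcript.")
    if output_format == "json":
        task += ' Answer with a JSON list of {"error": ..., "correction": ...} pairs.'
    else:
        task += " Answer with the corrected transcript only."
    return f"[{scope}] {task}\n\nTranscript:\n{hypothesis}"

def parse_json_pairs(llm_output: str):
    """Parse error-correction pairs, tolerating malformed model output."""
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        return []

print(build_prompt("da jia hao huan ying shou ting ben qi bo ke", scope="segment"))
```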
Citations: 0
Music auto-tagging in the long tail: A few-shot approach
Pub Date : 2024-09-12 DOI: arxiv-2409.07730
T. Aleksandra Ma, Alexander Lerch
In the realm of digital music, using tags to efficiently organize and retrieve music from extensive databases is crucial for music catalog owners. Human tagging by experts is labor-intensive but mostly accurate, whereas automatic tagging through supervised learning has approached satisfying accuracy but is restricted to a predefined set of training tags. Few-shot learning offers a viable solution to expand beyond this small set of predefined tags by enabling models to learn from only a few human-provided examples to understand tag meanings and subsequently apply these tags autonomously. We propose to integrate few-shot learning methodology into multi-label music auto-tagging by using features from pre-trained models as inputs to a lightweight linear classifier, also known as a linear probe. We investigate different popular pre-trained features, as well as different few-shot parametrizations with varying numbers of classes and samples per class. Our experiments demonstrate that a simple model with pre-trained features can achieve performance close to state-of-the-art models while using significantly less training data, such as 20 samples per tag. Additionally, our linear probe performs competitively with leading models when trained on the entire training dataset. The results show that this transfer learning-based few-shot approach could effectively address the issue of automatically assigning long-tail tags with only limited labeled data.
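A linear probe over frozen pretrained embeddings is a standard construction, so a short sketch may help make the few-shot setup concrete: sample a handful of positives per tag and fit a multi-label linear classifier on the corresponding embeddings. The sampling rule and classifier choice below are generic assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def train_linear_probe(embeddings, labels, samples_per_tag=20, seed=0):
    """Fit a multi-label linear probe on a few labeled examples per tag.

    embeddings: [N, D] features from a frozen pretrained audio model.
    labels:     [N, num_tags] binary multi-label matrix.
    """
    rng = np.random.default_rng(seed)
    keep = []
    for tag in range(labels.shape[1]):
        pos = np.flatnonzero(labels[:, tag] == 1)
        if len(pos) == 0:
            continue
        keep.extend(rng.choice(pos, size=min(samples_per_tag, len(pos)),
                               replace=False))
    keep = np.unique(keep)
    probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    probe.fit(embeddings[keep], labels[keep])
    return probe

# Toy example with random "embeddings" standing in for pretrained features.
X = np.random.randn(500, 128)
Y = (np.random.rand(500, 10) > 0.8).astype(int)
probe = train_linear_probe(X, Y)
print(probe.predict_proba(X[:2]).shape)  # (2, 10)
```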
Citations: 0
Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction
Pub Date : 2024-09-12 DOI: arxiv-2409.07969
Xiangyu Zhang, Daijiao Liu, Tianyi Xiao, Cihan Xiao, Tuende Szalay, Mostafa Shahin, Beena Ahmed, Julien Epps
In the speech signal, acoustic landmarks identify times when the acoustic manifestations of the linguistically motivated distinctive features are most salient. Acoustic landmarks have been widely applied in various domains, including speech recognition, speech depression detection, clinical analysis of speech abnormalities, and the detection of disordered speech. However, there is currently no dataset available that provides precise timing information for landmarks, which has been proven to be crucial for downstream applications involving landmarks. In this paper, we selected the most useful acoustic landmarks based on previous research and annotated the TIMIT dataset with them, based on a combination of phoneme boundary information and manual inspection. Moreover, previous landmark extraction tools were not open source or benchmarked, so to address this, we developed an open-source Python-based landmark extraction tool and established a series of landmark detection baselines. The first of their kind, the dataset with landmark precise timing information, the landmark extraction tool, and the baselines are designed to support a wide variety of future research.
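As a rough illustration of how phoneme boundary information can seed landmark annotation, the sketch below maps TIMIT-style .PHN boundary lines to candidate landmark times. The mapping rule (flag consonant-vowel transitions) and the label format are simplified assumptions for illustration, not the released toolkit's logic.

```python
# TIMIT .PHN files list "start_sample end_sample phone" per line at 16 kHz.
SAMPLE_RATE = 16000
VOWELS = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uh", "uw",
          "ow", "ey", "ay", "oy", "aw", "er"}

def candidate_landmarks(phn_lines):
    """Map phone boundaries to coarse landmark candidates (illustrative rule:
    any boundary between a non-vowel and a vowel is a candidate landmark)."""
    segs = []
    for line in phn_lines:
        start, end, phone = line.split()
        segs.append((int(start), int(end), phone))
    landmarks = []
    for (s0, e0, p0), (s1, e1, p1) in zip(segs, segs[1:]):
        if (p0 in VOWELS) != (p1 in VOWELS):   # consonant <-> vowel transition
            landmarks.append((e0 / SAMPLE_RATE, f"{p0}->{p1}"))
    return landmarks

print(candidate_landmarks(["0 2400 h#", "2400 4000 sh", "4000 6500 iy", "6500 8000 s"]))
```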
Citations: 0
Faster Speech-LLaMA Inference with Multi-token Prediction
Pub Date : 2024-09-12 DOI: arxiv-2409.08148
Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli
Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by ~3.2x while maintaining or improving WER performance.
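To give a feel for the general idea of multi-token prediction with a threshold-based acceptance rule, here is a toy PyTorch sketch: several heads predict the next K tokens from one decoder state, and predictions are accepted left to right while their confidence stays above a threshold. The head structure and acceptance rule are generic assumptions, not the architectures proposed in the paper.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Illustrative: K independent projection heads predict the next K tokens
    from the same decoder hidden state (one decoder call instead of K)."""
    def __init__(self, hidden: int, vocab: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, vocab) for _ in range(k))

    def forward(self, h):                                  # h: [B, hidden]
        return torch.stack([head(h) for head in self.heads], dim=1)  # [B, K, vocab]

def threshold_accept(logits, threshold=0.7):
    """Accept predicted tokens left to right while their probability stays
    above the threshold; stop at the first low-confidence position."""
    probs = logits.softmax(dim=-1)
    conf, tokens = probs.max(dim=-1)                       # [B, K]
    accepted = []
    for c, t in zip(conf[0].tolist(), tokens[0].tolist()):
        if c < threshold:
            break
        accepted.append(t)
    return accepted

h = torch.randn(1, 512)
logits = MultiTokenHead(hidden=512, vocab=1000, k=4)(h)
print(threshold_accept(logits))
```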
Citations: 0
The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language
Pub Date : 2024-09-12 DOI: arxiv-2409.08103
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.
Citations: 0
Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification
Pub Date : 2024-09-12 DOI: arxiv-2409.07770
Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks. In this study, we analyze the approaches toward such a paradigm and, as a result, underline the significance of interlayer information processing. Accordingly, we present a novel approach for exploiting the multilayered nature of pretrained models for ASV, which comprises a layer/frame-level network and two steps of pooling architectures along the layer and frame axes. Specifically, we let the convolutional architecture directly process a stack of layer outputs. Then, we present a channel attention-based scheme for gauging layer significance and squeeze the layer level with the most representative value. Finally, attentive statistics over frame-level representations yield a single-vector speaker embedding. Comparative experiments are designed using versatile data environments and diverse pretraining models to validate the proposed approach. The experimental results demonstrate the stability of the approach using multi-layer outputs in leveraging pretrained architectures. Then, we verify the superiority of the proposed ASV backend structure, which involves layer-wise operations, in terms of performance improvement along with cost efficiency compared to the conventional method. The ablation study shows how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
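To illustrate the flavor of pooling over both the layer and frame axes, here is a simplified PyTorch sketch that weights layers with learned attention and then applies attentive statistics pooling over frames to produce a single speaker embedding. This is a loose interpretation under assumed dimensions, not the paper's exact two-step architecture.

```python
import torch
import torch.nn as nn

class LayerWeightedAttentiveStatsPool(nn.Module):
    """Simplified sketch: weight the layers of a pretrained model with learned
    attention, then pool over frames with attentive mean and standard deviation."""
    def __init__(self, num_layers: int, dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.frame_attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, feats):                       # feats: [B, L, T, D]
        w = self.layer_logits.softmax(dim=0)        # layer significance weights
        x = (feats * w.view(1, -1, 1, 1)).sum(dim=1)           # [B, T, D]
        a = self.frame_attn(x).softmax(dim=1)                  # [B, T, 1]
        mean = (a * x).sum(dim=1)
        std = ((a * (x - mean.unsqueeze(1)) ** 2).sum(dim=1) + 1e-6).sqrt()
        return torch.cat([mean, std], dim=-1)       # single speaker embedding

pool = LayerWeightedAttentiveStatsPool(num_layers=13, dim=768)
print(pool(torch.randn(2, 13, 200, 768)).shape)     # torch.Size([2, 1536])
```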
Citations: 0
Graph Neural Networks for Parkinsons Disease Detection
Pub Date : 2024-09-12 DOI: arxiv-2409.07884
Shakeel A. Sheikh, Yacouba Kaloga, Ina Kodrasi
Despite the promising performance of state-of-the-art approaches for Parkinson's Disease (PD) detection, these approaches often analyze individual speech segments in isolation, which can lead to suboptimal results. Dysarthric cues that characterize speech impairments from PD patients are expected to be related across segments from different speakers. Isolated segment analysis fails to exploit these inter-segment relationships. Additionally, not all speech segments from PD patients exhibit clear dysarthric symptoms, introducing label noise that can negatively affect the performance and generalizability of current approaches. To address these challenges, we propose a novel PD detection framework utilizing Graph Convolutional Networks (GCNs). By representing speech segments as nodes and capturing the similarity between segments through edges, our GCN model facilitates the aggregation of dysarthric cues across the graph, effectively exploiting segment relationships and mitigating the impact of label noise. Experimental results demonstrate the advantages of the proposed GCN model for PD detection and provide insights into its underlying mechanisms.
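The core construction (segments as nodes, similarity as edges, graph convolution to aggregate cues) can be sketched in a few lines of PyTorch. The cosine-similarity thresholding, single-layer GCN, and embedding sizes below are illustrative assumptions rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def similarity_adjacency(x, threshold=0.5):
    """Build an adjacency matrix from cosine similarity between segment embeddings."""
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1)
    adj = (sim > threshold).float()
    adj.fill_diagonal_(1.0)                        # self-loops
    deg = adj.sum(dim=1, keepdim=True)
    return adj / deg                               # row-normalized

class SimpleGCN(nn.Module):
    """One-layer graph convolution followed by a per-node PD/healthy classifier."""
    def __init__(self, in_dim=256, hidden=64, num_classes=2):
        super().__init__()
        self.lin = nn.Linear(in_dim, hidden)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, x, adj):
        h = torch.relu(self.lin(adj @ x))          # aggregate dysarthric cues over neighbours
        return self.cls(h)

segments = torch.randn(50, 256)                    # 50 speech-segment embeddings
logits = SimpleGCN()(segments, similarity_adjacency(segments))
print(logits.shape)                                # torch.Size([50, 2])
```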
Citations: 0
Improving Anomalous Sound Detection via Low-Rank Adaptation Fine-Tuning of Pre-Trained Audio Models
Pub Date : 2024-09-11 DOI: arxiv-2409.07016
Xinhu Zheng, Anbai Jiang, Bing Han, Yanmin Qian, Pingyi Fan, Jia Liu, Wei-Qiang Zhang
Anomalous Sound Detection (ASD) has gained significant interest through the application of various Artificial Intelligence (AI) technologies in industrial settings. Though possessing great potential, ASD systems can hardly be readily deployed in real production sites due to the generalization problem, which is primarily caused by the difficulty of data collection and the complexity of environmental factors. This paper introduces a robust ASD model that leverages audio pre-trained models. Specifically, we fine-tune these models using machine operation data, employing SpecAug as a data augmentation strategy. Additionally, we investigate the impact of utilizing Low-Rank Adaptation (LoRA) tuning instead of full fine-tuning to address the problem of limited data for fine-tuning. Our experiments on the DCASE2023 Task 2 dataset establish a new benchmark of 77.75% on the evaluation set, with a significant improvement of 6.48% compared with previous state-of-the-art (SOTA) models, including top-tier traditional convolutional networks and speech pre-trained models, which demonstrates the effectiveness of audio pre-trained models with LoRA tuning. Ablation studies are also conducted to showcase the efficacy of the proposed scheme.
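SpecAug-style masking, the augmentation strategy named in the abstract, is straightforward to sketch: zero out random frequency bands and time spans of a log-mel spectrogram. The mask counts and widths below are illustrative parameters, not the values used in the paper.

```python
import torch

def spec_augment(mel, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=20):
    """Apply SpecAugment-style frequency and time masking to a [mels, frames]
    log-mel spectrogram (illustrative parameters)."""
    mel = mel.clone()
    n_mels, n_frames = mel.shape
    for _ in range(num_freq_masks):
        width = int(torch.randint(0, max_freq_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_mels - width), (1,)))
        mel[start:start + width, :] = 0.0          # mask a band of mel bins
    for _ in range(num_time_masks):
        width = int(torch.randint(0, max_time_width + 1, (1,)))
        start = int(torch.randint(0, max(1, n_frames - width), (1,)))
        mel[:, start:start + width] = 0.0          # mask a span of frames
    return mel

augmented = spec_augment(torch.randn(128, 400))
print(augmented.shape)  # torch.Size([128, 400])
```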
Citations: 0