We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ requires 2.7x less newly written code and 6.7x less dependent code than ESPnet, while dramatically reducing Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
{"title":"ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration","authors":"Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe","doi":"arxiv-2409.09506","DOIUrl":"https://doi.org/arxiv-2409.09506","url":null,"abstract":"We introduce ESPnet-EZ, an extension of the open-source speech processing\u0000toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ\u0000focuses on two major aspects: (i) easy fine-tuning and inference of existing\u0000ESPnet models on various tasks and (ii) easy integration with popular deep\u0000neural network frameworks such as PyTorch-Lightning, Hugging Face transformers\u0000and datasets, and Lhotse. By replacing ESPnet design choices inherited from\u0000Kaldi with a Python-only, Bash-free interface, we dramatically reduce the\u0000effort required to build, debug, and use a new model. For example, to fine-tune\u0000a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the number of\u0000newly written code by 2.7x and the amount of dependent code by 6.7x while\u0000dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ\u0000is publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In accent conversion (also referred to as accented voice conversion), the goal is to convert speech from one accent to another while preserving speaker identity and semantic content. In this study, we formulate a novel method for creating multi-accented speech samples, that is, pairs of differently accented utterances from the same speaker, through text transliteration for training accent conversion systems. We begin by generating transliterated text with Large Language Models (LLMs), which is then fed into multilingual TTS models to synthesize accented English speech. As a reference system, we build a sequence-to-sequence model on the synthetic parallel corpus for accent conversion. We validate the proposed method for both native and non-native English speakers, and subjective and objective evaluations confirm our dataset's effectiveness for accent conversion studies.
{"title":"MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion","authors":"Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li","doi":"arxiv-2409.09352","DOIUrl":"https://doi.org/arxiv-2409.09352","url":null,"abstract":"In accented voice conversion or accent conversion, we seek to convert the\u0000accent in speech from one another while preserving speaker identity and\u0000semantic content. In this study, we formulate a novel method for creating\u0000multi-accented speech samples, thus pairs of accented speech samples by the\u0000same speaker, through text transliteration for training accent conversion\u0000systems. We begin by generating transliterated text with Large Language Models\u0000(LLMs), which is then fed into multilingual TTS models to synthesize accented\u0000English speech. As a reference system, we built a sequence-to-sequence model on\u0000the synthetic parallel corpus for accent conversion. We validated the proposed\u0000method for both native and non-native English speakers. Subjective and\u0000objective evaluations further validate our dataset's effectiveness in accent\u0000conversion studies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from the target domain, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires highly specialized skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio and the ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models on all metrics, showing great promise for domain-specific IVA applications.
{"title":"DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training","authors":"Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li","doi":"arxiv-2409.09289","DOIUrl":"https://doi.org/arxiv-2409.09289","url":null,"abstract":"Analyzing real-world multimodal signals is an essential and challenging task\u0000for intelligent voice assistants (IVAs). Mainstream approaches have achieved\u0000remarkable performance on various downstream tasks of IVAs with pre-trained\u0000audio models and text models. However, these models are pre-trained\u0000independently and usually on tasks different from target domains, resulting in\u0000sub-optimal modality representations for downstream tasks. Moreover, in many\u0000domains, collecting enough language-audio pairs is extremely hard, and\u0000transcribing raw audio also requires high professional skills, making it\u0000difficult or even infeasible to joint pre-training. To address these\u0000painpoints, we propose DSCLAP, a simple and effective framework that enables\u0000language-audio pre-training with only raw audio signal input. Specifically,\u0000DSCLAP converts raw audio signals into text via an ASR system and combines a\u0000contrastive learning objective and a language-audio matching objective to align\u0000the audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of\u0000in-vehicle domain audio. Empirical results on two downstream tasks show that\u0000while conceptually simple, DSCLAP significantly outperforms the baseline models\u0000in all metrics, showing great promise for domain-specific IVAs applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. This requires that, in scenes with complex sound sources, the voice assistant classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, jointly modeled from text and speech, has become the paradigm for device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection that frames the problem as a multi-view learning task, introducing unimodal views and a text-audio alignment view into the network alongside the multi-modal view. Experimental results show that M$^{3}$V significantly outperforms models trained using only single or multiple modalities and, for the first time, surpasses human judgment performance on ASR error data.
{"title":"M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection","authors":"Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li","doi":"arxiv-2409.09284","DOIUrl":"https://doi.org/arxiv-2409.09284","url":null,"abstract":"With the goal of more natural and human-like interaction with virtual voice\u0000assistants, recent research in the field has focused on full duplex interaction\u0000mode without relying on repeated wake-up words. This requires that in scenes\u0000with complex sound sources, the voice assistant must classify utterances as\u0000device-oriented or non-device-oriented. The dual-encoder structure, which is\u0000jointly modeled by text and speech, has become the paradigm of device-directed\u0000speech detection. However, in practice, these models often produce incorrect\u0000predictions for unaligned input pairs due to the unavoidable errors of\u0000automatic speech recognition (ASR).To address this challenge, we propose\u0000M$^{3}$V, a multi-modal multi-view approach for device-directed speech\u0000detection, which frames we frame the problem as a multi-view learning task that\u0000introduces unimodal views and a text-audio alignment view in the network\u0000besides the multi-modal. Experimental results show that M$^{3}$V significantly\u0000outperforms models trained using only single or multi-modality and surpasses\u0000human judgment performance on ASR error data for the first time.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving the permutation problem is essential for determined blind source separation (BSS). Existing methods, such as independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), tackle the permutation problem by modeling the co-occurrence of the frequency components of source signals. One of the remaining challenges in these methods is the block permutation problem, which may lead to poor separation results. In this paper, we propose a simple and effective technique for solving the block permutation problem. The proposed technique splits the entire frequency range into overlapping subbands and sequentially applies a BSS method (e.g., IVA, ILRMA, or any other method) to each subband. Since the splitting reduces the problem size, the BSS method can work effectively within each subband. The permutations between subbands are then aligned by using the separation result in one subband as the initial values for the other subbands. Experimental results showed that the proposed technique remarkably improved the separation performance without increasing the total computational cost.
{"title":"Subband Splitting: Simple, Efficient and Effective Technique for Solving Block Permutation Problem in Determined Blind Source Separation","authors":"Kazuki Matsumoto, Kohei Yatabe","doi":"arxiv-2409.09294","DOIUrl":"https://doi.org/arxiv-2409.09294","url":null,"abstract":"Solving the permutation problem is essential for determined blind source\u0000separation (BSS). Existing methods, such as independent vector analysis (IVA)\u0000and independent low-rank matrix analysis (ILRMA), tackle the permutation\u0000problem by modeling the co-occurrence of the frequency components of source\u0000signals. One of the remaining challenges in these methods is the block\u0000permutation problem, which may lead to poor separation results. In this paper,\u0000we propose a simple and effective technique for solving the block permutation\u0000problem. The proposed technique splits the entire frequencies into overlapping\u0000subbands and sequentially applies a BSS method (e.g., IVA, ILRMA, or any other\u0000method) to each subband. Since the problem size is reduced by the splitting,\u0000the BSS method can effectively work in each subband. Then, the permutations\u0000between the subbands are aligned by using the separation result in one subband\u0000as the initial values for the other subbands. Experimental results showed that\u0000the proposed technique remarkably improved the separation performance without\u0000increasing the total computational cost.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability. Explaining them is crucial for building trust in healthcare and security applications and for advancing the scientific understanding of the acoustic information they encode. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embedding dimensions and (ii) a subset of the dimensions identified as most important for predicting each emotion. If the subset of the most important dimensions predicts a given emotion better than all dimensions and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for that task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we show that the Energy, Frequency, Spectral, and Temporal categories of acoustic features provide diminishing information to SER in that order, demonstrating the utility of the probing-classifier method for relating embeddings to interpretable acoustic features.
{"title":"Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features","authors":"Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh","doi":"arxiv-2409.09511","DOIUrl":"https://doi.org/arxiv-2409.09511","url":null,"abstract":"Pre-trained deep learning embeddings have consistently shown superior\u0000performance over handcrafted acoustic features in speech emotion recognition\u0000(SER). However, unlike acoustic features with clear physical meaning, these\u0000embeddings lack clear interpretability. Explaining these embeddings is crucial\u0000for building trust in healthcare and security applications and advancing the\u0000scientific understanding of the acoustic information that is encoded in them.\u0000This paper proposes a modified probing approach to explain deep learning\u0000embeddings in the SER space. We predict interpretable acoustic features (e.g.,\u0000f0, loudness) from (i) the complete set of embeddings and (ii) a subset of the\u0000embedding dimensions identified as most important for predicting each emotion.\u0000If the subset of the most important dimensions better predicts a given emotion\u0000than all dimensions and also predicts specific acoustic features more\u0000accurately, we infer those acoustic features are important for the embedding\u0000model for the given task. We conducted experiments using the WavLM embeddings\u0000and eGeMAPS acoustic features as audio representations, applying our method to\u0000the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we\u0000demonstrate that Energy, Frequency, Spectral, and Temporal categories of\u0000acoustic features provide diminishing information to SER in that order,\u0000demonstrating the utility of the probing classifier method to relate embeddings\u0000to interpretable acoustic features.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In tandem with recent advances in foundation model research, there has been a surge of generative music AI applications within the past few years. As the idea of AI-generated or AI-augmented music becomes more mainstream, many researchers in the music AI community may be wondering what avenues of research are left. With regard to generative music models, we outline the current areas of research with significant room for exploration. First, we pose the question of the foundational representation of these generative models and investigate approaches towards explainability. Next, we discuss the current state of music datasets and their limitations. We then give an overview of different generative models, ways of evaluating them, and their computational constraints and limitations. Subsequently, we highlight applications of these generative models in extensions to multiple modalities and in integration with artists' workflows as well as music education systems. Finally, we survey the potential copyright implications of generative music and discuss strategies for protecting the rights of musicians. While not meant to be exhaustive, our survey calls attention to a variety of research directions enabled by music foundation models.
{"title":"Prevailing Research Areas for Music AI in the Era of Foundation Models","authors":"Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans","doi":"arxiv-2409.09378","DOIUrl":"https://doi.org/arxiv-2409.09378","url":null,"abstract":"In tandem with the recent advancements in foundation model research, there\u0000has been a surge of generative music AI applications within the past few years.\u0000As the idea of AI-generated or AI-augmented music becomes more mainstream, many\u0000researchers in the music AI community may be wondering what avenues of research\u0000are left. With regards to music generative models, we outline the current areas\u0000of research with significant room for exploration. Firstly, we pose the\u0000question of foundational representation of these generative models and\u0000investigate approaches towards explainability. Next, we discuss the current\u0000state of music datasets and their limitations. We then overview different\u0000generative models, forms of evaluating these models, and their computational\u0000constraints/limitations. Subsequently, we highlight applications of these\u0000generative models towards extensions to multiple modalities and integration\u0000with artists' workflow as well as music education systems. Finally, we survey\u0000the potential copyright implications of generative music and discuss strategies\u0000for protecting the rights of musicians. While it is not meant to be exhaustive,\u0000our survey calls to attention a variety of research directions enabled by music\u0000foundation models.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, limiting their ability to align the two modalities and leading to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences between the Transformer blocks of the audio and text streams. Moreover, current ATR methods mainly focus on learning a global-level representation, missing the fine-grained details needed to capture audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to grasp fine-grained audio-text semantic correlations. Additionally, we develop a confidence-aware (CA) module to estimate the confidence of each latent factor pair and adaptively aggregate cross-modal latent factors to achieve local semantic alignment. Experiments show that THA effectively boosts ATR performance, with the DCR approach further contributing consistent performance gains.
{"title":"Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation","authors":"Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou","doi":"arxiv-2409.09256","DOIUrl":"https://doi.org/arxiv-2409.09256","url":null,"abstract":"Most existing audio-text retrieval (ATR) approaches typically rely on a\u0000single-level interaction to associate audio and text, limiting their ability to\u0000align different modalities and leading to suboptimal matches. In this work, we\u0000present a novel ATR framework that leverages two-stream Transformers in\u0000conjunction with a Hierarchical Alignment (THA) module to identify multi-level\u0000correspondences of different Transformer blocks between audio and text.\u0000Moreover, current ATR methods mainly focus on learning a global-level\u0000representation, missing out on intricate details to capture audio occurrences\u0000that correspond to textual semantics. To bridge this gap, we introduce a\u0000Disentangled Cross-modal Representation (DCR) approach that disentangles\u0000high-dimensional features into compact latent factors to grasp fine-grained\u0000audio-text semantic correlations. Additionally, we develop a confidence-aware\u0000(CA) module to estimate the confidence of each latent factor pair and\u0000adaptively aggregate cross-modal latent factors to achieve local semantic\u0000alignment. Experiments show that our THA effectively boosts ATR performance,\u0000with the DCR approach further contributing to consistent performance gains.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children's behavioral changes over time during treatment. The standard protocol for this purpose is the Brief Observation of Social Communication Change (BOSCC), which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children's behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area rely heavily on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to collect speech samples in BOSCC interviews from an egocentric perspective using wearable sensors, and we explore pre-training on Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.
{"title":"Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling","authors":"Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan","doi":"arxiv-2409.09340","DOIUrl":"https://doi.org/arxiv-2409.09340","url":null,"abstract":"Autism spectrum disorder (ASD) is a neurodevelopmental condition\u0000characterized by challenges in social communication, repetitive behavior, and\u0000sensory processing. One important research area in ASD is evaluating children's\u0000behavioral changes over time during treatment. The standard protocol with this\u0000objective is BOSCC, which involves dyadic interactions between a child and\u0000clinicians performing a pre-defined set of activities. A fundamental aspect of\u0000understanding children's behavior in these interactions is automatic speech\u0000understanding, particularly identifying who speaks and when. Conventional\u0000approaches in this area heavily rely on speech samples recorded from a\u0000spectator perspective, and there is limited research on egocentric speech\u0000modeling. In this study, we design an experiment to perform speech sampling in\u0000BOSCC interviews from an egocentric perspective using wearable sensors and\u0000explore pre-training Ego4D speech samples to enhance child-adult speaker\u0000classification in dyadic interactions. Our findings highlight the potential of\u0000egocentric speech collection and pre-training to improve speaker classification\u0000accuracy.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present our system (denoted T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for VMC 2024 Track 1, which focused on accurately predicting the naturalness mean opinion score (MOS) of high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture differences among synthetic speech samples that are visible in their spectrograms. We first train two MOS predictors separately, one using the SSL-based features and one using the spectrogram-based features. We then fine-tune the two predictors for better MOS prediction using a fusion of the two extracted features. In VMC 2024 Track 1, our T05 system achieved first place in 7 of the 16 evaluation metrics and second place in the remaining 9, with a significant margin over the systems ranked third and below. We also report an ablation study investigating the essential factors of our system.
{"title":"The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech","authors":"Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari","doi":"arxiv-2409.09305","DOIUrl":"https://doi.org/arxiv-2409.09305","url":null,"abstract":"We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024.\u0000Our system was designed for the VMC 2024 Track 1, which focused on the accurate\u0000prediction of naturalness mean opinion score (MOS) for high-quality synthetic\u0000speech. In addition to a pretrained self-supervised learning (SSL)-based speech\u0000feature extractor, our system incorporates a pretrained image feature extractor\u0000to capture the difference of synthetic speech observed in speech spectrograms.\u0000We first separately train two MOS predictors that use either of an SSL-based or\u0000spectrogram-based feature. Then, we fine-tune the two predictors for better MOS\u0000prediction using the fusion of two extracted features. In the VMC 2024 Track 1,\u0000our T05 system achieved first place in 7 out of 16 evaluation metrics and\u0000second place in the remaining 9 metrics, with a significant difference compared\u0000to those ranked third and below. We also report the results of our ablation\u0000study to investigate essential factors of our system.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}