
Interspeech: Latest Publications

Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2236
Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas Maier

Magnetic Resonance Imaging (MRI) allows the analysis of speech production by capturing high-resolution images of the dynamic processes in the vocal tract. In clinical applications, combining MRI with synchronized speech recordings leads to improved patient outcomes, especially if a phonological-based approach is used for assessment. However, when audio signals are unavailable, the recognition accuracy of sounds decreases when using only MRI data. We propose a contrastive learning approach to improve the detection of phonological classes from MRI data when acoustic signals are not available at inference time. We demonstrate that frame-wise recognition of phonological classes improves from an F1 score of 0.74 to 0.85 when the contrastive loss approach is implemented. Furthermore, we show the utility of our approach in the clinical application of using such phonological classes to assess speech disorders in patients with tongue cancer, yielding promising results in the recognition task.
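
The abstract names a contrastive loss but gives no implementation details; below is a minimal PyTorch sketch of a frame-level contrastive (InfoNCE-style) alignment between paired MRI and audio embeddings, the general idea such a method relies on. The encoder outputs, embedding size, and temperature are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: pull each MRI-frame embedding toward its paired audio-frame
# embedding and away from the other frames in the batch, so an MRI-only branch can
# later predict phonological classes without audio at inference time.
import torch
import torch.nn.functional as F

def frame_contrastive_loss(mri_emb: torch.Tensor, audio_emb: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """mri_emb, audio_emb: (num_frames, dim) embeddings of time-aligned frames."""
    mri = F.normalize(mri_emb, dim=-1)
    audio = F.normalize(audio_emb, dim=-1)
    logits = mri @ audio.t() / temperature        # similarity of every MRI frame to every audio frame
    targets = torch.arange(mri.size(0))           # the diagonal pairs are the positives
    return F.cross_entropy(logits, targets)

# Toy usage with 8 paired frames and 128-dimensional embeddings.
mri_frames = torch.randn(8, 128, requires_grad=True)
audio_frames = torch.randn(8, 128)
loss = frame_contrastive_loss(mri_frames, audio_frames)
loss.backward()
print(float(loss))
```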

{"title":"Contrastive Learning Approach for Assessment of Phonological Precision in Patients with Tongue Cancer Using MRI Data.","authors":"Tomás Arias-Vergara, Paula Andrea Pérez-Toro, Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Jerry L Prince, Maria Schuster, Elmar Nöth, Jonghye Woo, Andreas Maier","doi":"10.21437/interspeech.2024-2236","DOIUrl":"10.21437/interspeech.2024-2236","url":null,"abstract":"<p><p>Magnetic Resonance Imaging (MRI) allows analyzing speech production by capturing high-resolution images of the dynamic processes in the vocal tract. In clinical applications, combining MRI with synchronized speech recordings leads to improved patient outcomes, especially if a phonological-based approach is used for assessment. However, when audio signals are unavailable, the recognition accuracy of sounds is decreased when using only MRI data. We propose a contrastive learning approach to improve the detection of phonological classes from MRI data when acoustic signals are not available at inference time. We demonstrate that frame-wise recognition of phonological classes improves from an f1 of 0.74 to 0.85 when the contrastive loss approach is implemented. Furthermore, we show the utility of our approach in the clinical application of using such phonological classes to assess speech disorders in patients with tongue cancer, yielding promising results in the recognition task.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"927-931"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671147/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142900847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection.
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2288
Nana Lin, Youxiang Zhu, Xiaohui Liang, John A Batsis, Caroline Summerour

Mild cognitive impairment (MCI) is a major public health concern due to its high risk of progressing to dementia. This study investigates the potential of detecting MCI with spontaneous voice assistant (VA) commands from 35 older adults in a controlled setting. Specifically, a command-generation task is designed with pre-defined intents for participants to freely generate commands that are more associated with cognitive ability than read commands. We develop MCI classification and regression models with audio, textual, intent, and multimodal fusion features. We find the command-generation task outperforms the command-reading task with an average classification accuracy of 82%, achieved by leveraging multimodal fusion features. In addition, generated commands correlate more strongly with memory and attention subdomains than read commands. Our results confirm the effectiveness of the command-generation task and imply the promise of using longitudinal in-home commands for MCI detection.
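
As a rough illustration of the multimodal fusion step described above, the sketch below concatenates per-participant audio, text, and intent features and cross-validates a standard classifier; the feature dimensions, classifier, and toy data are assumptions for illustration, not the study's setup.

```python
# Hypothetical sketch: early fusion of audio, textual, and intent features for MCI classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 35                                    # participants (toy data only)
audio_feats = rng.normal(size=(n, 40))    # e.g. pooled acoustic embeddings per speaker
text_feats = rng.normal(size=(n, 30))     # e.g. sentence-embedding features of the command transcripts
intent_feats = rng.normal(size=(n, 10))   # e.g. distribution over pre-defined intents
labels = rng.integers(0, 2, size=n)       # 0 = healthy control, 1 = MCI

fused = np.concatenate([audio_feats, text_feats, intent_feats], axis=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), fused, labels, cv=5, scoring="accuracy")
print("cross-validated accuracy:", scores.mean())
```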

{"title":"Analyzing Multimodal Features of Spontaneous Voice Assistant Commands for Mild Cognitive Impairment Detection.","authors":"Nana Lin, Youxiang Zhu, Xiaohui Liang, John A Batsis, Caroline Summerour","doi":"10.21437/interspeech.2024-2288","DOIUrl":"10.21437/interspeech.2024-2288","url":null,"abstract":"<p><p>Mild cognitive impairment (MCI) is a major public health concern due to its high risk of progressing to dementia. This study investigates the potential of detecting MCI with spontaneous voice assistant (VA) commands from 35 older adults in a controlled setting. Specifically, a command-generation task is designed with pre-defined intents for participants to freely generate commands that are more associated with cognitive ability than read commands. We develop MCI classification and regression models with audio, textual, intent, and multimodal fusion features. We find the command-generation task outperforms the command-reading task with an average classification accuracy of 82%, achieved by leveraging multimodal fusion features. In addition, generated commands correlate more strongly with memory and attention subdomains than read commands. Our results confirm the effectiveness of the command-generation task and imply the promise of using longitudinal in-home commands for MCI detection.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"3030-3034"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12419495/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145042387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2063
Si-Ioi Ng, Lingfeng Xu, Kimberly D Mueller, Julie Liss, Visar Berisha

Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use-cases. This is challenged by small clinical datasets, which preclude effective fine-tuning. We tested the efficacy of two models to classify participants using segmental (Wav2Vec2.0) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect sizes between HC and each risk factor. Cross-validation showed W2V2 consistently outperforms Trillsson. Mean macro-F1 scores of 54.1%, 63.5%, and 72.0% were found for classifying Aβ+, MCI, and dementia from HC, respectively. Repeatability analyses of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41. The reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.
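
Since the abstract evaluates frozen ("out-of-the-box") representations, the sketch below shows one plausible way to obtain fixed Wav2Vec2.0 utterance embeddings with the Hugging Face transformers library; the checkpoint name and mean-pooling choice are assumptions, not necessarily what the authors used.

```python
# Hypothetical sketch: frozen Wav2Vec2.0 embeddings as out-of-the-box features.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def utterance_embedding(waveform_16k):
    """Mean-pool the last hidden layer over time to get one fixed vector per recording."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# A downstream classifier (e.g. logistic regression with cross-validated macro-F1)
# would then be fit on these embeddings, with no fine-tuning of the foundation model.
```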

{"title":"Segmental and Suprasegmental Speech Foundation Models for Classifying Cognitive Risk Factors: Evaluating Out-of-the-Box Performance.","authors":"Si-Ioi Ng, Lingfeng Xu, Kimberly D Mueller, Julie Liss, Visar Berisha","doi":"10.21437/interspeech.2024-2063","DOIUrl":"10.21437/interspeech.2024-2063","url":null,"abstract":"<p><p>Speech foundation models are remarkably successful in various consumer applications, prompting their extension to clinical use-cases. This is challenged by small clinical datasets, which precludes effective fine-tuning. We tested the efficacy of two models to classify participants by segmental (Wav2Vec2.0) and suprasegmental (Trillsson) speech analysis windows. Analysis at both time scales has shown differences in the context of cognitive decline. Speakers were classified as healthy controls (HC), Amyloid-β+ (Aβ+), mild cognitive impairment (MCI), or dementia. A subset of W2V2 and Trillsson representations showed large effect size between HC and each risk factor. Cross-validation showed W2V2 consistently outperforms Trillsson. Mean macro-F1 of 54.1%, 63.5%, and 72.0% in were found for classifying Aβ+, MCI, and dementia from HC. Repeatability of Trillsson and W2V2 showed intraclass correlations of 0.30 and 0.41. Reliability of such models must be enhanced for clinical speech analysis and longitudinal tracking.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"917-921"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11884505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143574965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-1855
Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli

Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2], which lack efficiency and robustness and are sensitive to template design. In this paper, we propose YOLO-Stutter: the first end-to-end method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes imperfect speech-text alignment as input, followed by a spatial feature aggregator and a temporal dependency extractor to perform region-wise boundary and class predictions. We also introduce two dysfluency corpora, VCTK-Stutter and VCTK-TTS, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves state-of-the-art performance with a minimal number of trainable parameters on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.
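
The abstract outlines the architecture only at a high level; the sketch below is a much-simplified PyTorch stand-in for the region-wise prediction idea (a spatial feature aggregator, a temporal dependency extractor, and per-region boundary and class outputs). Layer choices, sizes, and the number of regions are invented for illustration.

```python
# Hypothetical sketch: predict, for each of R candidate regions, a (start, end) boundary
# in [0, 1] of the utterance and a dysfluency class (repetition, block, missing, ...).
import torch
import torch.nn as nn

class RegionWiseDysfluencyHead(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, num_regions=8, num_classes=6):
        super().__init__()
        self.num_regions, self.num_classes = num_regions, num_classes
        self.spatial = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())            # per-frame aggregation
        self.temporal = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)    # temporal dependencies
        self.boundary = nn.Linear(2 * hidden, num_regions * 2)
        self.classifier = nn.Linear(2 * hidden, num_regions * num_classes)

    def forward(self, alignment_feats):            # (batch, frames, feat_dim) alignment features
        x = self.spatial(alignment_feats)
        x, _ = self.temporal(x)
        pooled = x.mean(dim=1)                     # summarize the utterance
        bounds = torch.sigmoid(self.boundary(pooled)).view(-1, self.num_regions, 2)
        logits = self.classifier(pooled).view(-1, self.num_regions, self.num_classes)
        return bounds, logits

head = RegionWiseDysfluencyHead()
bounds, logits = head(torch.randn(2, 100, 256))
print(bounds.shape, logits.shape)                  # (2, 8, 2) and (2, 8, 6)
```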

{"title":"YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection.","authors":"Xuanru Zhou, Anshul Kashyap, Steve Li, Ayati Sharma, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Tempini, Jiachen Lian, Gopala Anumanchipalli","doi":"10.21437/interspeech.2024-1855","DOIUrl":"10.21437/interspeech.2024-1855","url":null,"abstract":"<p><p>Dysfluent speech detection is the bottleneck for disordered speech analysis and spoken language learning. Current state-of-the-art models are governed by rule-based systems [1, 2] which lack efficiency and robustness, and are sensitive to template design. In this paper, we propose <i>YOLO-Stutter</i>: a <i>first end-to-end</i> method that detects dysfluencies in a time-accurate manner. YOLO-Stutter takes <i>imperfect speech-text alignment</i> as input, followed by a spatial feature aggregator, and a temporal dependency extractor to perform region-wise boundary and class predictions. We also introduce two dysfluency corpus, <i>VCTK-Stutter</i> and <i>VCTK-TTS</i>, that simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation. Our end-to-end method achieves <i>state-of-the-art performance</i> with a <i>minimum number of trainable parameters</i> for on both simulated data and real aphasia speech. Code and datasets are open-sourced at https://github.com/rorizzz/YOLO-Stutter.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"937-941"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226351/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144577143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-2239
Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha

Automated goodness-of-pronunciation scores measure deviation from typical adult speech by first phonetically segmenting speech using forced alignment and then computing phoneme likelihoods. Care must be taken to distinguish between the impact of alignment error (a spurious signal) and true acoustic deviation on the automated score. Using mixed-effects modeling, we predict ΔPLLR, the difference between pronunciation scores computed using manual alignments (PLLR_m) and those computed using automatic forced alignments (PLLR_a). Pronunciation deviations and alignment error are both magnified in children's speech and may be influenced by factors such as phoneme position and phoneme type. Our methodology shows that alignment error has a moderate effect on ΔPLLR, and other variables have small to no effect. Manual PLLR closely matches automatically calculated PLLR following cross-utterance averaging. Thus, practical comparisons between child speakers should be very comparable across the two methods.
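
The paper's exact PLLR formulation is not reproduced here; the sketch below assumes a GOP-style score (mean log posterior of the target phone minus its strongest competitor over a segment) and shows how ΔPLLR would be computed for one phone from a manual versus a forced alignment.

```python
# Hypothetical sketch: a PLLR-like pronunciation score per phone segment, and the
# Delta-PLLR contrast between manual and automatic forced alignments of that phone.
import numpy as np

def pllr(frame_log_posteriors, segment, target_phone):
    """frame_log_posteriors: (frames, phones) log posteriors; segment: (start, end) frame indices."""
    seg = frame_log_posteriors[segment[0]:segment[1]]
    target = seg[:, target_phone]
    competitor = np.delete(seg, target_phone, axis=1).max(axis=1)
    return float((target - competitor).mean())

rng = np.random.default_rng(1)
log_post = np.log(rng.dirichlet(np.ones(40), size=200))   # toy posteriors: 200 frames, 40 phones

manual_seg, forced_seg = (50, 80), (47, 84)                # same phone, two different alignments
delta_pllr = pllr(log_post, manual_seg, target_phone=12) - pllr(log_post, forced_seg, target_phone=12)
print("Delta PLLR:", delta_pllr)
```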

{"title":"How Does Alignment Error Affect Automated Pronunciation Scoring in Children's Speech?","authors":"Prad Kadambi, Tristan Mahr, Lucas Annear, Henry Nomeland, Julie Liss, Katherine Hustad, Visar Berisha","doi":"10.21437/interspeech.2024-2239","DOIUrl":"10.21437/interspeech.2024-2239","url":null,"abstract":"<p><p>Automated goodness of pronunciation scores measure deviation from typical adult speech by first phonetically segmenting speech using forced alignment and then computing phoneme likelihoods. Care must be taken to distinguish between the impact of alignment error (a spurious signal) and true acoustic deviation on the automated score. Using mixed effects modeling, we predict <math><mi>Δ</mi> <mi>P</mi> <mi>L</mi> <mi>L</mi> <mi>R</mi></math> , the difference between pronunciation scores computed using manual alignment ( <math><mi>P</mi> <mi>L</mi> <mi>L</mi> <msub><mrow><mi>R</mi></mrow> <mrow><mi>m</mi></mrow> </msub> </math> ) versus computed using automatic forced alignments ( <math><mi>P</mi> <mi>L</mi> <mi>L</mi> <msub><mrow><mi>R</mi></mrow> <mrow><mi>a</mi></mrow> </msub> </math> ). Pronunciation deviations and alignment error are both magnified in children's speech and may be influenced by factors such as phoneme position and phoneme type. Our methodology shows that alignment error has a moderate effect on <math><mi>Δ</mi> <mi>P</mi> <mi>L</mi> <mi>L</mi> <mi>R</mi></math> , and other variables have small to no effect. Manual <math><mi>PLLR</mi></math> closely matches automatically calculated <math><mi>PLLR</mi></math> following cross utterance averaging. Thus, practical comparisons between child speakers should be very comparable across the two methods.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"5133-5137"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11977302/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143813014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction.
Pub Date : 2024-09-01 DOI: 10.21437/interspeech.2024-1484
Daryush D Mehta, Jarrad H Van Stan, Hamzeh Ghasemzadeh, Robert E Hillman

The most common types of voice disorders are associated with hyperfunctional voice use in daily life. Although current clinical practice uses measures from brief laboratory recordings to assess vocal function, it is unclear how these relate to an individual's habitual voice use. The purpose of this study was to quantify the correlation and offset between voice features computed from laboratory and ambulatory recordings in speakers with and without vocal hyperfunction. Features derived from a neck-surface accelerometer included estimates of sound pressure level, fundamental frequency, cepstral peak prominence, and spectral tilt. Whereas some measures from laboratory recordings correlated significantly with those captured during daily life, only approximately 6-52% of the actual variance was accounted for. Thus, brief voice assessments are quite limited in the extent to which they can accurately characterize the daily voice use of speakers with and without vocal hyperfunction.
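
For the correlation, offset, and variance-explained quantities mentioned above, a minimal sketch (with toy per-speaker values, not the study's data) could look like the following.

```python
# Hypothetical sketch: relate a laboratory voice measure to its ambulatory counterpart
# per speaker, reporting Pearson r, variance explained (r^2), and the mean offset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lab_cpp = rng.normal(15.0, 2.0, size=50)                          # e.g. cepstral peak prominence in the lab (dB)
ambulatory_cpp = 0.6 * lab_cpp + rng.normal(4.0, 1.5, size=50)    # weakly coupled daily-life values

r, p = stats.pearsonr(lab_cpp, ambulatory_cpp)
offset = float(np.mean(ambulatory_cpp - lab_cpp))
print(f"r = {r:.2f}, variance explained = {r**2:.1%}, mean offset = {offset:.2f} dB, p = {p:.3g}")
```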

{"title":"Comparing ambulatory voice measures during daily life with brief laboratory assessments in speakers with and without vocal hyperfunction.","authors":"Daryush D Mehta, Jarrad H Van Stan, Hamzeh Ghasemzadeh, Robert E Hillman","doi":"10.21437/interspeech.2024-1484","DOIUrl":"https://doi.org/10.21437/interspeech.2024-1484","url":null,"abstract":"<p><p>The most common types of voice disorders are associated with hyperfunctional voice use in daily life. Although current clinical practice uses measures from brief laboratory recordings to assess vocal function, it is unclear how these relate to an individual's habitual voice use. The purpose of this study was to quantify the correlation and offset between voice features computed from laboratory and ambulatory recordings in speakers with and without vocal hyperfunction. Features derived from a neck-surface accelerometer included estimates of sound pressure level, fundamental frequency, cepstral peak prominence, and spectral tilt. Whereas some measures from laboratory recordings correlated significantly with those captured during daily life, only approximately 6-52% of the actual variance was accounted for. Thus, brief voice assessments are quite limited in the extent to which they can accurately characterize the daily voice use of speakers with and without vocal hyperfunction.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2024 ","pages":"1455-1459"},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12014202/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144044125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance.
Pub Date : 2023-08-01 DOI: 10.21437/interspeech.2023-2115
Vanessa Richter, Michael Neumann, Jordan R Green, Brian Richburg, Oliver Roesler, Hardik Kothare, Vikram Ramanarayanan

We investigate the feasibility, task compliance and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). 53 people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, in performing speech tasks designed to probe various aspects of motor speech function while their audio and video was recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of different issues observed. We observed excellent compliance (98%) and audio (95.2%) and visual quality rates (84.8%), resulting in an overall yield of 80.8% recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by level of speech severity and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.
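
The overall yield reported above is the fraction of recordings that are simultaneously compliant and of acceptable audio and video quality; the toy sketch below shows that computation from per-recording ratings (the simulated ratings are assumed independent, so its yield only approximates the reported 80.8%).

```python
# Hypothetical sketch: joint yield from per-recording compliance and quality ratings.
import numpy as np

rng = np.random.default_rng(3)
n = 250
compliant = rng.random(n) < 0.98      # task compliance rate from the abstract
audio_ok = rng.random(n) < 0.952      # audio quality rate
video_ok = rng.random(n) < 0.848      # visual quality rate

usable = compliant & audio_ok & video_ok
print(f"overall yield: {usable.mean():.1%}")
```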

{"title":"Remote Assessment for ALS using Multimodal Dialog Agents: Data Quality, Feasibility and Task Compliance.","authors":"Vanessa Richter,&nbsp;Michael Neumann,&nbsp;Jordan R Green,&nbsp;Brian Richburg,&nbsp;Oliver Roesler,&nbsp;Hardik Kothare,&nbsp;Vikram Ramanarayanan","doi":"10.21437/interspeech.2023-2115","DOIUrl":"https://doi.org/10.21437/interspeech.2023-2115","url":null,"abstract":"<p><p>We investigate the feasibility, task compliance and audiovisual data quality of a multimodal dialog-based solution for remote assessment of Amyotrophic Lateral Sclerosis (ALS). 53 people with ALS and 52 healthy controls interacted with Tina, a cloud-based conversational agent, in performing speech tasks designed to probe various aspects of motor speech function while their audio and video was recorded. We rated a total of 250 recordings for audio/video quality and participant task compliance, along with the relative frequency of different issues observed. We observed excellent compliance (98%) and audio (95.2%) and visual quality rates (84.8%), resulting in an overall yield of 80.8% recordings that were both compliant and of high quality. Furthermore, recording quality and compliance were not affected by level of speech severity and did not differ significantly across end devices. These findings support the utility of dialog systems for remote monitoring of speech in ALS.</p>","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"2023 ","pages":"5441-5445"},"PeriodicalIF":0.0,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547018/pdf/nihms-1931217.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41174190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Pronunciation modeling of foreign words for Mandarin ASR by considering the effect of language transfer
Pub Date : 2022-10-07 DOI: 10.21437/Interspeech.2014-353
Lei Wang, R. Tong
One of the challenges in automatic speech recognition is foreign word recognition. It is observed that a speaker's pronunciation of a foreign word is influenced by his or her native language knowledge, a phenomenon known as the effect of language transfer. This paper focuses on examining the phonetic effect of language transfer in automatic speech recognition. A set of lexical rules is proposed to convert an English word into a Mandarin phonetic representation. In this way, a Mandarin lexicon can be augmented to include English words. Hence, the Mandarin ASR system becomes capable of recognizing English words without retraining or re-estimation of the acoustic model parameters. Using the lexicon derived from the proposed rules, the ASR performance on Mandarin-English mixed speech is improved without harming the accuracy on Mandarin-only speech. The proposed lexical rules are general and can be applied directly to unseen English words.
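
The rule set itself is defined in the paper and not reproduced here; the sketch below only illustrates the mechanism, with a few invented phone substitutions, of mapping an English pronunciation onto Mandarin phones so the word can be added to a Mandarin lexicon.

```python
# Hypothetical sketch: rule-based mapping from English phones to Mandarin phones.
# These example rules are invented for illustration; the actual rules are in the paper.
ENGLISH_TO_MANDARIN_PHONE = {
    "V": "W",     # /v/ has no Mandarin counterpart and is often realized as /w/
    "TH": "S",    # dental fricative approximated by /s/
    "R": "ER",    # rendered as the Mandarin rhotacized vowel
    "AE": "A",    # mapped to the closest Mandarin vowel
}

def to_mandarin_phones(english_phones):
    """Apply the substitution rules; phones without a rule are kept unchanged."""
    return [ENGLISH_TO_MANDARIN_PHONE.get(p, p) for p in english_phones]

# Augmenting the Mandarin lexicon with an entry for an English word:
print("VITAMIN ->", to_mandarin_phones(["V", "AY", "T", "AH", "M", "IH", "N"]))
```
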
{"title":"Pronunciation modeling of foreign words for Mandarin ASR by considering the effect of language transfer","authors":"Lei Wang, R. Tong","doi":"10.21437/Interspeech.2014-353","DOIUrl":"https://doi.org/10.21437/Interspeech.2014-353","url":null,"abstract":"One of the challenges in automatic speech recognition is foreign words recognition. It is observed that a speaker's pronunciation of a foreign word is influenced by his native language knowledge, and such phenomenon is known as the effect of language transfer. This paper focuses on examining the phonetic effect of language transfer in automatic speech recognition. A set of lexical rules is proposed to convert an English word into Mandarin phonetic representation. In this way, a Mandarin lexicon can be augmented by including English words. Hence, the Mandarin ASR system becomes capable to recognize English words without retraining or re-estimation of the acoustic model parameters. Using the lexicon that derived from the proposed rules, the ASR performance of Mandarin English mixed speech is improved without harming the accuracy of Mandarin only speech. The proposed lexical rules are generalized and they can be directly applied to unseen English words.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1443-1447"},"PeriodicalIF":0.0,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43367945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Automatic Speaker Verification System for Dysarthria Patients
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-375
Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad
Dysarthria is one of the most common speech communication disorders, associated with neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x-vectors for dysarthric speakers with varying speech intelligibility (low, medium and high). For that purpose, a baseline ASV system was trained on speech data from healthy speakers, since there is a severe scarcity of data from dysarthric speakers. To improve performance for dysarthric speakers, data augmentation based on duration modification is proposed in this study. Duration modification with several scaling factors was applied to the healthy training speech. An ASV system was then trained on healthy speech augmented with its duration-modified versions. This compensates for the substantial disparities in phone duration between normal and dysarthric speakers of varying speech intelligibility. Experimental evaluations presented in this study show that the proposed duration-modification-based data augmentation resulted in a relative improvement of 22% over the baseline. Further, a relative improvement of 26% was obtained for speakers with a high severity level of dysarthria.
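
As a concrete (though not necessarily identical) instance of the duration-modification augmentation described above, the sketch below time-stretches each healthy training utterance by a few scaling factors; the factors and the use of librosa are illustrative assumptions.

```python
# Hypothetical sketch: generate duration-modified copies of a training utterance
# for augmenting the x-vector ASV training data.
import librosa

def duration_modified_copies(wav_path, factors=(0.8, 0.9, 1.1, 1.2)):
    """Return the original waveform plus duration-modified versions of it."""
    y, sr = librosa.load(wav_path, sr=16000)
    copies = [y]
    for f in factors:
        # rate > 1 shortens the utterance, rate < 1 lengthens it
        copies.append(librosa.effects.time_stretch(y, rate=f))
    return copies, sr

# Each copy is then treated as an extra training utterance for the x-vector extractor.
```
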
{"title":"Automatic Speaker Verification System for Dysarthria Patients","authors":"Shinimol Salim, S. Shahnawazuddin, Waquar Ahmad","doi":"10.21437/interspeech.2022-375","DOIUrl":"https://doi.org/10.21437/interspeech.2022-375","url":null,"abstract":"Dysarthria is one of the most common speech communication disorder associate with a neurological damage that weakens the muscles necessary for speech. In this paper, we present our efforts towards developing an automatic speaker verification (ASV) system based on x -vectors for dysarthric speakers with varying speech intelligibility (low, medium and high). For that purpose, a baseline ASV system was trained on speech data from healthy speakers since there is severe scarcity of data from dysarthric speakers. To improve the performance with respect to dysarthric speakers, data augmentation based on duration modification is proposed in this study. Duration modification with several scaling factors was applied to healthy training speech. An ASV system was then trained on healthy speech augmented with its duration modified versions. It compen-sates for the substantial disparities in phone duration between normal and dysarthric speakers of varying speech intelligibilty. Experiment evaluations presented in this study show that proposed duration-modification-based data augmentation resulted in a relative improvement of 22% over the baseline. Further to that, a relative improvement of 26% was obtained in the case of speakers with high severity level of dysarthria.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5070-5074"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44912875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10401
Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun
A fast, efficient, and accurate detection method for COVID-19 remains a critical challenge. Many cough-based COVID-19 detection studies have shown competitive results using artificial intelligence. However, the lack of analysis of the vocalization characteristics of cough sounds limits further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize the dynamic information of cough sounds in the time and frequency domains. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches, pre-trained on a large amount of unlabeled cough data, is proposed for classification. Experimental results show that our proposed method achieves state-of-the-art performance on the Coswara dataset for COVID-19 detection. The results on an external clinical dataset, Virufy, also show the better generalization ability of our proposed method.
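
The exact feature definitions are given in the paper; the sketch below is only a loose illustration of the two ideas (a time-frequency differential of the log spectrogram, and an energy ratio between two cough phases), with the phase boundary and summary statistics chosen arbitrarily.

```python
# Hypothetical sketch: a time-frequency differential feature and an energy ratio feature
# from a cough recording. Phase segmentation and statistics are illustrative assumptions.
import numpy as np
import librosa

def cough_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=128))
    log_spec = np.log(spec + 1e-8)

    # Differentials along time and frequency capture how cough energy moves dynamically.
    d_time = np.diff(log_spec, axis=1)
    d_freq = np.diff(log_spec, axis=0)
    tf_differential = np.array([d_time.mean(), d_time.std(), d_freq.mean(), d_freq.std()])

    # Energy ratio between an assumed explosive phase (first half) and the remainder.
    energy = (spec ** 2).sum(axis=0)
    mid = len(energy) // 2
    energy_ratio = energy[:mid].sum() / (energy[mid:].sum() + 1e-8)
    return tf_differential, energy_ratio
```
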
{"title":"Robust Cough Feature Extraction and Classification Method for COVID-19 Cough Detection Based on Vocalization Characteristics","authors":"Xueshuai Zhang, Jiakun Shen, J. Zhou, Pengyuan Zhang, Yonghong Yan, Zhihua Huang, Yanfen Tang, Yu Wang, Fujie Zhang, Shenmin Zhang, Aijun Sun","doi":"10.21437/interspeech.2022-10401","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10401","url":null,"abstract":"A fast, efficient and accurate detection method of COVID-19 remains a critical challenge. Many cough-based COVID-19 detection researches have shown competitive results through artificial intelligence. However, the lack of analysis on vocalization characteristics of cough sounds limits the further improvement of detection performance. In this paper, we propose two novel acoustic features of cough sounds and a convolutional neural network structure for COVID-19 detection. First, a time-frequency differential feature is proposed to characterize dynamic information of cough sounds in time and frequency domain. Then, an energy ratio feature is proposed to calculate the energy difference caused by the phonation characteristics in different cough phases. Finally, a convolutional neural network with two parallel branches which is pre-trained on a large amount of unlabeled cough data is proposed for classification. Experiment results show that our proposed method achieves state-of-the-art performance on Coswara dataset for COVID-19 detection. The results on an external clinical dataset Virufy also show the better generalization ability of our proposed method. Copyright © 2022 ISCA.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2168-2172"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45011547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2