FluencyBank Timestamped: An Updated Data Set for Disfluency Detection and Automatic Intended Speech Recognition.

Journal of Speech Language and Hearing Research (IF 2.2; Zone 2, Medicine; Q1, AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY). Pub Date: 2024-11-07. Epub Date: 2024-10-08. DOI: 10.1044/2024_JSLHR-24-00070

Amrit Romana, Minxue Niu, Matthew Perez, Emily Mower Provost

Pages: 4203-4215. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12379651/pdf/


Abstract

Purpose: This work introduces updated transcripts, disfluency annotations, and word timings for FluencyBank, which we refer to as FluencyBank Timestamped. This data set will enable the thorough analysis of how speech processing models (such as speech recognition and disfluency detection models) perform when evaluated with typical speech versus speech from people who stutter (PWS).

Method: We update the FluencyBank data set, which includes audio recordings from adults who stutter, to explore the robustness of speech processing models. Our update (semi-automated with manual review) includes new transcripts with timestamps and disfluency labels corresponding to each token in the transcript. Our disfluency labels capture typical disfluencies (filled pauses, repetitions, revisions, and partial words), and we explore how speech model performance compares for Switchboard (typical speech) and FluencyBank Timestamped. We present benchmarks for three speech tasks: intended speech recognition, text-based disfluency detection, and audio-based disfluency detection. For the first task, we evaluate how well Whisper performs for intended speech recognition (i.e., transcribing speech without disfluencies). For the remaining two tasks, we evaluate how well a Bidirectional Encoder Representations from Transformers (BERT) text-based model and a Whisper audio-based model perform for disfluency detection. We select these models, BERT and Whisper, as they have shown high accuracies on a broad range of tasks in their language and audio domains, respectively.
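To make the token-level annotation concrete, the following is a minimal sketch of how a transcript with timestamps and one disfluency label per token could be represented, and how the intended speech could be derived from it by dropping disfluent tokens. The `Token` fields, the label strings, and the `intended_speech` helper are illustrative assumptions; the abstract does not specify the released file format.

```python
from dataclasses import dataclass

# Hypothetical label names: the abstract lists these four disfluency types,
# but not how they are encoded in the released annotations.
DISFLUENCY_LABELS = {"filled_pause", "repetition", "revision", "partial_word"}


@dataclass
class Token:
    word: str
    start: float  # token onset in seconds (assumed unit)
    end: float    # token offset in seconds (assumed unit)
    label: str = "fluent"


def intended_speech(tokens):
    """Return the words that remain after dropping every disfluent token."""
    return [t.word for t in tokens if t.label not in DISFLUENCY_LABELS]


# Made-up utterance for illustration only; not taken from the data set.
example = [
    Token("um", 0.00, 0.31, "filled_pause"),
    Token("I", 0.40, 0.52, "repetition"),
    Token("I", 0.60, 0.71),
    Token("wa-", 0.75, 0.90, "partial_word"),
    Token("want", 0.95, 1.20),
    Token("tea", 1.25, 1.60),
]

print(intended_speech(example))  # ['I', 'want', 'tea']
```

Keeping the label on each token, rather than storing a separate cleaned transcript, lets the same annotation serve both as a reference for intended speech recognition and as a target for token-level disfluency detection.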

Results: For the transcription task, we calculate an intended speech word error rate (isWER) between the model's output and the speaker's intended speech (i.e., speech without disfluencies). We find that isWER is comparable between Switchboard and FluencyBank Timestamped, but that Whisper transcribes filled pauses and partial words at higher rates in the latter data set. Within FluencyBank Timestamped, isWER increases with stuttering severity. For the disfluency detection tasks, we find that the models detect filled pauses, revisions, and partial words relatively well in FluencyBank Timestamped, but performance drops substantially for repetitions because the models are unable to generalize to the different types of repetitions (e.g., multiple repetitions and sound repetitions) from PWS. We hope that FluencyBank Timestamped will allow researchers to explore closing performance gaps between typical speech and speech from PWS.
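As an illustration of the isWER idea, the sketch below scores a model transcript against the intended-speech reference using the standard word error rate (word-level edit distance divided by the number of reference words). Using the standard WER formula here is an assumption, since the abstract does not give the exact formulation; the function names and the lowercase, whitespace-based normalization are likewise hypothetical.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def is_wer(intended_ref, hypothesis):
    """Assumed isWER: standard WER with the disfluency-free transcript as reference."""
    ref = intended_ref.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


# A transcribed filled pause counts as one insertion against a 3-word reference.
print(is_wer("I want tea", "um I want tea"))
```

Under this formulation, any filled pause or partial word that the recognizer transcribes is scored as an error, because the reference contains only the intended words.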

Conclusions: Our analysis shows that there are gaps in speech recognition and disfluency detection performance between typical speech and speech from PWS. We hope that FluencyBank Timestamped will contribute to more advancements in training robust speech processing models.


Source journal

Journal of Speech Language and Hearing Research AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY-REHABILITATION

CiteScore: 4.10

Self-citation rate: 19.20%

Articles published: 538

Review time: 4-8 weeks

About the journal: Mission: JSLHR publishes peer-reviewed research and other scholarly articles on the normal and disordered processes in speech, language, hearing, and related areas such as cognition, oral-motor function, and swallowing. The journal is an international outlet for both basic research on communication processes and clinical research pertaining to screening, diagnosis, and management of communication disorders as well as the etiologies and characteristics of these disorders. JSLHR seeks to advance evidence-based practice by disseminating the results of new studies as well as providing a forum for critical reviews and meta-analyses of previously published work. Scope: The broad field of communication sciences and disorders, including speech production and perception; anatomy and physiology of speech and voice; genetics, biomechanics, and other basic sciences pertaining to human communication; mastication and swallowing; speech disorders; voice disorders; development of speech, language, or hearing in children; normal language processes; language disorders; disorders of hearing and balance; psychoacoustics; and anatomy and physiology of hearing.