Crowd-sourcing for difficult transcription of speech
J. Williams, I. D. Melamed, Tirso Alonso, B. Hollister, J. Wilpon
2011 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), December 2011
DOI: 10.1109/ASRU.2011.6163988
Citations: 34
Abstract
Crowd-sourcing is a promising method for fast and cheap transcription of large volumes of speech data. However, this method cannot achieve the accuracy of expert transcribers on speech that is difficult to transcribe. Faced with such speech data, we developed three new methods of crowd-sourcing, which allow explicit trade-offs among precision, recall, and cost. The methods are: incremental redundancy, treating ASR as a transcriber, and using a regression model to predict transcription reliability. Even though the accuracy of individual crowd-workers is only 55% on our data, our best method achieves 90% accuracy on 93% of the utterances, using only 1.3 crowd-worker transcriptions per utterance on average. When forced to transcribe all utterances, our best method matches the accuracy of previous crowd-sourcing methods using only one third as many transcriptions. We also study the effects of various task design factors on transcription latency and accuracy, some of which have not been reported before.
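The abstract names incremental redundancy as one of the three methods but does not spell out its procedure. A common reading is that transcriptions are solicited one at a time per utterance, stopping as soon as enough workers agree, which is how an average cost as low as 1.3 transcriptions per utterance becomes possible. The sketch below is a hypothetical Python illustration under that reading, not the authors' implementation: the `request_transcription` callback, the two-match stopping rule, and the text normalization step are all assumptions.

```python
from collections import Counter

def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace
    # so that trivially different transcripts can match.
    return " ".join(text.lower().split())

def incremental_redundancy(request_transcription, max_workers: int = 5):
    """Collect crowd transcriptions for one utterance, one at a time,
    stopping as soon as two independent workers agree (an assumed
    stopping rule, not taken from the paper).

    request_transcription: callable returning one worker's transcript.
    max_workers: cap on per-utterance cost.
    Returns (transcript, n_used), or (None, n_used) if no agreement
    is reached within the budget (the utterance is left unreliable).
    """
    counts = Counter()
    for n in range(1, max_workers + 1):
        t = normalize(request_transcription())
        counts[t] += 1
        if counts[t] >= 2:           # two matching transcripts: accept
            return t, n
    return None, max_workers         # no agreement within budget
```

Under this reading, easy utterances stop after two cheap matching transcripts, while the budget cap and the "no agreement" outcome give the explicit precision/recall/cost trade-off the abstract describes: raising `max_workers` increases recall at higher cost.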