Improving ASR Output for Endangered Language Documentation

Workshop on Spoken Language Technologies for Under-resourced Languages Pub Date : 2018-08-29 DOI:10.21437/SLTU.2018-39

Robert Jimerson, Kruthika Simha, R. Ptucha, Emily Tucker Prud'hommeaux

引用次数: 12

Abstract

Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.

查看原文