
Workshop on Spoken Language Technologies for Under-resourced Languages: Latest Publications

Visually Grounded Cross-Lingual Keyword Spotting in Speech
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-53
H. Kamper, Michael Roth
{"title":"Visually Grounded Cross-Lingual Keyword Spotting in Speech","authors":"H. Kamper, Michael Roth","doi":"10.21437/SLTU.2018-53","DOIUrl":"https://doi.org/10.21437/SLTU.2018-53","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122428080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Prosodic Analysis of Non-Native South Indian English Speech
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-15
Radha Krishna Guntur, R. Krishnan, V. K. Mittal
Investigations of linguistic prosody in non-native English speech by South Indians were carried out using a database collected specifically for this study. Prosodic differences between native and non-native speech samples of three regional language groups (Kannada, Tamil, and Telugu) were evaluated and compared. This information is useful in applications such as native language identification. We observe that the mean pitch and the overall variation of the pitch contour are higher in non-native English speech for all three groups of speakers, indicating accommodation of speaking manner. The study finds that the dynamic variation of pitch is smallest in English speech by native Kannada speakers: the increase in the standard deviation of the pitch contour for non-native English is only about 3.7% on average for Kannada speakers, compared with 9.5% and 27% for native Tamil and Telugu speakers, respectively.
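The reported figures are statistics over pitch (F0) contours: a per-group mean and the relative increase in contour standard deviation between native and non-native speech. A toy sketch of that computation, assuming F0 contours with unvoiced frames marked as 0; the function names are illustrative, not from the paper:

```python
import numpy as np

def pitch_stats(f0_contour):
    # Mean and standard deviation of an F0 contour (Hz), ignoring
    # unvoiced frames, which pitch trackers commonly emit as 0.
    f0 = np.asarray(f0_contour, dtype=float)
    f0 = f0[f0 > 0]
    return f0.mean(), f0.std()

def std_increase_percent(native_f0, nonnative_f0):
    # Relative increase in pitch-contour standard deviation, the
    # per-group quantity reported above (3.7%, 9.5%, 27%).
    _, sd_native = pitch_stats(native_f0)
    _, sd_nonnative = pitch_stats(nonnative_f0)
    return 100.0 * (sd_nonnative - sd_native) / sd_native

# Toy contours; real ones would come from a pitch tracker over the database.
native = 120 + 10 * np.sin(np.linspace(0, 6, 200))
nonnative = 125 + 13 * np.sin(np.linspace(0, 6, 200))
print(std_increase_percent(native, nonnative))  # ~30 in this toy example
```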
Citations: 4
Post-Processing Using Speech Enhancement Techniques for Unit Selection and Hidden Markov Model Based Low Resource Language Marathi Text-to-Speech System
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-20
Sangramsing Kayte, Monica R. Mundada
{"title":"Post-Processing Using Speech Enhancement Techniques for Unit Selection and Hidden Markov Model Based Low Resource Language Marathi Text-to-Speech System","authors":"Sangramsing Kayte, Monica R. Mundada","doi":"10.21437/SLTU.2018-20","DOIUrl":"https://doi.org/10.21437/SLTU.2018-20","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121324401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
IIITH-ILSC Speech Database for Indian Language Identification
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-12
R. Vuddagiri, K. Gurugubelli, P. Jain, Hari Krishna Vydana, A. Vuppala
This work focuses on the development of speech data comprising 23 Indian languages for developing language identification (LID) systems. Large amounts of data are a prerequisite for developing state-of-the-art LID systems, and with this motivation the task of developing a multilingual speech corpus for Indian languages has been initiated. This paper describes the composition of the data and the performance of various LID systems developed using it. Mel-frequency cepstral features are used as the representation for language identification, and state-of-the-art LID systems are built using i-vector, deep neural network (DNN), and deep neural network with attention (DNN-WA) models. In terms of equal error rate, the i-vector, DNN, and DNN-WA systems achieve 17.77%, 17.95%, and 15.18%, respectively; the attention-based model thus outperforms both the i-vector and DNN models.
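Equal error rate (EER), the metric used above, is the operating point where the false-acceptance and false-rejection rates coincide. A common way to estimate it from trial scores, sketched here with synthetic scores standing in for a real LID back-end:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # Sweep the decision threshold and locate the point where the
    # false-positive rate crosses the false-negative rate (1 - TPR).
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0

# Synthetic target (label 1) and non-target (label 0) trial scores.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(equal_error_rate(labels, scores))  # ~0.16 for these toy distributions
```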
Citations: 17
A Human Quality Text to Speech System for Sinhala
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-33
L. Nanayakkara, Chamila Liyanage, Pubudu Tharaka Viswakula, Thilini Nagungodage, Randil Pushpananda, R. Weerasinghe
This paper proposes an approach to implementing a text-to-speech system for the Sinhala language using the MaryTTS framework. A set of rules for mapping text to sound was identified and used together with a unit-selection mechanism. The datasets for this study were gathered from newspaper articles, and the corresponding sentences were recorded by a professional speaker. A user-level evaluation was conducted with 20 participants: the intelligibility and naturalness of the developed Sinhala TTS system each scored approximately 70%, and the overall speech quality scored approximately 60%.
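The rule-based text-to-sound mapping described can be pictured as an ordered list of grapheme-to-phoneme substitutions applied longest-match-first. The sketch below illustrates the mechanism only; the rules and symbols are invented placeholders, not the paper's actual Sinhala rules:

```python
# Ordered grapheme-to-phoneme rules; longer graphemes listed first so they
# win over their single-letter prefixes. Symbols are invented placeholders.
RULES = [
    ("th", "t̪"),
    ("a", "ə"),
    ("k", "k"),
    ("m", "m"),
]

def letters_to_sounds(word):
    phones, i = [], 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phones.append(phoneme)
                i += len(grapheme)
                break
        else:
            phones.append(word[i])  # pass unmatched characters through
            i += 1
    return phones

print(letters_to_sounds("katha"))  # ['k', 'ə', 't̪', 'ə']
```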
Citations: 7
Predicting the Features of World Atlas of Language Structures from Speech
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-52
Alexander Gutkin, Tatiana Merkulova, Martin Jansche
Citations: 0
Low-resource Tibetan Dialect Acoustic Modeling Based on Transfer Learning
Pub Date: 2018-08-29, DOI: 10.21437/SLTU.2018-2
Jinghao Yan, Zhiqiang Lv, Shen Huang, Hongzhi Yu
{"title":"Low-resource Tibetan Dialect Acoustic Modeling Based on Transfer Learning","authors":"Jinghao Yan, Zhiqiang Lv, Shen Huang, Hongzhi Yu","doi":"10.21437/SLTU.2018-2","DOIUrl":"https://doi.org/10.21437/SLTU.2018-2","url":null,"abstract":"","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125047378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Incorporating Speaker Normalizing Capabilities to an End-to-End Speech Recognition System
Pub Date: 2018-08-29, DOI: 10.21437/sltu.2018-36
Hari Krishna Vydana, Sivanand Achanta, A. Vuppala
Speaker normalization is a crucial aspect of an automatic speech recognition (ASR) system: it is employed to reduce the performance drop caused by speaker variability. Traditional speaker normalization methods are mostly linear transforms over the input data estimated per speaker, and such transforms are effective when sufficient data is available. In practical scenarios, however, only a single utterance from the test speaker may be accessible. The present study explores speaker normalization methods for end-to-end speech recognition systems that can be applied even when only a single utterance from an unseen speaker is available. We hypothesize that, by suitably providing information about the speaker's identity while training an end-to-end neural network, the capability to normalize speaker variability can be incorporated into an ASR system. The efficiency of these normalization methods depends on the representation used for unseen speakers. In this work, the identity of a training speaker is represented in two ways: i) a one-hot speaker code, and ii) a weighted combination of all training speakers' identities. Unseen speakers from the test set are represented by a weighted combination of training-speaker representations. The two approaches reduce the word error rate (WER) by 0.6% and 1.3%, respectively, on the WSJ corpus.
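A sketch of the two speaker representations described: a one-hot code for training speakers and a weighted combination for unseen test speakers, with the code appended to every acoustic frame fed to the network. Array shapes and the similarity input are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def speaker_code(speaker_id, num_train_speakers):
    # Approach i): a one-hot identity code for a training speaker.
    code = np.zeros(num_train_speakers)
    code[speaker_id] = 1.0
    return code

def unseen_speaker_code(similarities):
    # Approach ii): represent an unseen test speaker as a softmax-weighted
    # combination over training-speaker identities; `similarities` stands in
    # for per-training-speaker scores estimated from one test utterance.
    w = np.exp(similarities - similarities.max())
    return w / w.sum()

def append_code(features, code):
    # Concatenate the speaker code to every acoustic frame.
    return np.hstack([features, np.tile(code, (features.shape[0], 1))])

feats = np.random.randn(100, 40)                      # 100 frames, 40-dim features
print(append_code(feats, speaker_code(3, 10)).shape)  # (100, 50)
print(unseen_speaker_code(np.array([0.1, 2.0, -1.0])))
```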
Citations: 1
A small Griko-Italian speech translation corpus
Pub Date: 2018-07-27, DOI: 10.21437/SLTU.2018-8
Marcely Zanon Boito, Antonios Anastasopoulos, M. Lekakou, A. Villavicencio, L. Besacier
This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 2 hours of speech) that have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses, as well as pseudo-phones generated by an automatic unit discovery method. We detail how the corpus was collected, cleaned, and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.
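One plausible in-memory representation of a corpus entry with the annotations listed above (audio, transcription, translation, word-level alignments, glosses). The field layout and example values are hypothetical; the released corpus defines its own format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedUtterance:
    # Hypothetical field layout for one corpus entry.
    audio_path: str
    griko_words: List[str]                    # transcription tokens
    italian_words: List[str]                  # translation tokens
    griko_spans: List[Tuple[float, float]]    # (start_s, end_s) per Griko word
    italian_spans: List[Tuple[float, float]]  # speech-to-translation alignment
    glosses: List[str]                        # word-level glosses

utt = AlignedUtterance(
    audio_path="utt_0001.wav",
    griko_words=["word1", "word2"],
    italian_words=["parola1", "parola2"],
    griko_spans=[(0.00, 0.42), (0.42, 0.61)],
    italian_spans=[(0.00, 0.42), (0.42, 0.61)],
    glosses=["gloss1", "gloss2"],
)
print(utt.italian_words)
```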
Citations: 12
Automatic Speech Recognition for Humanitarian Applications in Somali
Pub Date: 2018-07-23, DOI: 10.21437/SLTU.2018-5
Raghav Menon, A. Biswas, A. Saeb, John Quinn, T. Niesler
We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hours of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), as well as the perturbation of acoustic data, are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs), and bidirectional long short-term memory networks (BLSTMs) to achieve a word error rate of 53.75%.
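Acoustic-data perturbation for augmentation is often realized as speed perturbation: resampling each waveform at factors such as 0.9 and 1.1 to create slower and faster copies. A self-contained numpy sketch of that idea; the paper does not specify its exact perturbation recipe:

```python
import numpy as np

def speed_perturb(waveform, factor):
    # Resample the waveform on a stretched/compressed time axis; playing the
    # result at the original rate yields faster (factor > 1) or slower
    # (factor < 1) speech, a standard ASR augmentation recipe.
    n_out = int(len(waveform) / factor)
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 220 * t)          # 1 s toy signal
faster = speed_perturb(audio, 1.1)           # shorter, higher-pitched copy
slower = speed_perturb(audio, 0.9)           # longer, lower-pitched copy
print(len(audio), len(faster), len(slower))  # 16000 14545 17777
```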
Citations: 4