
2012 IEEE Spoken Language Technology Workshop (SLT): Latest publications

A comparison-based approach to mispronunciation detection
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424254
Ann Lee, James R. Glass
The task of mispronunciation detection for language learning is typically accomplished via automatic speech recognition (ASR). Unfortunately, less than 2% of the world's languages have an ASR capability, and the conventional process of creating an ASR system requires large quantities of expensive, annotated data. In this paper we report on our efforts to develop a comparison-based framework for detecting word-level mispronunciations in nonnative speech. Dynamic time warping (DTW) is carried out between a student's (non-native speaker) utterance and a teacher's (native speaker) utterance, and we focus on extracting word-level and phone-level features that describe the degree of mis-alignment in the warping path and the distance matrix. Experimental results on a Chinese University of Hong Kong (CUHK) nonnative corpus show that the proposed framework improves the relative performance on a mispronounced word detection task by nearly 50% compared to an approach that only considers DTW alignment scores.
Citations: 48
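The core operation this abstract describes, dynamic time warping between a student's and a teacher's utterance, can be sketched in a few lines. This is a minimal illustration, not the authors' system: frames are scalars rather than acoustic feature vectors, and the distance function is a plain absolute difference.

```python
import math

def dtw(student, teacher, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping between two feature sequences.

    Returns the total alignment cost and the warping path as a list of
    (i, j) index pairs. Real systems would use MFCC vectors and a
    vector distance instead of scalar frames.
    """
    n, m = len(student), len(teacher)
    # cost[i][j] = best cumulative cost aligning student[:i+1] with teacher[:j+1]
    cost = [[math.inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    cost[0][0] = dist(student[0], teacher[0])
    for i in range(n):
        for j in range(m):
            if i == j == 0:
                continue
            best, move = math.inf, None
            for di, dj in ((-1, 0), (0, -1), (-1, -1)):
                pi, pj = i + di, j + dj
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, move = cost[pi][pj], (pi, pj)
            cost[i][j] = best + dist(student[i], teacher[j])
            back[i][j] = move
    # trace the warping path back from the end of both sequences
    path, ij = [], (n - 1, m - 1)
    while ij is not None:
        path.append(ij)
        ij = back[ij[0]][ij[1]]
    return cost[n - 1][m - 1], path[::-1]
```

The paper's word- and phone-level features would then be computed from the shape of `path` (degree of misalignment) and from the underlying distance matrix, rather than from the alignment cost alone.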
Word segmentation through cross-lingual word-to-phoneme alignment
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424202
Felix Stahlberg, Tim Schlippe, S. Vogel, Tanja Schultz
We present our new alignment model Model 3P for cross-lingual word-to-phoneme alignment, and show that unsupervised learning of word segmentation is more accurate when information of another language is used. Word segmentation with cross-lingual information is highly relevant to bootstrap pronunciation dictionaries from audio data for Automatic Speech Recognition, bypass the written form in Speech-to-Speech Translation or build the vocabulary of an unseen language, particularly in the context of under-resourced languages. Using Model 3P for the alignment between English words and Spanish phonemes outperforms a state-of-the-art monolingual word segmentation approach [1] on the BTEC corpus [2] by up to 42% absolute in F-Score on the phoneme level and a GIZA++ alignment based on IBM Model 3 by up to 17%.
Citations: 28
A reranking approach for recognition and classification of speech input in conversational dialogue systems
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424196
Fabrizio Morbini, Kartik Audhkhasi, Ron Artstein, Maarten Van Segbroeck, Kenji Sagae, P. Georgiou, D. Traum, Shrikanth S. Narayanan
We address the challenge of interpreting spoken input in a conversational dialogue system with an approach that aims to exploit the close relationship between the tasks of speech recognition and language understanding through joint modeling of these two tasks. Instead of using a standard pipeline approach where the output of a speech recognizer is the input of a language understanding module, we merge multiple speech recognition and utterance classification hypotheses into one list to be processed by a joint reranking model. We obtain substantially improved performance in language understanding in experiments with thousands of user utterances collected from a deployed spoken dialogue system.
Citations: 38
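The merge-and-rerank idea can be illustrated with a toy function. The hypothesis tuple layout and the fixed weights below are assumptions for illustration only; the paper learns a joint reranking model rather than combining scores with hand-set weights.

```python
def rerank(hypotheses, w_asr=0.6, w_nlu=0.4):
    """Merge hypotheses from several recognizers/classifiers into one
    list and re-rank them with a joint score.

    Each hypothesis is (transcript, asr_score, nlu_label, nlu_confidence).
    Returns (transcript, label) pairs, best joint score first.
    """
    scored = [
        (w_asr * asr + w_nlu * nlu, text, label)
        for text, asr, label, nlu in hypotheses
    ]
    scored.sort(reverse=True)  # highest joint score first
    return [(text, label) for _, text, label in scored]
```

The point of the joint model is visible even in this sketch: a hypothesis with a slightly worse ASR score can win if the language-understanding module is much more confident in it.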
Speaker diarization and linking of large corpora
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424236
Marc Ferras, Hervé Bourlard
Performing speaker diarization of a collection of recordings, where speakers are uniquely identified across the database, is a challenging task. In this context, inter-session variability compensation and reasonable computation times are essential to be addressed. In this paper we propose a two-stage system composed of speaker diarization and speaker linking modules that are able to perform data set wide speaker diarization and that handle both large volumes of data and inter-session variability compensation. The speaker linking system agglomeratively clusters speaker factor posterior distributions, obtained within the Joint Factor Analysis framework, that model the speaker clusters output by a standard speaker diarization system. Therefore, the technique inherently compensates the channel variability effects from recording to recording within the database. A threshold is used to obtain meaningful speaker clusters by cutting the dendrogram obtained by the agglomerative clustering. We show how the Hotelling T-squared statistic is an interesting distance measure for this task and input data, obtaining the best results and stability. The system is evaluated using three subsets of the AMI corpus involving different speaker and channel variabilities. We use the within-recording and across-recording diarization error rates (DER), cluster purity and cluster coverage to measure the performance of the proposed system. Across-recording DER as low as within-recording DER are obtained for some system setups.
Citations: 35
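The distance measure the abstract highlights, the two-sample Hotelling T-squared statistic between clusters of speaker-factor vectors, is straightforward to compute. This is a sketch of the statistic itself, not the authors' full agglomerative linking system.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T-squared statistic.

    x and y are clusters of vectors (rows = observations). Used here
    as a distance between two speaker clusters: small values mean the
    clusters plausibly share a mean (the same speaker).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    # pooled covariance of the two clusters
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    return float(n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(np.atleast_2d(s), diff))
```

In an agglomerative clustering loop, the pair of clusters with the smallest T-squared value would be merged first, and the dendrogram cut at a threshold as described in the abstract.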
Incorporating syllable duration into line-detection-based spoken term detection
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424223
Teppei Ohno, T. Akiba
A conventional method for spoken term detection (STD) is to apply approximate string matching to subword sequences in a spoken document obtained by speech recognition. An STD method that considers string matching as line detection in a syllable distance plane has been proposed. While this has demonstrated fast ordered-by-distance detections, it has still suffered from the insertion and deletion errors introduced by the speech recognition. In this work, we aim to improve detection performance by employing syllable-duration information. The proposed method enables robust detection by introducing a distance plane that uses frames as units instead of using syllables as units. Our experimental evaluation showed that the incorporation of syllable-duration achieved higher detection performance in high-recall regions.
Citations: 2
A grapheme-based method for automatic alignment of speech and text data
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424237
Adriana Stan, P. Bell, Simon King
This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.
Citations: 39
Realistic answer verification: An analysis of user errors in a sentence-repetition task
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424163
S. Shirali-Shahreza, Gerald Penn
Speech authentication protocols should have a challenge/response feature to be protected against replay attacks. As a result, they need to verify whether the user responded to an interactive prompt. However, it is usually assumed that the user will provide their answer perfectly. In this paper, we report on an ecologically valid user study that we conducted to test this assumption. Our results show that 40% of user answers are imperfect, even in a task as simple as sentence repetition. Error analysis reveals that 60% of the imperfect answers contain small errors that should be deemed acceptable, which increases the total acceptance rate of this task to 84%. We also tested a forced alignment algorithm as a means of verifying answers automatically.
Citations: 1
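One simple way to operationalize the paper's finding, that small errors in a repeated sentence should be deemed acceptable, is to accept responses within a word-level edit-distance budget. The threshold below is illustrative, not a value from the paper, and the paper's own verification used forced alignment rather than text edit distance.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rw != hw))   # substitution
    return d[-1]

def accept(prompt, response, max_errors=1):
    """Accept a sentence repetition with at most `max_errors` word errors."""
    return word_edit_distance(prompt, response) <= max_errors
```

With `max_errors=0` this is exact matching, which the study shows would reject 40% of genuine answers; loosening the budget mirrors the reported jump in acceptance rate.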
Using rhythmic features for Japanese spoken term detection
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424217
Naoyuki Kanda, Ryu Takeda, Y. Obuchi
A new rescoring method for spoken term detection (STD) is proposed. Phoneme-based close-matching techniques have been used because of their ability to detect out-of-vocabulary (OOV) queries. To improve the accuracy of phoneme-based techniques, rescoring techniques have been used to accurately re-rank the results from phoneme-based close-matching; however, conventional rescoring techniques based on an utterance verification model still produce many false detection results. To further improve the accuracy, in this study, several features representing the “naturalness” (or “abnormality”) of duration of phonemes/syllables in detected candidates of a keyword are proposed. These features are incorporated into a conventional rescoring technique using logistic regression. Experimental results with a 604-hour Japanese speech corpus indicated that combining the rhythmic features achieved a further relative error reduction of 8.9% compared to a conventional rescoring technique.
Citations: 6
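The rescoring step the abstract describes, combining a matching score with duration-naturalness features through logistic regression, reduces at inference time to a weighted sum passed through a sigmoid. The feature names and weights below are hand-set for illustration; in the paper the weights are learned from held-out detections.

```python
import math

def rescore(candidates, weights, bias):
    """Re-rank spoken term detection candidates with a logistic model.

    Each candidate is a dict mapping feature names (e.g. the acoustic
    matching score and rhythmic duration features) to values. Returns
    (probability, candidate) pairs, best first.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    rescored = []
    for cand in candidates:
        z = bias + sum(weights[k] * cand[k] for k in weights)
        rescored.append((sigmoid(z), cand))
    rescored.sort(key=lambda p: p[0], reverse=True)
    return rescored
```

The effect the paper reports can be seen in miniature: a candidate with a slightly lower matching score but a much more natural duration profile can out-rank a false detection with an abnormal rhythm.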
Affective evaluation of a mobile multimodal dialogue system using brain signals
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424195
M. Perakakis, A. Potamianos
We propose the use of affective metrics such as excitement, frustration and engagement for the evaluation of multimodal dialogue systems. The affective metrics are elicited from the ElectroEncephaloGraphy (EEG) signals using the Emotiv EPOC neuroheadset device. The affective metrics are used in conjunction with traditional evaluation metrics (turn duration, input modality) to investigate the effect of speech recognition errors and modality usage patterns in a multimodal (touch and speech) dialogue form-filling application for the iPhone mobile device. Results show that: (1) engagement is higher for touch input, while excitement and frustration is higher for speech input, and (2) speech recognition errors and associated repairs correspond to specific dynamic patterns of excitement and frustration. Use of such physiological channels and their elaborated interpretation is a challenging but also a potentially rewarding direction towards emotional and cognitive assessment of multimodal interaction design.
Citations: 15
Performance improvement of automatic pronunciation assessment in a noisy classroom
Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424262
Yi Luan, Masayuki Suzuki, Yutaka Yamauchi, N. Minematsu, Shuhei Kato, K. Hirose
In recent years, Computer-Assisted Language Learning (CALL) systems have been widely used in foreign language education. Some systems use automatic speech recognition (ASR) technologies to detect pronunciation errors and estimate the proficiency level of individual students. When speech recording is done in a CALL classroom, however, utterances of a student are always recorded with those of the others in the same class. The latter utterances are just background noise, and the performance of automatic pronunciation assessment is degraded especially when a student is surrounded with very active students. To solve this problem, we apply a noise reduction technique, Stereo-based Piecewise Linear Compensation for Environments (SPLICE), and the compensated feature sequences are input to a Goodness Of Pronunciation (GOP) assessment system. Results show that SPLICE-based noise reduction works very well as a means to improve the assessment performance in a noisy classroom.
Citations: 8
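SPLICE learns, from stereo (noisy, clean) feature pairs, a piecewise-linear correction of noisy features. The sketch below makes two simplifying assumptions relative to full SPLICE: hard nearest-centroid region assignments stand in for GMM posteriors, and each region gets a pure bias vector rather than an affine transform. It illustrates the stereo-training idea, not the authors' exact implementation.

```python
import numpy as np

def train_splice(noisy, clean, centroids):
    """Estimate per-region bias vectors from stereo training pairs.

    For each region k (nearest centroid), the bias is the mean of
    (clean - noisy) over the frames assigned to that region.
    """
    noisy, clean = np.asarray(noisy, float), np.asarray(clean, float)
    centroids = np.asarray(centroids, float)
    assign = np.argmin(((noisy[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    biases = np.zeros_like(centroids)
    for k in range(len(centroids)):
        mask = assign == k
        if mask.any():
            biases[k] = (clean[mask] - noisy[mask]).mean(axis=0)
    return biases

def compensate(frames, centroids, biases):
    """Apply the learned bias of each frame's nearest region."""
    frames = np.asarray(frames, float)
    centroids = np.asarray(centroids, float)
    assign = np.argmin(((frames[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    return frames + biases[assign]
```

The compensated feature sequences would then be fed to the GOP assessment system in place of the raw noisy features.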