
2011 IEEE Workshop on Automatic Speech Recognition & Understanding: Latest Publications

Bootstrapping a spoken language identification system using unsupervised integrated sensing and processing decision trees
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163955
Shuai Huang, D. Karakos, Glen A. Coppersmith, Kenneth Ward Church, S. Siniscalchi
In many inference and learning tasks, collecting large amounts of labeled training data is time consuming and expensive, and oftentimes impractical. Thus, being able to efficiently use small amounts of labeled data with an abundance of unlabeled data—the topic of semi-supervised learning (SSL) [1]—has garnered much attention. In this paper, we look at the problem of choosing these small amounts of labeled data, the first step in a bootstrapping paradigm. Contrary to traditional active learning where an initial trained model is employed to select the unlabeled data points which would be most informative if labeled, our selection has to be done in an unsupervised way, as we do not even have labeled data to train an initial model. We propose using unsupervised clustering algorithms, in particular integrated sensing and processing decision trees (ISPDTs) [2], to select small amounts of data to label and subsequently use in SSL (e.g. transductive SVMs). In a language identification task on the CallFriend1 and 2003 NIST Language Recognition Evaluation corpora [3], we demonstrate that the proposed method results in significantly improved performance over random selection of equivalently sized training data.
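The selection-then-SSL pipeline described in this abstract can be illustrated with a minimal scikit-learn sketch. The stand-ins are assumptions, not the paper's components: k-means replaces the ISPDTs of [2], LabelPropagation replaces the transductive SVM, and synthetic blobs replace real utterance features; only the overall flow (cluster the unlabeled pool, label a few representative points, then train semi-supervised) mirrors the method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

# Toy "utterance features" standing in for real language-ID features.
X, y_true = make_blobs(n_samples=600, centers=4, n_features=10, random_state=0)

# Step 1 (unsupervised): cluster the unlabeled pool; k-means is a stand-in for ISPDTs.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Step 2: pick the point closest to each centroid as the small set sent for labeling.
to_label = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in km.cluster_centers_]

# Step 3 (semi-supervised): -1 marks unlabeled data; only the selected points get labels.
y_ssl = np.full(len(X), -1)
y_ssl[to_label] = y_true[to_label]      # "annotator" labels for the chosen few
ssl = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y_ssl)  # stand-in for a TSVM

print("accuracy with", len(to_label), "labels:", (ssl.transduction_ == y_true).mean())
```

In the paper the labels for the selected points would come from human annotation; here the ground truth plays that role only to make the sketch self-contained.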
Citations: 3
Robust speech recognition using articulatory gestures in a Dynamic Bayesian Network framework
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163918
V. Mitra, Hosung Nam, C. Espy-Wilson
Articulatory Phonology models speech as a spatio-temporal constellation of constricting events (e.g., raising the tongue tip, narrowing the lips), known as articulatory gestures. These gestures are associated with distinct organs (lips, tongue tip, tongue body, velum and glottis) along the vocal tract. In this paper we present a Dynamic Bayesian Network based speech recognition architecture that models the articulatory gestures as hidden variables and uses them for speech recognition. Using the proposed architecture we performed (a) word recognition experiments on the noisy data of Aurora-2 and (b) phone recognition experiments on the University of Wisconsin X-ray microbeam database. Our results indicate that the use of gestural information improves recognition performance compared to a system using acoustic information only.
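A schematic toy, not the authors' model: the point of the DBN formulation is that the hidden state is a configuration of gesture variables rather than a single phone state. The sketch below runs a forward pass over the Cartesian product of two binary gestures with made-up transition and observation parameters.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Two binary gesture variables; the joint hidden state is their Cartesian product.
gestures = list(itertools.product([0, 1], [0, 1]))        # (lips, tongue_tip)
A_lips   = np.array([[0.9, 0.1], [0.2, 0.8]])             # per-gesture transition matrices
A_tip    = np.array([[0.8, 0.2], [0.3, 0.7]])

# Toy observation model: one mean per joint gesture configuration, unit variance.
means = {s: np.array(s, float) * 2.0 for s in gestures}
def loglik(obs, s):
    return -0.5 * np.sum((obs - means[s]) ** 2)

obs_seq = rng.normal(size=(20, 2))                         # fake 2-dim acoustic frames

# Forward pass over the factored state space (gestures transition independently).
logalpha = {s: np.log(1 / len(gestures)) + loglik(obs_seq[0], s) for s in gestures}
for obs in obs_seq[1:]:
    new = {}
    for s in gestures:
        terms = [logalpha[p]
                 + np.log(A_lips[p[0], s[0]]) + np.log(A_tip[p[1], s[1]])
                 for p in gestures]
        new[s] = np.logaddexp.reduce(terms) + loglik(obs, s)
    logalpha = new

print("log P(observations) =", np.logaddexp.reduce(list(logalpha.values())))
```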
Citations: 11
Multi-site heterogeneous system fusions for the Albayzin 2010 Language Recognition Evaluation
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163961
Luis Javier Rodriguez-Fuentes, M. Peñagarikano, A. Varona, M. Díez, Germán Bordel, D. M. González, Jesús Antonio Villalba López, A. Miguel, A. Ortega, EDUARDO LLEIDA SOLANO, A. Abad, Oscar Koller, I. Trancoso, Paula Lopez-Otero, Laura Docío Fernández, C. García-Mateo, R. Saeidi, Mehdi Soufifar, T. Kinnunen, T. Svendsen, P. Fränti
The best language recognition performance is commonly obtained by fusing the scores of several heterogeneous systems. Regardless of the fusion approach, the assumption is that different systems contribute complementary information, either because they are developed on different datasets or because they use different features or modeling approaches. Most authors apply fusion as a final resort for improving performance based on an existing set of systems. Although relative performance gains decrease as larger sets of systems are considered, the best performance is usually attained by fusing all the available systems, which may lead to high computational costs. In this paper, we aim to discover which technologies combine best through fusion and to analyse the factors (data, features, modeling methodologies, etc.) that may explain such good performance. Results are presented and discussed for a number of systems provided by the participating sites and the organizing team of the Albayzin 2010 Language Recognition Evaluation. We hope the conclusions of this work help research groups make better decisions in developing language recognition technology.
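Score-level fusion of heterogeneous systems is typically done with a linear (logistic-regression) backend trained on development scores; whether this evaluation used exactly that backend is not stated here, so the sketch below is only a generic illustration with simulated scores from three systems of different individual quality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_trials = 1000

# Per-trial target/non-target labels and simulated scores from 3 heterogeneous systems,
# each with a different amount of separation (i.e. different individual accuracy).
labels = rng.integers(0, 2, n_trials)
scores = np.column_stack([
    labels * sep + rng.normal(scale=1.0, size=n_trials)
    for sep in (1.0, 1.5, 2.0)
])

# Fusion backend: a logistic regression over the stacked per-system scores.
fuser = LogisticRegression().fit(scores[:500], labels[:500])      # "development" half
fused = fuser.decision_function(scores[500:])                     # fused scores, "eval" half

single_best = (scores[500:, 2] > 1.0) == labels[500:]
fused_dec   = (fused > 0) == labels[500:]
print("best single system acc: %.3f   fused acc: %.3f"
      % (single_best.mean(), fused_dec.mean()))
```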
Citations: 15
Detection of precisely transcribed parts from inexact transcribed corpus
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163989
Kengo Ohta, Masatoshi Tsuchiya, S. Nakagawa
Although large-scale spontaneous speech corpora are a crucial resource for many areas of spoken language processing, their size is usually limited by construction cost, especially the cost of precise transcription. On the other hand, inexactly transcribed corpora such as shorthand notes, meeting records and closed captions are widely available. Unfortunately, they are difficult to use directly as speech corpora for training acoustic models, because they contain two kinds of text: precisely transcribed parts and edited parts. To resolve this problem, this paper proposes a method for automatically detecting precisely transcribed parts in inexactly transcribed corpora. Our method consists of two steps: the first is an automatic alignment between the inexact transcription and its corresponding utterance, and the second is a support vector machine based detector of precisely transcribed parts that uses several features obtained in the first step. Experiments using the Japanese National Diet Record show that automatic detection of precise parts is effective for lightly supervised speaker adaptation, and that it achieves reasonable performance in reducing the cost of converting inexactly transcribed corpora into precisely transcribed ones.
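The second step (the SVM detector) can be sketched generically. The features below (per-segment alignment likelihood, disagreement with the ASR hypothesis, duration) are hypothetical stand-ins for whatever the first alignment step actually produces, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 800

# Synthetic per-segment features that step 1 (aligning the inexact transcript
# against the audio / ASR hypothesis) could plausibly produce:
#   [alignment log-likelihood per frame, word error rate vs. hypothesis, duration in s]
precise = np.column_stack([rng.normal(-1.0, 0.3, n // 2),
                           rng.beta(1, 9, n // 2),
                           rng.uniform(1, 10, n // 2)])
edited  = np.column_stack([rng.normal(-2.0, 0.5, n // 2),
                           rng.beta(4, 4, n // 2),
                           rng.uniform(1, 10, n // 2)])
X = np.vstack([precise, edited])
y = np.r_[np.ones(n // 2), np.zeros(n // 2)]          # 1 = precisely transcribed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 2: an SVM decides, per segment, whether the transcript matches the audio
# closely enough to be used as acoustic-model training data.
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("detection accuracy on held-out segments: %.3f" % clf.score(X_te, y_te))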
Citations: 5
Evolutionary discriminative speaker adaptation
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163924
S. Selouani
This paper presents a new evolutionary-based approach that aims at investigating more solutions while simplifying the speaker adaptation process. In this approach, a single global transformation set of parameters is optimized by genetic algorithms using a discriminative objective function. The goal is to achieve accurate speaker adaptation whatever the amount of available adaptive data. Experiments using the ARPA-RM database demonstrate the effectiveness of the proposed method.
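A toy sketch of the general idea: a chromosome encodes a single global affine transform, and a genetic algorithm evolves it against a discriminative objective. The objective below (class-mean separation normalized by spread) and all GA settings are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class "adaptation data"; a real system would score against an acoustic model.
X0 = rng.normal([0.0, 0.0], 1.0, size=(200, 2))
X1 = rng.normal([1.0, 1.5], 1.0, size=(200, 2))

def fitness(theta):
    """Toy discriminative objective: separation of the two classes after applying
    the global transform  x -> A x + b  encoded by the 6-dim chromosome theta."""
    A, b = theta[:4].reshape(2, 2), theta[4:]
    m0, m1 = (X0 @ A.T + b).mean(0), (X1 @ A.T + b).mean(0)
    spread = (X0 @ A.T).std() + (X1 @ A.T).std() + 1e-6
    return np.linalg.norm(m0 - m1) / spread

# Plain generational GA: keep the best half, refill with crossover + Gaussian mutation.
pop = rng.normal(size=(40, 6))
for generation in range(50):
    order = np.argsort([-fitness(p) for p in pop])
    parents = pop[order[:20]]
    kids = []
    for _ in range(20):
        p1, p2 = parents[rng.integers(20)], parents[rng.integers(20)]
        mask = rng.random(6) < 0.5                       # uniform crossover
        kids.append(np.where(mask, p1, p2) + rng.normal(scale=0.1, size=6))
    pop = np.vstack([parents, kids])

best = pop[np.argmax([fitness(p) for p in pop])]
print("best fitness after 50 generations: %.3f" % fitness(best))
```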
Citations: 1
Fast speaker diarization using a high-level scripting language
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163887
Ekaterina Gonina, G. Friedland, Henry Cook, K. Keutzer
Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this paper we present a speaker diarization system captured in under 50 lines of Python that achieves 50–250× faster than real-time performance by using a specialization framework to automatically map and execute computationally intensive GMM training on an NVIDIA GPU, without significant loss in accuracy.
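The GPU specialization framework itself is out of scope for a short sketch, but the agglomerative loop it accelerates can be shown in plain Python. Single-Gaussian models and a BIC merge criterion are used below as a simplification; real diarization systems use multi-component GMMs and keep total model complexity fixed across merges. All data are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "MFCC" streams for 3 speakers, chopped into 6 initial segments.
# A real system would start from short uniform segments of one recording.
speakers = [rng.normal(m, 1.0, size=(300, 13)) for m in (-2.0, 0.0, 2.0)]
segments = [spk[i:i + 150] for spk in speakers for i in (0, 150)]

def bic(data):
    """BIC of a single diagonal Gaussian fit to the data (a simplification of the
    multi-component GMMs used in real diarization systems)."""
    g = GaussianMixture(n_components=1, covariance_type="diag", random_state=0).fit(data)
    n_params = 2 * data.shape[1]                       # mean + variance per dimension
    return 2.0 * g.score(data) * len(data) - n_params * np.log(len(data))

# Agglomerative loop: repeatedly merge the cluster pair with the largest BIC gain;
# stop when every candidate merge would make the criterion worse.
clusters = list(segments)
while len(clusters) > 1:
    best_gain, best_pair = 0.0, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            merged = np.vstack([clusters[i], clusters[j]])
            gain = bic(merged) - bic(clusters[i]) - bic(clusters[j])
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
    if best_pair is None:
        break
    i, j = best_pair
    merged = np.vstack([clusters[i], clusters[j]])
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print("estimated number of speakers:", len(clusters))
```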
Citations: 18
Leveraging large amounts of loosely transcribed corporate videos for acoustic model training
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163912
M. Paulik, P. Panchapagesan
Lightly supervised acoustic model (AM) training has seen a tremendous amount of interest over the past decade. It promises significant cost savings by relying on only small amounts of accurately transcribed speech and large amounts of imperfectly (loosely) transcribed speech. The latter can often be acquired from existing sources at no additional cost. We identify corporate videos as one such source. After reviewing the state of the art in lightly supervised AM training, we describe our efforts to exploit 977 hours of loosely transcribed corporate videos for AM training. We report strong reductions in word error rate of up to 19.4% over our baseline. We also report initial results for a simple yet effective scheme to identify a subset of lightly supervised training labels that are more important to the training process.
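A common recipe in lightly supervised training (not necessarily the exact scheme used in this paper) is to decode the loosely transcribed audio with a transcript-biased language model and keep only segments where the hypothesis agrees closely with the loose transcript. The sketch below shows that filtering step with a word-level edit distance and toy segment pairs.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def select_segments(pairs, max_wer=0.2):
    """Keep segments whose ASR hypothesis agrees closely with the loose transcript."""
    kept = []
    for loose, hyp in pairs:
        ref = loose.split()
        wer = edit_distance(ref, hyp.split()) / max(len(ref), 1)
        if wer <= max_wer:
            kept.append((loose, hyp, wer))
    return kept

# Toy segments: (loose transcript, hypothesis from a decoder biased to that transcript).
pairs = [
    ("welcome to the quarterly all hands meeting", "welcome to the quarterly all hands meeting"),
    ("our revenue grew by twelve percent", "our revenue grew by ten percent"),
    ("please hold your questions until the end", "completely unrelated decoded words here"),
]
for loose, hyp, wer in select_segments(pairs):
    print("kept (WER %.2f): %s" % (wer, loose))
```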
Citations: 5
Decision of response timing for incremental speech recognition with reinforcement learning
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163976
Di Lu, T. Nishimoto, N. Minematsu
In spoken dialog systems, it is important to reduce the delay in generating a response to a user's utterance. We investigate the use of incremental recognition results, which can be obtained from a speech recognition engine before the input utterance ends. To enable the system to respond correctly before the end of the utterance, the incremental results must be used effectively, even though they are not fully reliable. We formulate this problem as a decision-making task in which the system iteratively chooses either to answer based on previous observations or to wait until the next observation, and apply reinforcement learning to it. In our experiments, users rated highly the proposed method, which estimates the completion time of a user's utterance from mora-based speech recognition results.
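A tabular Q-learning toy that mirrors the decision task: at each incremental recognition step the agent either waits for the next partial result (small delay penalty) or answers now (large reward or penalty depending on whether the current hypothesis is right). All states, probabilities and rewards below are invented for illustration; this is not the paper's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# States: discretized stability of the incremental hypothesis (0 = just started ... 4 = final).
# Actions: 0 = WAIT for the next partial result, 1 = ANSWER now.
n_states, WAIT, ANSWER = 5, 0, 1
Q = np.zeros((n_states, 2))

def step(state, action):
    """Toy environment: answering early risks a wrong response (big penalty),
    waiting costs a small delay penalty; made-up probabilities and rewards."""
    if action == ANSWER:
        p_correct = [0.2, 0.4, 0.7, 0.9, 0.99][state]
        reward = 10.0 if rng.random() < p_correct else -10.0
        return None, reward                      # episode ends after answering
    return min(state + 1, n_states - 1), -1.0    # small cost for the extra delay

alpha, gamma, epsilon = 0.1, 0.95, 0.1
for episode in range(5000):
    state = 0
    while state is not None:
        action = rng.integers(2) if rng.random() < epsilon else int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        target = reward if next_state is None else reward + gamma * Q[next_state].max()
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

print("learned policy (0=WAIT, 1=ANSWER) per stability level:", np.argmax(Q, axis=1))
```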
Citations: 9
Speaker adaptation with an Exponential Transform
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163923
Daniel Povey, G. Zweig, A. Acero
In this paper we describe a linear transform that we call an Exponential Transform (ET), which integrates aspects of CMLLR, VTLN and STC/MLLT into a single transform with jointly trained components. Its main advantage is that a very small number of speaker-specific parameters is required, thus enabling effective adaptation with small amounts of speaker specific data. Our formulation shares some characteristics of Vocal Tract Length Normalization (VTLN), and is intended as a substitute for VTLN. The key part of the transform is controlled by a single speaker-specific parameter that is analogous to a VTLN warp factor. The transform has non-speaker-specific parameters that are learned from data, and we find that the axis along which male and female speakers differ is automatically learned. The exponential transform has no explicit notion of frequency warping, which makes it applicable in principle to non-standard features such as those derived from neural nets, or when the key axes may not be male-female. Based on our experiments with standard MFCC features, it appears to perform better than conventional VTLN.
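Not the full Exponential Transform of the paper (which also folds in CMLLR/STC-style shared components), but a toy version of its central idea: a shared matrix A, a per-speaker scalar t, and the feature transform exp(tA). Adaptation then reduces to a one-dimensional likelihood search over t, much like choosing a VTLN warp factor. The "acoustic model" below is a single standard Gaussian and A is random, purely for illustration.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
dim = 4

# Shared (non-speaker-specific) direction matrix A; learned from data in the real
# method, random here.
A = rng.normal(scale=0.1, size=(dim, dim))

# Simulate a speaker whose features are warped by exp(t_true * A).
t_true = 0.8
frames = rng.normal(size=(500, dim)) @ expm(t_true * A).T

def adapted_loglik(t):
    """Log-likelihood of the features after applying x -> exp(-t A) x, under a
    standard-normal model, including the log-Jacobian term (as in CMLLR/fMLLR)."""
    W = expm(-t * A)
    y = frames @ W.T
    ll = -0.5 * np.sum(y ** 2 + np.log(2.0 * np.pi))
    return ll + len(frames) * np.linalg.slogdet(W)[1]

# Speaker adaptation reduces to a one-dimensional search over the scalar t.
grid = np.linspace(-1.5, 1.5, 61)
t_hat = grid[int(np.argmax([adapted_loglik(t) for t in grid]))]
print("true t = %.2f   estimated t = %.2f" % (t_true, t_hat))
```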
Citations: 9
Regularized subspace Gaussian mixture models for cross-lingual speech recognition
Pub Date : 2011-12-01 DOI: 10.1109/ASRU.2011.6163959
Liang Lu, Arnab Ghoshal, S. Renals
We investigate cross-lingual acoustic modelling for low-resource languages using the subspace Gaussian mixture model (SGMM). We assume the presence of acoustic models trained on multiple source languages, and use the global subspace parameters from those models for improved modelling in a target language with limited amounts of transcribed speech. Experiments on the GlobalPhone corpus using Spanish, Portuguese, and Swedish as source languages and German as the target language (with 1 hour and 5 hours of transcribed audio) show that multilingually trained SGMM shared parameters result in lower word error rates (WERs) than parameters taken from a single source language. We also show that regularizing the estimation of the SGMM state vectors by penalizing their ℓ1-norm helps to overcome numerical instabilities and leads to lower WER.
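The ℓ1 penalty on the state vectors plays the same role as in any sparse estimation problem. The sketch below is a generic stand-in, not the SGMM update itself: proximal-gradient (ISTA) minimization of a quadratic objective plus an ℓ1 term, with the quadratic standing in for the SGMM auxiliary function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy surrogate for the state-vector estimation: recover a sparse vector v from
# noisy linear measurements by minimizing  0.5*||M v - b||^2 + lam*||v||_1  (ISTA).
n_obs, dim = 80, 40
v_true = np.zeros(dim)
v_true[rng.choice(dim, size=5, replace=False)] = rng.normal(scale=2.0, size=5)
M = rng.normal(size=(n_obs, dim))
b = M @ v_true + 0.1 * rng.normal(size=n_obs)

def soft_threshold(x, thresh):
    """Proximal operator of the l1 norm: shrink each coordinate toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

lam = 1.0
step = 1.0 / np.linalg.norm(M, 2) ** 2          # 1 / Lipschitz constant of the gradient
v = np.zeros(dim)
for _ in range(500):
    grad = M.T @ (M @ v - b)                    # gradient of the smooth (quadratic) part
    v = soft_threshold(v - step * grad, step * lam)

print("nonzeros in estimate:", int(np.sum(np.abs(v) > 1e-6)),
      "   error vs true vector: %.3f" % np.linalg.norm(v - v_true))
```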
Citations: 36