
2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU): Latest Publications

Error simulation for training statistical dialogue systems
Pub Date : 2007-12-13 DOI: 10.1109/ASRU.2007.4430167
J. Schatzmann, Blaise Thomson, S. Young
Human-machine dialogue is heavily influenced by speech recognition and understanding errors, and it is hence desirable to train and test statistical dialogue system policies under realistic noise conditions. This paper presents a novel approach to error simulation based on statistical models for word-level utterance generation, ASR confusions, and confidence score generation. While the method explicitly models the context-dependent acoustic confusability of words and allows the system-specific language model and semantic decoder to be incorporated, it is computationally inexpensive and thus potentially suitable for running thousands of training simulations. Experimental evaluation results with a POMDP-based dialogue system and the Hidden Agenda User Simulator indicate a close match between the statistical properties of real and synthetic errors.
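The pipeline the abstract describes (word-level utterance generation, ASR confusion, confidence scoring) can be illustrated with a minimal sketch. This is not the authors' implementation: the confusion table, the beta-distribution parameters, and the `corrupt` helper are all invented for illustration.

```python
import random

# Hypothetical word-level confusion model: for each true word, a distribution
# over ASR outputs (the correct word, confusions, and "" for a deletion).
CONFUSIONS = {
    "cheap": [("cheap", 0.80), ("keep", 0.15), ("", 0.05)],
    "hotel": [("hotel", 0.90), ("motel", 0.10)],
}

def corrupt(words):
    """Pass a true user utterance through the simulated ASR channel."""
    hyp = []
    for w in words:
        outs, probs = zip(*CONFUSIONS.get(w, [(w, 1.0)]))
        out = random.choices(outs, weights=probs)[0]
        if out:  # the empty string models a deletion
            # Confidence drawn from different beta distributions for correct
            # and confused words, so errors tend to receive lower scores.
            conf = random.betavariate(9, 2) if out == w else random.betavariate(2, 5)
            hyp.append((out, round(conf, 2)))
    return hyp

print(corrupt("cheap hotel".split()))  # e.g. [('keep', 0.31), ('hotel', 0.88)]
```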
Citations: 90
Predictive linear transforms for noise robust speech recognition
Pub Date : 2007-12-13 DOI: 10.1109/ASRU.2007.4430084
M. Gales, R. V. Dalen
It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are estimated. One solution to this problem, which yields good performance, is joint uncertainty decoding (JUD) with full transforms. Unfortunately, this results in a high computational cost during decoding. This paper contrasts two approaches to approximating full JUD while lowering the computational cost. Both use predictive linear transforms to modify the feature-space: adaptation-based linear transforms, where the model parameters are restricted to be the same as the original clean system; and precision matrix modelling approaches, in particular semi-tied covariance matrices. These predictive transforms are estimated using statistics derived from the full JUD transforms rather than noisy data. The schemes are evaluated on AURORA 2 and a noise-corrupted resource management task.
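For orientation, the full-transform JUD compensation that the predictive schemes approximate is usually written with one affine transform and one bias covariance per regression class r; the notation below is a reconstruction from the JUD literature, not copied from the paper:

```latex
p(\mathbf{y}_t \mid m) \approx |\mathbf{A}^{(r)}|\,
\mathcal{N}\!\left(\mathbf{A}^{(r)}\mathbf{y}_t + \mathbf{b}^{(r)};\;
\boldsymbol{\mu}_m,\; \boldsymbol{\Sigma}_m + \boldsymbol{\Sigma}_b^{(r)}\right)
```

Because the bias covariance Σ_b^{(r)} is full in general, every component covariance becomes full at decode time, which is exactly the computational cost the predictive transforms are designed to avoid.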
Citations: 31
Development of a phonetic system for large vocabulary Arabic speech recognition
Pub Date : 2007-12-13 DOI: 10.1109/ASRU.2007.4430078
M. Gales, Frank Diehl, C. Raut, M. Tomalin, P. Woodland, Kai Yu
This paper describes the development of an Arabic speech recognition system based on a phonetic dictionary. Though phonetic systems have been previously investigated, this paper makes a number of contributions to the understanding of how to build these systems, as well as describing a complete Arabic speech recognition system. The first issue considered is discriminative training when there are a large number of pronunciation variants for each word. In particular, the loss function associated with minimum phone error (MPE) training is examined. The performance and combination of phonetic and graphemic acoustic models are then compared on both Broadcast News (BN) and Broadcast Conversation (BC) data. The final contribution of the paper is a simple scheme for automatically generating pronunciations for use in training and reducing the phonetic out-of-vocabulary rate. The paper concludes with a description and results from using phonetic and graphemic systems in a multipass/combination framework.
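The MPE criterion whose loss function the paper examines is standardly written as an expected raw phone accuracy; the notation below is reconstructed for the reader (κ is an acoustic scaling factor and A(s, s_r) the phone accuracy of hypothesis s against reference s_r for utterance r), not taken from the paper:

```latex
\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R}
\frac{\sum_{s} p_{\lambda}(\mathbf{O}_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}
     {\sum_{s'} p_{\lambda}(\mathbf{O}_r \mid s')^{\kappa}\, P(s')}
```

With a large number of pronunciation variants per word, the interaction between those variants and the accuracy term A(s, s_r) is where the loss function needs examination.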
Citations: 29
SpeechFind for CDP: Advances in spoken document retrieval for the U.S. Collaborative Digitization Program
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430195
Wooil Kim, J. Hansen
This paper presents our recent advances for SpeechFind, a CRSS-UTD-designed spoken document retrieval system for the U.S.-based Collaborative Digitization Program (CDP). A prototype of SpeechFind for the CDP is currently serving as the search engine for 1,300 hours of CDP audio content, which contains a wide range of acoustic conditions, vocabulary and period selection, and topics. In an effort to determine the amount of user-corrected transcripts needed to impact automatic speech recognition (ASR) and audio search, a web-based online interface for verification of ASR-generated transcripts was developed. The procedure for enhancing the transcription performance of SpeechFind is also presented. A selection of adaptation methods for language and acoustic models is employed depending on the acoustics of the corpora under test. Experimental results on the CDP corpus demonstrate that the employed model adaptation scheme using the verified transcripts is effective in improving recognition accuracy. Through a combination of feature/acoustic model enhancement and language model selection, up to 24.8% relative improvement in ASR was obtained. The SpeechFind system, employing automatic transcript generation, online CDP transcript correction, and our transcript reliability estimator, demonstrates a comprehensive support mechanism to ensure reliable transcription and search for U.S. libraries with limited speech technology experience.
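As a reading aid, the relative improvement quoted above is the usual ratio of WER reduction to the baseline WER; the numbers in the worked example are illustrative only, not taken from the paper:

```latex
\Delta_{\mathrm{rel}} = \frac{\mathrm{WER}_{\mathrm{base}} - \mathrm{WER}_{\mathrm{new}}}{\mathrm{WER}_{\mathrm{base}}},
\qquad \text{e.g.}\quad \frac{40.0\% - 30.1\%}{40.0\%} \approx 24.8\%
```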
Citations: 6
Development and portability of ASR and Q&A modules for real-environment speech-oriented guidance systems
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430166
T. Cincarek, Hiromichi Kawanami, H. Saruwatari, K. Shikano
In this paper, we investigate the development and portability of the ASR and Q&A modules of speech-oriented guidance systems for two different real environments. An initial prototype system has been constructed for a local community center using two years of human-labeled data collected by the system. Collection of real user data is required because the ASR task and Q&A domain of a guidance system are defined by the target environment and potential users. However, since human preparation of data is always costly, most often only a relatively small amount of real data will be available for system adaptation in practice. Therefore, the portability of the initial prototype system is investigated for a different environment, a local subway station. The purpose is to identify reusable system parts. The ASR module is found to be highly portable across the two environments. However, the portability of the Q&A module was only medium. From an objective analysis it became clear that this is mainly due to the environment-dependent domain differences between the two systems. This implies that it will always be important to take the behavior of actual users under real conditions into account to build a system with high user satisfaction.
Citations: 4
Introduction of the METI project “development of fundamental speech recognition technology”
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430117
S. Furui, Tetsunori Kobayashi
Summary form only given. Waseda University, Tokyo Institute of Technology, and six companies (Asahi-kasei, Hitachi, Mitsubishi, NEC, Oki and Toshiba) initiated a three-year project in 2006, supported by Japan's Ministry of Economy, Trade and Industry (METI), for jointly developing fundamental automatic speech recognition (ASR) technology. The project focuses on utilizing ASR technology in car and home environments. Seven subtasks are being investigated: speech/non-speech separation using multiple microphones, speech/non-speech separation for a single audio stream, developing a high-performance WFST-based decoder, multi-lingual ASR modeling, higher-order language modeling, developing a system for assisting speech interface development, and overall technology evaluation. This talk will give an overview of the intermediate technological progress achieved by the project.
Citations: 0
A study on rescoring using HMM-based detectors for continuous speech recognition
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430175
Qiang Fu, B. Juang
This paper presents an investigation of rescoring performance using hidden Markov model (HMM) based attribute detectors. The minimum verification error (MVE) criterion is employed to enhance the reliability of the detectors in continuous speech recognition. The HMM-based detectors are applied to the possible recognition candidates, which are generated from the conventional decoder and organized in phone/word graphs. We focus on the study of rescoring performance with detectors trained on the tokens produced by the decoder but labeled with broad phonetic categories rather than phonetic identities. Various training criteria and knowledge fusion methods are investigated under various semantic-level rescoring scenarios. This research demonstrates various possibilities for embedding auxiliary information into the current automatic speech recognition (ASR) framework for improved results. It also represents an intermediate step towards the construction of a true detection-based ASR paradigm.
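The MVE criterion used to train the detectors minimises a smoothed count of the two verification error types; a standard formulation (reconstructed here, with d a misverification distance and γ the slope of the smoothing sigmoid, not taken from the paper) is:

```latex
\mathcal{L}_{\mathrm{MVE}} =
\frac{1}{N_{\mathrm{tar}}} \sum_{i \in \mathrm{targets}} \ell(d_i) +
\frac{1}{N_{\mathrm{imp}}} \sum_{j \in \mathrm{impostors}} \ell(d_j),
\qquad \ell(d) = \frac{1}{1 + e^{-\gamma d}}
```

Gradient descent on this loss pushes the attribute detectors to accept true tokens and reject confusable ones, which is what makes their scores useful for rescoring.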
Citations: 5
Never-ending learning system for on-line speaker diarization
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430197
K. Markov, Satoshi Nakamura
In this paper, we describe a new high-performance on-line speaker diarization system which works faster than real time and has very low latency. It consists of several modules, including voice activity detection, novel speaker detection, and speaker gender and speaker identity classification. All modules share a set of Gaussian mixture models (GMMs) representing pause, male and female speakers, and each individual speaker. Initially, there are only three GMMs, for pause and the two speaker genders, trained in advance on some data. During the speaker diarization process, for each speech segment it is decided whether it comes from a new speaker or from an already known speaker. In the case of a new speaker, his/her gender is identified, and then a new GMM is spawned from the corresponding gender GMM by copying its parameters. This GMM is learned on-line using the speech segment data, and from this point it is used to represent the new speaker. All individual speaker models are produced in this way. In the case of a known speaker, s/he is identified and the corresponding GMM is again learned on-line. In order to prevent unlimited growth of the number of speaker models, models that have not been selected as winners for a long period of time are deleted from the system. This allows the system to perform its task indefinitely, in addition to being capable of self-organization, i.e. unsupervised adaptive learning, and preservation of the learned knowledge, i.e. speakers. Such functionalities are attributed to the so-called Never-Ending Learning systems. For evaluation, we used part of the TC-STAR database consisting of European Parliament plenary speeches. The results show that this system achieves a speaker diarization error rate of 4.6% with a latency of at most 3 seconds.
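The spawn/identify/prune loop described in the abstract can be made concrete with a minimal sketch. Everything here is simplified for illustration: each GMM is collapsed to a single diagonal Gaussian with a crude interpolation update, and the threshold and pruning constants are invented, not taken from the paper.

```python
import numpy as np

class OnlineModel:
    """Toy stand-in for a speaker GMM: a diagonal Gaussian with online updates."""
    def __init__(self, mean, var):
        self.mean, self.var, self.age = mean.copy(), var.copy(), 0

    def loglik(self, frames):
        d = frames - self.mean
        return float(np.mean(-0.5 * (np.log(2 * np.pi * self.var) + d ** 2 / self.var)))

    def update(self, frames, lr=0.1):
        # Crude on-line adaptation: interpolate the mean toward the new segment.
        self.mean = (1 - lr) * self.mean + lr * frames.mean(axis=0)

NEW_SPEAKER_THRESHOLD = -40.0   # invented value; the paper uses likelihood-based tests
MAX_IDLE = 100                  # prune models not selected for this many segments

def diarize_segment(frames, speakers, gender_models):
    """Assign one speech segment (frames: [T, dim] array) to a speaker label."""
    for m in speakers.values():
        m.age += 1
    best = max(speakers.items(), key=lambda kv: kv[1].loglik(frames), default=None)
    if best is not None and best[1].loglik(frames) > NEW_SPEAKER_THRESHOLD:
        name, model = best                       # known speaker: adapt on-line
    else:
        # Novel speaker: pick the better-matching gender GMM and spawn a new
        # speaker model by copying its parameters (the paper's key mechanism).
        g = max(gender_models, key=lambda k: gender_models[k].loglik(frames))
        name = f"{g}-{len(speakers)}"
        model = OnlineModel(gender_models[g].mean, gender_models[g].var)
        speakers[name] = model
    model.update(frames)
    model.age = 0
    # Never-ending learning: delete long-idle models so the set stays bounded.
    for stale in [k for k, m in speakers.items() if m.age > MAX_IDLE]:
        del speakers[stale]
    return name
```

The key design point survives the simplification: new speaker models are spawned by copying the winning gender model's parameters, so they start from a sensible prior and are then specialised on-line.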
Citations: 40
Hierarchical Pitman-Yor language models for ASR in meetings
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430096
Songfang Huang, S. Renals
In this paper we investigate the application of a hierarchical Bayesian language model (LM) based on the Pitman-Yor process to automatic speech recognition (ASR) of multiparty meetings. The hierarchical Pitman-Yor language model (HPYLM) provides a Bayesian interpretation of LM smoothing. An approximation to the HPYLM recovers the exact formulation of the interpolated Kneser-Ney smoothing method for n-gram models. This paper focuses on the application and scalability of the HPYLM in a practical large-vocabulary ASR system. Experimental results on NIST RT06s evaluation meeting data verify that the HPYLM is a competitive and promising language modeling technique, which consistently performs better than interpolated Kneser-Ney and modified Kneser-Ney n-gram LMs in terms of both perplexity and word error rate.
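The predictive probability of the hierarchical Pitman-Yor model has the standard recursive form below (notation reconstructed from Teh's formulation: c_{uw} and t_{uw} are customer and table counts for word w in context u, d and θ the discount and strength parameters, and π(u) the back-off context):

```latex
p(w \mid u) =
\frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}} +
\frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\; p(w \mid \pi(u))
```

Setting θ = 0 and restricting each word type to a single table per context recovers interpolated Kneser-Ney's absolute discounting, which is the approximation the abstract alludes to.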
Citations: 35
Investigating linguistic knowledge in a maximum entropy token-based language model
Pub Date : 2007-12-01 DOI: 10.1109/ASRU.2007.4430104
Jia Cui, Yi Su, Keith B. Hall, F. Jelinek
We present a novel language model capable of incorporating various types of linguistic information as encoded in the form of a token, a (word, label)-tuple. Using tokens as hidden states, our model is effectively a hidden Markov model (HMM) producing sequences of words with trivial output distributions. The transition probabilities, however, are computed using a maximum entropy model to take advantage of potentially overlapping features. We investigated different types of labels with a wide range of linguistic implications. These models outperform Kneser-Ney smoothed n-gram models both in terms of perplexity on standard datasets and in terms of word error rate for a large vocabulary speech recognition system.
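Since each hidden state is a token t = (w, l) and the output distribution is trivial (a token emits its own word), the model reduces to a log-linear transition distribution marginalised over label sequences; the following is a reconstruction of the form implied by the abstract, not the paper's own notation:

```latex
p(t_i \mid t_{i-1}) =
\frac{\exp\bigl(\sum_k \lambda_k f_k(t_{i-1}, t_i)\bigr)}
     {\sum_{t'} \exp\bigl(\sum_k \lambda_k f_k(t_{i-1}, t')\bigr)},
\qquad
p(w_1^{n}) = \sum_{l_1^{n}} \prod_{i=1}^{n} p\bigl((w_i, l_i) \mid (w_{i-1}, l_{i-1})\bigr)
```

Overlapping features f_k (for example, one firing on the word pair and another on the label pair) are what the maximum entropy parameterisation can exploit but a back-off n-gram model cannot.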
Citations: 8