
Latest publications: 2016 IEEE Spoken Language Technology Workshop (SLT)

Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846242
S. Sakti, S. Kawanishi, Graham Neubig, Koichiro Yoshino, Satoshi Nakamura
In speech interfaces, it is often necessary to understand the overall auditory environment: not only recognizing what is being said, but also being aware of the location or actions surrounding the utterance. However, automatic speech recognition (ASR) becomes difficult when recognizing speech mixed with environmental sounds. Standard solutions treat environmental sounds as noise and remove them to improve ASR performance. On the other hand, most studies on environmental sounds construct classifiers for environmental sounds alone, free of interference from spoken utterances. In reality, however, such isolated conditions almost never exist. This study attempts to address the problem of simultaneous recognition of speech and environmental sounds. In particular, we examine the possibility of using deep neural network (DNN) techniques to recognize speech and environmental sounds simultaneously, and to improve the accuracy of both tasks under the respective noisy conditions. First, we investigate DNN architectures including two parallel single-task DNNs and a single multi-task DNN. However, we found direct multi-task learning of simultaneous speech and environmental-sound recognition to be difficult. Therefore, we further propose a method that combines bottleneck features and sound-dependent i-vectors within this framework. Experimental evaluation results reveal that utilizing bottleneck features and i-vectors as inputs to the DNNs helps to improve the accuracy of each recognition task.
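The input augmentation the abstract describes, frame-level bottleneck features concatenated with an utterance-level, sound-dependent i-vector, can be sketched as follows. This is a minimal illustration; the dimensions and values are invented, not taken from the paper.

```python
def augment_frames(bottleneck_frames, i_vector):
    """Append the utterance-level i-vector to every frame's bottleneck features."""
    return [list(frame) + list(i_vector) for frame in bottleneck_frames]

frames = [[0.1, 0.2], [0.3, 0.4]]   # two frames of 2-dim bottleneck features
ivec = [0.9, 0.8, 0.7]              # 3-dim sound-dependent i-vector
augmented = augment_frames(frames, ivec)
```

Every frame then carries both local spectral information and a fixed summary of the surrounding acoustic condition, which is what lets a frame-level DNN exploit utterance-level context.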
Citations: 5
Batch-normalized joint training for DNN-based distant speech recognition
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846241
M. Ravanelli, Philemon Brakel, M. Omologo, Yoshua Bengio
Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are encountered. Despite the significant progress made in recent years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly.
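The batch normalization named in the title normalizes each mini-batch of activations to zero mean and unit variance, which stabilizes joint training across composed modules. A minimal sketch of the core operation (without the learned scale and shift parameters that full batch norm adds):

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize a mini-batch of scalar activations to zero mean and
    (approximately) unit variance; eps guards against a zero variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

In a real network this is applied per feature dimension, and running statistics are kept for inference.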
Citations: 35
Discriminative multiple sound source localization based on deep neural networks using independent location model
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846325
Ryu Takeda, Kazunori Komatani
We propose a training method for multiple sound source localization (SSL) based on deep neural networks (DNNs). Such networks function as posterior probability estimators of sound location in terms of position labels and achieve high localization correctness. Since previous DNN configurations for SSL handle only one-sound-source cases, they should be extended to multiple-sound-source cases to apply them to real environments. However, a naïve design causes 1) an increase in the number of labels and training data patterns and 2) a lack of label consistency across different numbers of sound sources, such as one-sound and two-or-more-sound cases. These two problems were solved using our proposed method, which involves an independent location model for the former and a block-wise consistent labeling with ordering for the latter. Our experiments indicated that SSL based on DNNs trained by our proposed method outperformed a conventional SSL method by a maximum of 18 points in terms of block-level correctness.
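One plausible way to read off multiple simultaneous sources from a network's posterior outputs is to treat SSL as multi-label classification and threshold the posteriors per position label. This is a hedged sketch of that decoding idea, not the paper's exact block-wise labeling scheme; the threshold value is an assumption.

```python
def active_locations(posteriors, threshold=0.5):
    """Report every position label whose posterior exceeds the threshold
    as an active source, allowing zero, one, or several sources at once."""
    return [label for label, p in enumerate(posteriors) if p >= threshold]

# Posteriors over 4 position labels; two sources are simultaneously active.
print(active_locations([0.7, 0.1, 0.9, 0.2]))
```

Unlike an argmax decision, this rule makes one-source and multi-source cases consistent: the same output layer covers both.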
Citations: 83
Boosting performance on low-resource languages by standard corpora: An analysis
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846329
F. Grézl, M. Karafiát
In this paper, we analyze the feasibility of using a single well-resourced language, English, as the source language for multilingual techniques in the context of a Stacked Bottle-Neck tandem system. The effect of the amount of data and the number of tied states in the source language on the performance of the ported system is evaluated together with different porting strategies. Generally, increasing both the amount of data and the level of detail is beneficial, with a greater effect observed for an increasing number of tied states. The modified neural network structure, shown to be useful for multilingual porting, was also evaluated with its specific porting procedure. Using the original NN structure in combination with the modified porting adapt-adapt strategy was found to work best, achieving a relative improvement of 3.5–8.8% on a variety of target languages. These results are comparable with using multilingual NNs pretrained on 7 languages.
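The "relative improvement" figures quoted here (and in several other abstracts on this page) are relative error-rate reductions. For clarity, the arithmetic can be sketched as follows; the example numbers are invented, not the paper's results.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error-rate reduction, in percent, of a new system over a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# e.g. a baseline WER of 40.0% reduced to 36.5% is an 8.75% relative improvement
gain = relative_improvement(40.0, 36.5)
```

Note that a relative improvement is larger than the corresponding absolute one whenever the baseline error is below 100%, which is why papers typically state which kind they report.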
Citations: 2
Toward human-assisted lexical unit discovery without text resources
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846246
C. Bartels, Wen Wang, V. Mitra, Colleen Richey, A. Kathol, D. Vergyri, H. Bratt, C. Hung
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which are often available even when text resources are not. We create a framework that benefits from such resources without assuming orthographic representations or generating word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain the phone recognition output and to constrain lexical unit discovery on that output.
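Fuzzy sub-string matching of the kind described, finding an approximate occurrence of a phone-string query inside a longer recognized phone sequence, is commonly formulated as an edit-distance dynamic program whose first row is free (a match may start anywhere) and whose answer is the minimum over the last row (a match may end anywhere). A sketch, assuming single-character phone symbols for simplicity:

```python
def fuzzy_substring_distance(pattern, text):
    """Minimum edit distance between `pattern` and any substring of `text`
    (the standard DP formulation of fuzzy sub-string matching)."""
    prev = [0] * (len(text) + 1)              # a match may start anywhere in text
    for i, pc in enumerate(pattern, 1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, 1):
            cost = 0 if pc == tc else 1
            curr[j] = min(prev[j - 1] + cost,  # substitution / match
                          prev[j] + 1,         # deletion
                          curr[j - 1] + 1)     # insertion
        prev = curr
    return min(prev)                          # a match may end anywhere in text

# Search a phone-string query inside a longer recognized phone sequence.
print(fuzzy_substring_distance("kaet", "dhaxkaetsaet"))
```

A distance of zero means an exact occurrence; small nonzero distances tolerate the recognizer's phone errors.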
Citations: 11
Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846321
Spyridon Thermos, G. Potamianos
Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, in which depth video captures either a frontal or a profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth-view setup, reducing speech activity detection and speaker diarization errors relative to systems that ignore it.
Citations: 11
Speaker independent diarization for child language environment analysis using deep neural networks
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846253
M. Najafian, J. Hansen
Large-scale monitoring of the child language environment, through measuring the amount of speech directed to the child by other children and adults during vocal communication, is an important task. Using the audio extracted from a recording unit worn by a child within a childcare center, at each point in time our proposed diarization system can determine the content of the child's language environment by categorizing the audio into one of four major categories: (1) speech initiated by the child wearing the recording unit, speech originated by other (2) children or (3) adults and directed at the primary child, and (4) non-speech content. In this study, we exploit complex Hidden Markov Models (HMMs) with multiple states to model the temporal dependencies between different sources of acoustic variability, and estimate the HMM state output probabilities using deep neural networks as a discriminative modeling approach. The proposed system is robust against common diarization errors caused by rapid turn-taking, between-class similarities, and background noise, without the need for prior clustering techniques. The experimental results confirm that this approach outperforms state-of-the-art Gaussian mixture model based diarization without the need for bottom-up clustering, and leads to a 22.24% relative error reduction.
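The paper's combination of frame-wise DNN scores with HMM temporal modeling amounts to Viterbi decoding: the most likely category sequence given per-frame evidence and transition probabilities that discourage implausibly rapid switching. A minimal sketch of that decoding step (plain probabilities rather than log-domain, and two categories instead of the paper's four, purely for brevity):

```python
def viterbi(emissions, trans, init):
    """Most likely category sequence given frame-wise scores (`emissions`)
    and HMM transition (`trans`) and initial (`init`) probabilities."""
    n = len(init)
    score = [init[s] * emissions[0][s] for s in range(n)]
    back = []
    for frame in emissions[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] * trans[r][s])
            ptr.append(best)
            score.append(prev[best] * trans[best][s] * frame[s])
        back.append(ptr)
    path = [max(range(n), key=lambda s: score[s])]
    for ptr in reversed(back):      # backtrace from the best final state
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Two categories; the transitions discourage rapid switching between them.
frames = [[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]]
path = viterbi(frames, [[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5])
```

Real systems decode in the log domain to avoid underflow on long recordings; the structure of the recursion is unchanged.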
Citations: 20
Dynamic adjustment of language models for automatic speech recognition using word similarity
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846299
Anna Currey, I. Illina, D. Fohr
Out-of-vocabulary (OOV) words can pose a particular problem for automatic speech recognition (ASR) of broadcast news. The language models (LMs) of ASR systems are typically trained on static corpora, whereas new words (particularly new proper nouns) are continually introduced in the media. Additionally, such OOVs are often content-rich proper nouns that are vital to understanding the topic. In this work, we explore methods for dynamically adding OOVs to language models by adapting the n-gram language model used in our ASR system. We propose two strategies: the first relies on finding in-vocabulary (IV) words similar to the OOVs, where word embeddings are used to define similarity. Our second strategy leverages a small contemporary corpus to estimate OOV probabilities. The models we propose yield improvements in perplexity over the baseline; in addition, the corpus-based approach leads to a significant decrease in proper noun error rate over the baseline in recognition experiments.
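The first strategy, finding in-vocabulary words similar to an OOV via word embeddings, typically reduces to a nearest-neighbor search under cosine similarity. A sketch with invented toy embeddings (real systems would use pretrained word vectors of much higher dimension):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_iv_word(oov_embedding, iv_embeddings):
    """Pick the in-vocabulary word whose embedding is most similar to the OOV's."""
    return max(iv_embeddings, key=lambda w: cosine(oov_embedding, iv_embeddings[w]))

# Toy 2-dim embeddings for two IV words; the OOV vector resembles "president".
iv = {"president": [0.9, 0.1], "banana": [0.0, 1.0]}
print(closest_iv_word([0.8, 0.2], iv))
```

The OOV can then inherit (a share of) the n-gram probabilities of its nearest IV neighbor when the language model is adapted.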
Citations: 10
Adaptation of SVM for MIL for inferring the polarity of movies and movie reviews
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846274
M. J. Correia, I. Trancoso, B. Raj
Polarity detection is a research topic of major interest, with many applications including detecting the polarity of product reviews. However, in some cases, the polarity of the product reviews might not be available while the polarity of the product itself might be, prohibiting the use of any form of fully supervised learning technique. This scenario, while different, is close to that of multiple instance learning (MIL). In this work we propose two new adaptations of support vector machines (SVM) for MIL, θ-MIL, to suit this new scenario, and infer the polarity of products and product reviews. We perform experiments on the proposed methods using the IMDb movie review corpus, and compare the performance of the proposed methods to the traditional SVM for MIL approach. Although we make weaker assumptions about the data, the proposed methods achieve a comparable performance to the SVM for MIL in accurately detecting the polarity of movies and movie reviews.
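For readers unfamiliar with MIL: in the classic multiple instance learning setting, labels attach to bags of instances (here, plausibly, a movie and its set of reviews), and a bag is positive iff at least one instance is. The paper's θ-MIL adaptations modify this setting; the sketch below shows only the standard bag-labeling assumption, with instance scores as hypothetical classifier margins.

```python
def bag_label(instance_scores, threshold=0.0):
    """Classic MIL assumption: a bag (e.g. all reviews of one movie) is
    positive iff at least one instance scores above the threshold."""
    return 1 if max(instance_scores) > threshold else 0

# Three review scores from some instance-level classifier: one is positive,
# so the bag as a whole is labeled positive.
print(bag_label([-1.2, 0.4, -0.3]))
```

Training an SVM under this assumption requires inferring which instances inside positive bags are truly positive, which is what MIL-specific SVM formulations address.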
Citations: 1
Entropy-based pruning of hidden units to reduce DNN parameters
Pub Date : 2016-12-01 DOI: 10.1109/SLT.2016.7846335
G. Mantena, K. Sim
For acoustic modeling, the use of DNNs has become popular due to the superior performance improvements observed in many automatic speech recognition (ASR) tasks. Typically, DNNs with deep (many layers) and wide (many hidden units per layer) architectures are chosen in order to achieve good gains. An issue with such approaches is the explosion in the number of learnable parameters. This makes it difficult to build models when there is not a sufficient amount of training data (or data for adaptation), and also limits the usage of ASR systems on hand-held devices such as mobile phones. A way to overcome this issue is to reduce the number of parameters. In this work, we provide a framework to effectively reduce the number of parameters by removing hidden units. Each hidden unit is represented by an activity vector associated with speech attributes such as phones. A normalized entropy-based measure computed from these activity vectors reflects the significance of each unit in the DNN model. For comparison, we also use low-rank matrix factorization to reduce the number of parameters, and show that it can reduce the number of parameters only to a certain extent. We therefore extend the pruning technique in combination with low-rank matrix factorization to further reduce the model. We provide detailed experimental results on the Aurora-4 and TEDLIUM databases and show that the models can be reduced to approximately 20–30% of their initial size without much loss in ASR performance.
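A normalized-entropy criterion of the kind described can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: a unit whose activity spreads uniformly over the speech-attribute classes carries little discriminative information (entropy near 1 after normalization) and is a natural pruning candidate, while a sharply peaked unit (entropy near 0) is kept.

```python
import math

def normalized_entropy(activity):
    """Entropy of a hidden unit's activity over attribute classes,
    normalized to [0, 1] by dividing by log(number of classes)."""
    total = sum(activity)
    probs = [a / total for a in activity if a > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(activity))

def prune_candidates(unit_activities, keep_ratio=0.5):
    """Indices of the units with highest normalized entropy (least selective)."""
    ranked = sorted(range(len(unit_activities)),
                    key=lambda u: normalized_entropy(unit_activities[u]),
                    reverse=True)
    n_prune = int(len(ranked) * (1 - keep_ratio))
    return ranked[:n_prune]
```

For example, a unit with uniform activity `[1, 1, 1, 1]` scores 1.0 and is pruned before a unit with peaked activity `[9, 0, 0, 1]`.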
Citations: 5
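The two reduction steps described in the abstract — entropy-based unit pruning and low-rank factorization — can be sketched as follows. This is an illustrative reading, not the paper's exact recipe: the rule "drop the highest-entropy (least phone-selective) units" and all function names are assumptions.

```python
import numpy as np

def normalized_entropy(activity):
    """activity: (units, classes) non-negative mean activations per phone class.
    Returns per-unit entropy normalized to [0, 1]; 1 = uniform over classes."""
    p = activity / activity.sum(axis=1, keepdims=True)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)
    return h / np.log(activity.shape[1])

def prune_layer(W_in, b, W_out, activity, keep_ratio=0.5):
    """Drop the highest-entropy hidden units of one layer.
    W_in: (units, d_in), b: (units,), W_out: (d_out, units)."""
    h = normalized_entropy(activity)
    k = max(1, int(keep_ratio * len(h)))
    keep = np.argsort(h)[:k]          # keep the most class-selective units
    return W_in[keep], b[keep], W_out[:, keep]

def low_rank_factorize(W, rank):
    """SVD-based factorization W ≈ A @ B to further cut parameters."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank]
    return A, B
```

Pruning a layer at keep_ratio 0.5 halves both its own parameters and the input dimension of the next layer, while factorizing an m×n matrix at rank r replaces mn parameters with r(m+n) — which is why the two techniques compose.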
Journal
2016 IEEE Spoken Language Technology Workshop (SLT)