Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846242
S. Sakti, S. Kawanishi, Graham Neubig, Koichiro Yoshino, Satoshi Nakamura
In speech interfaces, it is often necessary to understand the overall auditory environment, not only recognizing what is being said, but also being aware of the location or actions surrounding the utterance. However, automatic speech recognition (ASR) becomes difficult when recognizing speech in the presence of environmental sounds. Standard solutions treat environmental sounds as noise and remove them to improve ASR performance. On the other hand, most studies on environmental sounds construct classifiers for environmental sounds only, without interference from spoken utterances. In reality, however, such isolated situations rarely exist. This study addresses the problem of simultaneous recognition of speech and environmental sounds. In particular, we examine the possibility of using deep neural network (DNN) techniques to recognize speech and environmental sounds simultaneously and to improve the accuracy of both tasks under the respective noisy conditions. First, we investigate DNN architectures including two parallel single-task DNNs and a single multi-task DNN. However, we found direct multi-task learning of simultaneous speech and environmental sound recognition to be difficult. Therefore, we further propose a method that combines bottleneck features and sound-dependent i-vectors within this framework. Experimental evaluation results reveal that using bottleneck features and i-vectors as DNN inputs helps to improve the accuracy of each recognition task.
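To make the input configuration concrete, here is a minimal sketch (not the authors' exact architecture) of a multi-task DNN whose frame-level input is a bottleneck feature vector concatenated with an utterance-level, sound-dependent i-vector, with separate output heads for the speech and environmental-sound tasks; all layer sizes, dimensions, and class counts below are illustrative assumptions.

```python
# Sketch of a multi-task acoustic model over bottleneck features + i-vectors.
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    def __init__(self, bn_dim=40, ivec_dim=100, hidden=512,
                 n_senones=2000, n_env_classes=10):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(bn_dim + ivec_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.asr_head = nn.Linear(hidden, n_senones)      # speech recognition task
        self.env_head = nn.Linear(hidden, n_env_classes)  # environmental-sound task

    def forward(self, bn_feats, ivector):
        # The i-vector is one vector per utterance; repeat it for every frame.
        x = torch.cat([bn_feats, ivector.expand(bn_feats.size(0), -1)], dim=-1)
        h = self.shared(x)
        return self.asr_head(h), self.env_head(h)

model = MultiTaskAcousticModel()
frames = torch.randn(200, 40)   # 200 frames of bottleneck features
ivec = torch.randn(1, 100)      # utterance-level, sound-dependent i-vector
asr_logits, env_logits = model(frames, ivec)
# Joint training would sum a cross-entropy loss from each head.
```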
{"title":"Deep bottleneck features and sound-dependent i-vectors for simultaneous recognition of speech and environmental sounds","authors":"S. Sakti, S. Kawanishi, Graham Neubig, Koichiro Yoshino, Satoshi Nakamura","doi":"10.1109/SLT.2016.7846242","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846242","url":null,"abstract":"In speech interfaces, it is often necessary to understand the overall auditory environment, not only recognizing what is being said, but also being aware of the location or actions surrounding the utterance. However, automatic speech recognition (ASR) becomes difficult when recognizing speech with environmental sounds. Standard solutions treat environmental sounds as noise, and remove them to improve ASR performance. On the other hand, most studies on environmental sounds construct classifiers for environmental sounds only, without interference of spoken utterances. But, in reality, such separate situations almost never exist. This study attempts to address the problem of simultaneous recognition of speech and environmental sounds. Particularly, we examine the possibility of using deep neural network (DNN) techniques to recognize speech and environmental sounds simultaneously, and improve the accuracy of both tasks under respective noisy conditions. First, we investigate DNN architectures including two parallel single-task DNNs, and a single multi-task DNN. However, we found direct multi-task learning of simultaneous speech and environmental recognition to be difficult. Therefore, we further propose a method that combines bottleneck features and sound-dependent i-vectors within this framework. Experimental evaluation results reveal that the utilizing bottleneck features and i-vectors as the input of DNNs can help to improve accuracy of each recognition task.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123596019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Batch-normalized joint training for DNN-based distant speech recognition
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846241
M. Ravanelli, Philemon Brakel, M. Omologo, Yoshua Bengio
Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially under adverse acoustic conditions. Despite the significant progress made in recent years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly.
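A hedged sketch of the joint-training idea follows: a speech enhancement front-end and an acoustic model are composed into a single computation graph, with batch normalization inside each module, and optimized together through one loss so that the modules stay matched. The layer sizes and the single cross-entropy objective are assumptions for illustration, not the paper's exact recipe.

```python
# Joint training of an enhancement front-end and an acoustic model with batch norm.
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    def __init__(self, feat_dim=40, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),   # predicts "clean" features
        )
    def forward(self, noisy):
        return self.net(noisy)

class AcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, n_states=2000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, n_states),
        )
    def forward(self, feats):
        return self.net(feats)

enhancer, am = Enhancer(), AcousticModel()
opt = torch.optim.SGD(list(enhancer.parameters()) + list(am.parameters()), lr=0.01)

noisy = torch.randn(32, 40)                  # a mini-batch of noisy frames
state_targets = torch.randint(0, 2000, (32,))
logits = am(enhancer(noisy))                 # the two modules are composed...
loss = nn.functional.cross_entropy(logits, state_targets)
loss.backward()                              # ...and trained jointly end-to-end
opt.step()
```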
{"title":"Batch-normalized joint training for DNN-based distant speech recognition","authors":"M. Ravanelli, Philemon Brakel, M. Omologo, Yoshua Bengio","doi":"10.1109/SLT.2016.7846241","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846241","url":null,"abstract":"Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are met. Despite the significant progress made in the last years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116493189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discriminative multiple sound source localization based on deep neural networks using independent location model
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846325
Ryu Takeda, Kazunori Komatani
We propose a training method for multiple sound source localization (SSL) based on deep neural networks (DNNs). Such networks function as posterior probability estimators of sound location in terms of position labels and achieve high localization correctness. Since previous DNN configurations for SSL handle only one-sound-source cases, they must be extended to multiple-sound-source cases before they can be applied in real environments. However, a naïve design causes 1) an increase in the number of labels and training data patterns and 2) a lack of label consistency across different numbers of sound sources, such as one-sound and two-or-more-sound cases. We solve these two problems with our proposed method, which involves an independent location model for the former and block-wise consistent labeling with ordering for the latter. Our experiments indicated that SSL based on DNNs trained with the proposed method outperformed a conventional SSL method by a maximum of 18 points in terms of block-level correctness.
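The independent location model can be pictured as a multi-label output layer: instead of one softmax over every combination of source positions, each candidate position gets its own sigmoid output, so the label set does not grow with the number of simultaneous sources. The input feature dimensionality and position grid below are assumptions for illustration.

```python
# Multi-label "independent location" output for multiple sound source localization.
import torch
import torch.nn as nn

N_POSITIONS = 72   # e.g. 5-degree azimuth bins (assumed granularity)

model = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, N_POSITIONS),   # one independent logit per candidate position
)

spectral_feats = torch.randn(16, 512)     # batch of multichannel input features
# Multi-label target: a 1 for every position that contains an active source.
targets = torch.zeros(16, N_POSITIONS)
targets[:, 10] = 1.0   # source 1
targets[:, 40] = 1.0   # source 2

loss = nn.functional.binary_cross_entropy_with_logits(model(spectral_feats), targets)
active = torch.sigmoid(model(spectral_feats)) > 0.5   # detected source positions
```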
{"title":"Discriminative multiple sound source localization based on deep neural networks using independent location model","authors":"Ryu Takeda, Kazunori Komatani","doi":"10.1109/SLT.2016.7846325","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846325","url":null,"abstract":"We propose a training method for multiple sound source localization (SSL) based on deep neural networks (DNNs). Such networks function as posterior probability estimator of sound location in terms of position labels and achieve high localization correctness. Since the previous DNNs' configuration for SSL handles one-sound-source cases, it should be extended to multiple-sound-source cases to apply it to real environments. However, a naïve design causes 1) an increase in the number of labels and training data patterns and 2) a lack of label consistency across different numbers of sound sources, such as one and two-or-more-sound cases. These two problems were solved using our proposed method, which involves an independent location model for the former and an block-wise consistent labeling with ordering for the latter. Our experiments indicated that the SSL based on DNNs trained by our proposed training method out-performed a conventional SSL method by a maximum of 18 points in terms of block-level correctness.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116873859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Boosting performance on low-resource languages by standard corpora: An analysis
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846329
F. Grézl, M. Karafiát
In this paper, we analyze the feasibility of using a single well-resourced language, English, as the source language for multilingual techniques in the context of a Stacked Bottle-Neck tandem system. The effect of the amount of data and the number of tied states in the source language on the performance of the ported system is evaluated, together with different porting strategies. In general, increasing both the amount of data and the level of detail is beneficial, with a greater effect observed when increasing the number of tied states. The modified neural network structure, previously shown to be useful for multilingual porting, was also evaluated with its specific porting procedure. Using the original NN structure in combination with the modified adapt-adapt porting strategy was found to be the best, achieving relative improvements of 3.5–8.8% on a variety of target languages. These results are comparable to those obtained using multilingual NNs pretrained on 7 languages.
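A rough sketch of the porting idea, under assumed sizes and schedule: a network trained on the well-resourced source language keeps its hidden (and bottleneck) layers, its language-specific output layer is replaced by one sized for the target language's tied states, and the result is then adapted on the small target-language data set. The freeze-then-fine-tune schedule shown here is only an approximation of the adapt-adapt strategy described in the paper.

```python
# Porting a source-language (English) network to a low-resource target language.
import torch
import torch.nn as nn

source_nn = nn.Sequential(
    nn.Linear(440, 1500), nn.Sigmoid(),
    nn.Linear(1500, 80), nn.Sigmoid(),      # bottleneck layer (kept)
    nn.Linear(80, 1500), nn.Sigmoid(),
    nn.Linear(1500, 9000),                  # English tied-state output (discarded)
)
# ... load English-trained weights into source_nn here ...

n_target_states = 2500
ported = nn.Sequential(*list(source_nn.children())[:-1],
                       nn.Linear(1500, n_target_states))   # new target-language output layer

# Stage 1: train only the new output layer on target data; stage 2 would unfreeze the rest.
for p in ported[:-1].parameters():
    p.requires_grad = False
opt = torch.optim.SGD(filter(lambda p: p.requires_grad, ported.parameters()), lr=0.01)
```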
{"title":"Boosting performance on low-resource languages by standard corpora: An analysis","authors":"F. Grézl, M. Karafiát","doi":"10.1109/SLT.2016.7846329","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846329","url":null,"abstract":"In this paper, we analyze the feasibility of using single well-resourced language - English - as a source language for multilingual techniques in context of Stacked Bottle-Neck tandem system. The effect of amount of data and number of tied-states in the source language on performance of ported system is evaluated together with different porting strategies. Generally, increasing data amount and level-of-detail both is positive. A greater effect is observed for increasing number of tied states. The modified neural network structure, shown useful for multilingual porting, was also evaluated with its specific porting procedure. Using original NN structure in combination with modified porting adapt-adapt strategy was fount as best. It achieves relative improvement 3.5–8.8% on variety of target languages. These results are comparable with using multilingual NNs pretrained on 7 languages.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127475041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward human-assisted lexical unit discovery without text resources
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846246
C. Bartels, Wen Wang, V. Mitra, Colleen Richey, A. Kathol, D. Vergyri, H. Bratt, C. Hung
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which are often available even if text resources are not. We create a framework that benefits from such resources without assuming orthographic representations and without generating word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain the phone recognition output and to constrain lexical unit discovery on that output.
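The search step can be pictured with a generic approximate-matching baseline (not the authors' specific implementation): a query phone sequence is slid over the decoded phone string and scored by edit distance, and positions within a small edit budget are returned as candidate occurrences of the lexical unit.

```python
# Fuzzy sub-string matching of a phone-sequence query against a decoded phone string.

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_find(query, phones, max_dist=1):
    """Return start indices where `query` matches `phones` within `max_dist` edits."""
    hits = []
    for start in range(len(phones) - len(query) + 1):
        window = phones[start:start + len(query)]
        if edit_distance(query, window) <= max_dist:
            hits.append(start)
    return hits

decoded = "sil k ae t s ae t sil dh ax k ae t".split()
print(fuzzy_find("k ae t".split(), decoded))   # [1, 4, 10]: exact hits at 1 and 10, one-edit hit at 4
```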
{"title":"Toward human-assisted lexical unit discovery without text resources","authors":"C. Bartels, Wen Wang, V. Mitra, Colleen Richey, A. Kathol, D. Vergyri, H. Bratt, C. Hung","doi":"10.1109/SLT.2016.7846246","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846246","url":null,"abstract":"This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach in contrast investigates the use of linguistic and speaker knowledge which are often available even if text resources are not. We create a framework that benefits from such resources, not assuming orthographic representations and avoiding generation of word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain phone recognition output and to constrain lexical unit discovery on the phone recognizer output.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"305 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134152038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846321
Spyridon Thermos, G. Potamianos
Motivated by the increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, which also allows speech overlap to be considered. Two sensory setups are employed, where depth video captures either a frontal or profile view of the subjects and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is considered, using audio and planar video from a sensor at the complementary view. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it.
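A simple sketch of the fusion and classification stage: per-frame feature vectors from the audio, planar-video, and depth streams are concatenated, and an SVM labels each frame of a visually detected subject as speaking or not speaking. Feature extraction is omitted, and all dimensions and data below are synthetic placeholders.

```python
# Feature-level fusion of audio, planar-video, and depth streams for speech activity detection.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames = 500
audio = rng.normal(size=(n_frames, 13))     # e.g. MFCCs
video = rng.normal(size=(n_frames, 20))     # e.g. mouth-region appearance features
depth = rng.normal(size=(n_frames, 20))     # e.g. depth-based mouth/jaw features
labels = rng.integers(0, 2, size=n_frames)  # 1 = subject is speaking in this frame

fused = np.hstack([audio, video, depth])    # concatenate the modality streams
clf = SVC(kernel="rbf").fit(fused[:400], labels[:400])
print(clf.score(fused[400:], labels[400:]))
```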
{"title":"Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view","authors":"Spyridon Thermos, G. Potamianos","doi":"10.1109/SLT.2016.7846321","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846321","url":null,"abstract":"Motivated by increasing popularity of depth visual sensors, such as the Kinect device, we investigate the utility of depth information in audio-visual speech activity detection. A two-subject scenario is assumed, allowing to also consider speech overlap. Two sensory setups are employed, where depth video captures either a frontal or profile view of the subjects, and is subsequently combined with the corresponding planar video and audio streams. Further, multi-view fusion is regarded, using audio and planar video from a sensor at the complementary view setup. Support vector machines provide temporal speech activity classification for each visually detected subject, fusing the available modality streams. Classification results are further combined to yield speaker diarization. Experiments are reported on a suitable audio-visual corpus recorded by two Kinects. Results demonstrate the benefits of depth information, particularly in the frontal depth view setup, reducing speech activity detection and speaker diarization errors over systems that ignore it.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133200055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker independent diarization for child language environment analysis using deep neural networks
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846253
M. Najafian, J. Hansen
Large-scale monitoring of a child's language environment, by measuring the amount of speech directed to the child by other children and adults during vocal communication, is an important task. Using audio extracted from a recording unit worn by a child within a childcare center, our proposed diarization system can determine, at each point in time, the content of the child's language environment by categorizing the audio into one of four major categories: (1) speech initiated by the child wearing the recording unit, speech originated by other (2) children or (3) adults and directed at the primary child, and (4) non-speech content. In this study, we exploit complex Hidden Markov Models (HMMs) with multiple states to model the temporal dependencies between different sources of acoustic variability and estimate the HMM state output probabilities using deep neural networks as a discriminative modeling approach. The proposed system is robust against common diarization errors caused by rapid turn-taking, between-class similarities, and background noise, without the need for prior clustering techniques. The experimental results confirm that this approach outperforms state-of-the-art Gaussian mixture model based diarization without requiring bottom-up clustering and leads to a 22.24% relative error reduction.
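The decoding side of such a system can be sketched as follows: given per-frame posteriors over the four classes (produced here by a random stand-in for the DNN), a Viterbi pass over an HMM transition matrix with strong self-loops yields a temporally smoothed class sequence. The transition probabilities and posteriors are illustrative assumptions, not values from the paper.

```python
# Viterbi smoothing of per-frame class posteriors for four-class diarization.
import numpy as np

CLASSES = ["primary_child", "other_child", "adult", "non_speech"]

def viterbi(log_post, log_trans):
    """log_post: (T, K) frame log-posteriors; log_trans: (K, K) log transition probs."""
    T, K = log_post.shape
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = log_post[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans    # rows: previous state, cols: next state
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_post[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(1)
posteriors = rng.dirichlet(np.ones(4), size=100)    # stand-in for DNN outputs
trans = np.full((4, 4), 0.02) + np.eye(4) * 0.92    # strong self-loops discourage rapid switching
labels = viterbi(np.log(posteriors), np.log(trans))
print([CLASSES[k] for k in labels[:10]])
```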
{"title":"Speaker independent diarization for child language environment analysis using deep neural networks","authors":"M. Najafian, J. Hansen","doi":"10.1109/SLT.2016.7846253","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846253","url":null,"abstract":"Large-scale monitoring of the child language environment through measuring the amount of speech directed to the child by other children and adults during a vocal communication is an important task. Using the audio extracted from a recording unit worn by a child within a childcare center, at each point in time our proposed diarization system can determine the content of the child's language environment, by categorizing the audio content into one of the four major categories, namely (1) speech initiated by the child wearing the recording unit, speech originated by other (2) children or (3) adults and directed at the primary child, and (4) non-speech contents. In this study, we exploit complex Hidden Markov Models (HMMs) with multiple states to model the temporal dependencies between different sources of acoustic variability and estimate the HMM state output probabilities using deep neural networks as a discriminative modeling approach. The proposed system is robust against common diarization errors caused by rapid turn takings, between class similarities, and background noise without the need to prior clustering techniques. The experimental results confirm that this approach outperforms the state-of-the-art Gaussian mixture model based diarization without the need for bottom-up clustering and leads to 22.24% relative error reduction.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134054672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic adjustment of language models for automatic speech recognition using word similarity
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846299
Anna Currey, I. Illina, D. Fohr
Out-of-vocabulary (OOV) words can pose a particular problem for automatic speech recognition (ASR) of broadcast news. The language models (LMs) of ASR systems are typically trained on static corpora, whereas new words (particularly new proper nouns) are continually introduced in the media. Additionally, such OOVs are often content-rich proper nouns that are vital to understanding the topic. In this work, we explore methods for dynamically adding OOVs to language models by adapting the n-gram language model used in our ASR system. We propose two strategies: the first relies on finding in-vocabulary (IV) words similar to the OOVs, where word embeddings are used to define similarity. Our second strategy leverages a small contemporary corpus to estimate OOV probabilities. The models we propose yield improvements in perplexity over the baseline; in addition, the corpus-based approach leads to a significant decrease in proper noun error rate over the baseline in recognition experiments.
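A toy sketch of the first, similarity-based strategy: an OOV word borrows (and splits) probability mass from its nearest in-vocabulary neighbours in an embedding space. The embeddings, vocabulary, probabilities, and the add_oov helper below are made up for illustration; a real system would modify the n-gram LM used by the ASR system directly.

```python
# Assigning an OOV word a probability derived from similar in-vocabulary words.
import numpy as np

embeddings = {            # assumed pre-trained word vectors
    "president": np.array([0.9, 0.1, 0.0]),
    "minister":  np.array([0.8, 0.2, 0.1]),
    "tuesday":   np.array([0.0, 0.9, 0.4]),
}
unigram_prob = {"president": 0.004, "minister": 0.003, "tuesday": 0.002}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def add_oov(oov_word, oov_vec, k=2, share=0.5):
    """Give the OOV a probability borrowed from its k most similar IV words."""
    sims = sorted(embeddings, key=lambda w: cosine(oov_vec, embeddings[w]), reverse=True)[:k]
    borrowed = sum(unigram_prob[w] * share / k for w in sims)
    for w in sims:                      # neighbours give up the borrowed mass, keeping the total constant
        unigram_prob[w] -= unigram_prob[w] * share / k
    unigram_prob[oov_word] = borrowed

add_oov("chancellor", np.array([0.85, 0.15, 0.05]))
print(unigram_prob["chancellor"], sum(unigram_prob.values()))
```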
{"title":"Dynamic adjustment of language models for automatic speech recognition using word similarity","authors":"Anna Currey, I. Illina, D. Fohr","doi":"10.1109/SLT.2016.7846299","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846299","url":null,"abstract":"Out-of-vocabulary (OOV) words can pose a particular problem for automatic speech recognition (ASR) of broadcast news. The language models (LMs) of ASR systems are typically trained on static corpora, whereas new words (particularly new proper nouns) are continually introduced in the media. Additionally, such OOVs are often content-rich proper nouns that are vital to understanding the topic. In this work, we explore methods for dynamically adding OOVs to language models by adapting the n-gram language model used in our ASR system. We propose two strategies: the first relies on finding in-vocabulary (IV) words similar to the OOVs, where word embeddings are used to define similarity. Our second strategy leverages a small contemporary corpus to estimate OOV probabilities. The models we propose yield improvements in perplexity over the baseline; in addition, the corpus-based approach leads to a significant decrease in proper noun error rate over the baseline in recognition experiments.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130663475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptation of SVM for MIL for inferring the polarity of movies and movie reviews
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846274
M. J. Correia, I. Trancoso, B. Raj
Polarity detection is a research topic of major interest, with many applications including detecting the polarity of product reviews. However, in some cases, the polarity of the product reviews might not be available while the polarity of the product itself might be, prohibiting the use of any form of fully supervised learning technique. This scenario, while different, is close to that of multiple instance learning (MIL). In this work we propose two new adaptations of support vector machines (SVMs) for MIL, θ-MIL, to suit this new scenario and infer the polarity of products and product reviews. We perform experiments on the proposed methods using the IMDb movie review corpus, and compare their performance to the traditional SVM for MIL approach. Although we make weaker assumptions about the data, the proposed methods achieve performance comparable to SVM for MIL in accurately detecting the polarity of movies and movie reviews.
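For context, here is a sketch of the conventional baseline setting only (a plain SVM-for-MIL heuristic, not the proposed θ-MIL adaptations): each movie is a bag of review feature vectors, only the bag label is known, and a first approximation trains an instance-level SVM by propagating each bag's label to all of its instances, then aggregates instance decisions to predict bag polarity. All data below is synthetic.

```python
# A simple SVM-for-MIL baseline: bag labels are propagated to instances.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
bags = [rng.normal(loc=+1.0, size=(5, 50)) for _ in range(20)] + \
       [rng.normal(loc=-1.0, size=(5, 50)) for _ in range(20)]
bag_labels = [1] * 20 + [0] * 20               # movie-level polarity only

X = np.vstack(bags)                                        # all reviews (instances)
y = np.repeat(bag_labels, [b.shape[0] for b in bags])      # each review inherits its bag's label
clf = LinearSVC().fit(X, y)

# Bag prediction: aggregate the instance decisions (here, a majority vote).
movie_pred = [int(clf.predict(b).mean() > 0.5) for b in bags]
```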
{"title":"Adaptation of SVM for MIL for inferring the polarity of movies and movie reviews","authors":"M. J. Correia, I. Trancoso, B. Raj","doi":"10.1109/SLT.2016.7846274","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846274","url":null,"abstract":"Polarity detection is a research topic of major interest, with many applications including detecting the polarity of product reviews. However, in some cases, the polarity of the product reviews might not be available while the polarity of the product itself might be, prohibiting the use of any form of fully supervised learning technique. This scenario, while different, is close to that of multiple instance learning (MIL). In this work we propose two new adaptations of support vector machines (SVM) for MIL, θ-MIL, to suit this new scenario, and infer the polarity of products and product reviews. We perform experiments on the proposed methods using the IMDb movie review corpus, and compare the performance of the proposed methods to the traditional SVM for MIL approach. Although we make weaker assumptions about the data, the proposed methods achieve a comparable performance to the SVM for MIL in accurately detecting the polarity of movies and movie reviews.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134532302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Entropy-based pruning of hidden units to reduce DNN parameters
Pub Date: 2016-12-01 | DOI: 10.1109/SLT.2016.7846335
G. Mantena, K. Sim
For acoustic modeling, the use of DNNs has become popular due to the superior performance improvements observed in many automatic speech recognition (ASR) tasks. Typically, DNNs with deep (many layers) and wide (many hidden units per layer) architectures are chosen in order to achieve good gains. An issue with such approaches is that there is an explosion in the number of learnable parameters. Thus, it is often difficult to build models when there is not a sufficient amount of training data (or data for adaptation), and the large model size also limits the use of ASR systems on hand-held devices such as mobile phones. A method to overcome this issue is to reduce the number of parameters. In this work, we provide a framework to effectively reduce the number of parameters by removing hidden units. Each hidden unit is represented by an activity vector associated with speech attributes such as phones. A normalized entropy-based measure, which reflects the significance of these units in the DNN model, is computed from these activity vectors. For comparison, we also use low-rank matrix factorization to reduce the number of parameters. We show that low-rank matrix factorization can reduce the number of parameters only to a certain extent. Thus, we combine the pruning technique with low-rank matrix factorization to further reduce the model. We provide detailed experimental results on the Aurora-4 and TEDLIUM databases and show that the models can be reduced to approximately 20–30% of their initial size without much loss in ASR performance.
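Both reduction ideas can be illustrated briefly. First, each hidden unit gets an activity vector of mean activations per phone class, and a normalized entropy over that vector serves as the significance measure; which end of the entropy scale to prune is part of the method's design, and pruning the highest-entropy (least selective) units here is an assumption. Second, a weight matrix is replaced by a low-rank product obtained from a truncated SVD. All sizes are illustrative.

```python
# Illustrative sketch of entropy-based unit pruning plus low-rank factorization.
import numpy as np

rng = np.random.default_rng(3)
n_units, n_phones = 512, 40
activity = np.abs(rng.normal(size=(n_units, n_phones)))    # mean activation per phone, per hidden unit

p = activity / activity.sum(axis=1, keepdims=True)         # per-unit distribution over phones
entropy = -(p * np.log(p)).sum(axis=1) / np.log(n_phones)  # normalized entropy in [0, 1]
keep = np.argsort(entropy)[: n_units // 2]                 # keep the 50% most phone-selective units

W = rng.normal(size=(n_units, 2048))                       # a layer's weight matrix
U, s, Vt = np.linalg.svd(W[keep], full_matrices=False)     # factorize the pruned layer
r = 64                                                     # target rank
W_low = (U[:, :r] * s[:r]) @ Vt[:r]                        # two small factors replace one big matrix
```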
{"title":"Entropy-based pruning of hidden units to reduce DNN parameters","authors":"G. Mantena, K. Sim","doi":"10.1109/SLT.2016.7846335","DOIUrl":"https://doi.org/10.1109/SLT.2016.7846335","url":null,"abstract":"For acoustic modeling, the use of DNN has become popular due to its superior performance improvements observed in many automatic speech recognition (ASR) tasks. Typically, DNNs with deep (many layers) and wide (many hidden units per layer) architectures are chosen in order to achieve good gains. An issue with such approaches is that there is an explosion in the number of learnable parameters. Thus, it is often difficult to build models in cases where there is no sufficient amount of training data (or data for adaptation), and also limits the usage of ASR systems on hand-held devices such as mobile phones. A method to overcome this issue is to reduce the number of parameters. In this work, we provide a framework to effectively reduce the number of parameters by removing the hidden units. Each hidden unit is represented by an activity vector associated with speech attributes such as phones. A normalized entropy-based measure is computed from these activity vectors which reflects the significance of these units in the DNN model. For comparison we also use low-rank matrix factorization to reduce the number of parameters. We show that low-rank matrix factorization can reduce the number of parameters only to a certain extent. Thus, we extend the pruning technique in combination with low-rank matrix factorization to further reduce the model. In this work, we provide detailed experimental results on the Aurora-4 and TEDLIUM databases and show that the models can be reduced to approximately 20 – 30% of its initial size without much loss in the ASR performance.","PeriodicalId":281635,"journal":{"name":"2016 IEEE Spoken Language Technology Workshop (SLT)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131338888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}