Emotion recognition from spontaneous speech using Hidden Markov models with deep belief networks
Pub Date: 2013-12-01 | DOI: 10.1109/ASRU.2013.6707732
Duc Le, E. Provost
Research in emotion recognition seeks to develop insights into the temporal properties of emotion. However, automatic emotion recognition from spontaneous speech is challenging due to non-ideal recording conditions and highly ambiguous ground truth labels. Further, emotion recognition systems typically work with noisy high-dimensional data, rendering it difficult to find representative features and train an effective classifier. We tackle this problem by using Deep Belief Networks, which can model complex and non-linear high-level relationships between low-level features. We propose and evaluate a suite of hybrid classifiers based on Hidden Markov Models and Deep Belief Networks. We achieve state-of-the-art results on FAU Aibo, a benchmark dataset in emotion recognition [1]. Our work provides insights into important similarities and differences between speech and emotion.
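As a hedged sketch of how a hybrid HMM/DBN classifier of this kind can decode, the snippet below assumes a trained DBN-style network that outputs per-frame emotion-state posteriors; these are converted to scaled likelihoods (posterior divided by state prior, the standard hybrid trick) and fed to ordinary Viterbi decoding. All names and shapes are illustrative, not the paper's exact topology.

```python
import numpy as np

def viterbi_hybrid(net_posteriors, state_priors, log_trans, log_init):
    """net_posteriors: (T, S) per-frame posteriors from a DBN-style net.
    Scaled likelihoods p(x|s) ∝ P(s|x) / P(s) replace GMM emissions."""
    log_emit = np.log(net_posteriors + 1e-10) - np.log(state_priors + 1e-10)
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)   # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (S, S) predecessor scores
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):       # trace back the best state sequence
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```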
An SVD-based scheme for MFCC compression in distributed speech recognition system
Pub Date: 2013-12-01 | DOI: 10.1109/ASRU.2013.6707738
A. Touazi, M. Debyeche
This paper proposes a new scheme for low bit-rate source coding of Mel Frequency Cepstral Coefficients (MFCCs) in a Distributed Speech Recognition (DSR) system. The method factorizes the compressed ETSI Advanced Front-End (ETSI-AFE) features into SVD components. Exploiting the correlation between successive MFCC frames, the odd frames are encoded using ETSI-AFE, while for the even frames only the singular values and the index of the nearest left singular vectors are encoded and transmitted. At the server side, the non-transmitted MFCCs are reconstructed from their quantized singular values and the nearest left singular vectors. The system provides a compression bit-rate of 2.7 kbps. Recognition experiments were carried out on the Aurora-2 database for clean and multi-condition training modes. The simulation results show good recognition performance without significant degradation with respect to the ETSI-AFE encoder.
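A minimal sketch of the underlying idea — low-rank SVD coding of a block of MFCC frames — is shown below. This illustrates only the generic encode/reconstruct step; the paper's actual odd/even frame pairing and ETSI-AFE quantization are more involved, and the block size, rank, and names here are illustrative assumptions.

```python
import numpy as np

def svd_encode(mfcc_block, rank):
    """mfcc_block: (n_frames, n_ceps). Keep the top-`rank` SVD components."""
    U, s, Vt = np.linalg.svd(mfcc_block, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

def svd_decode(U_k, s_k, Vt_k):
    """Rank-k reconstruction of the MFCC block."""
    return (U_k * s_k) @ Vt_k

mfcc = np.random.randn(24, 13)            # toy 24-frame block of 13 MFCCs
U_k, s_k, Vt_k = svd_encode(mfcc, rank=4)
approx = svd_decode(U_k, s_k, Vt_k)
rel_err = np.linalg.norm(mfcc - approx) / np.linalg.norm(mfcc)
```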
A generalized discriminative training framework for system combination
Pub Date: 2013-12-01 | DOI: 10.1109/ASRU.2013.6707703
Yuuki Tachioka, Shinji Watanabe, Jonathan Le Roux, J. Hershey
This paper proposes a generalized discriminative training framework for system combination, which encompasses acoustic modeling (Gaussian mixture models and deep neural networks) and discriminative feature transformation. To improve performance by combining base systems with complementary systems, the complementary systems should perform reasonably well while tending to produce outputs different from those of the base system. Although it is difficult to balance these two somewhat opposing targets in conventional heuristic combination approaches, our framework provides a new objective function that makes it possible to adjust the balance within a sequential discriminative training criterion. We also describe how the proposed method relates to boosting methods. Experiments on a highly noisy medium-vocabulary speech recognition task (track 2 of the 2nd CHiME challenge) and an LVCSR task (Corpus of Spontaneous Japanese) show the effectiveness of the proposed method compared with a conventional system combination approach.
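The toy sketch below conveys the spirit of the boosting connection — reweight training utterances toward base-system errors, then combine the two systems' frame posteriors log-linearly. It is not the paper's lattice-based sequence-discriminative objective; all names and the utterance-level weighting are illustrative assumptions.

```python
import numpy as np

def complementary_weights(base_correct, alpha=1.0):
    """base_correct: bool array per utterance; base-system errors get
    exponentially larger training weight (boosting-flavoured)."""
    w = np.exp(alpha * (~base_correct).astype(float))
    return w / w.sum()

def combine_posteriors(p_base, p_comp, lam=0.5):
    """Log-linear combination of two systems' (T, S) frame posteriors."""
    log_p = lam * np.log(p_base + 1e-10) + (1 - lam) * np.log(p_comp + 1e-10)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)
```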
Expert-based reward shaping and exploration scheme for boosting policy learning of dialogue management
Pub Date: 2013-12-01 | DOI: 10.1109/ASRU.2013.6707714
Emmanuel Ferreira, F. Lefèvre
This paper investigates the conditions under which expert knowledge can be used to accelerate the policy optimization of a learning agent. Recent work on reinforcement learning for dialogue management has devised sophisticated value-estimation methods that jointly address the exploration/exploitation dilemma, sample efficiency, and non-stationary environments. In this paper, a reward shaping method and an exploration scheme, both based on intuitive hand-coded expert advice, are combined with an efficient temporal-difference-based learning procedure. The key objective is to boost the initial training stage, when the system is not yet reliable enough to interact with real users (e.g. clients). Our claims are illustrated by simulation experiments carried out using a state-of-the-art goal-oriented dialogue management framework, the Hidden Information State (HIS).
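As a hedged illustration of reward shaping, the sketch below shows the standard potential-based form F(s, s') = γΦ(s') − Φ(s) inside a tabular Q-learning update, where Φ could encode hand-coded expert advice (e.g. higher potential for dialogue states closer to task completion). The tabular setting and all names are illustrative; the paper works with the HIS framework and more elaborate value estimation.

```python
import numpy as np

def shaped_q_update(Q, s, a, r, s_next, Phi, alpha=0.1, gamma=0.99):
    """One Q-learning step with potential-based (policy-invariant) shaping.
    Q: (n_states, n_actions) table; Phi: (n_states,) expert potential."""
    shaping = gamma * Phi[s_next] - Phi[s]          # F(s, s')
    td_target = r + shaping + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```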
Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling
Pub Date: 2013-09-05 | DOI: 10.1109/ASRU.2013.6707747
Tara N. Sainath, L. Horesh, Brian Kingsbury, A. Aravkin, B. Ramabhadran
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims to speed up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5× speed-up, while on a 300-hr Switchboard task they provide over a 2.3× speedup, with no loss in WER. These results suggest that even further speed-ups can be expected as problems scale and complexity grows.
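A minimal sketch of the geometric sampling idea follows: the amount of data used for gradient and Krylov-subspace computations grows by a fixed factor each Hessian-free iteration until the full set is reached. The starting size and growth factor are illustrative placeholders, not values from the paper.

```python
def geometric_sample_sizes(n_total, n_start=1000, growth=1.5, max_iters=25):
    """Return the per-iteration sample sizes, capped at the full dataset."""
    sizes, n = [], float(n_start)
    for _ in range(max_iters):
        sizes.append(min(int(n), n_total))
        if sizes[-1] == n_total:   # full data reached; stop growing
            break
        n *= growth
    return sizes

# e.g. geometric_sample_sizes(100000) -> [1000, 1500, 2250, ..., 100000]
```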
Improvements to Deep Convolutional Neural Networks for LVCSR
Pub Date: 2013-09-05 | DOI: 10.1109/ASRU.2013.6707749
Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, B. Ramabhadran
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing relative improvements in word error rate (WER) of 4-12% over DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
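Two of the abstract's ingredients admit very small sketches, given below under stated assumptions: applying an already-estimated per-speaker fMLLR affine transform x' = Ax + b to feature frames, and a standard inverted-dropout mask of the kind used during training. Estimating A and b requires a maximum-likelihood procedure against an acoustic model and is not shown; shapes and names are illustrative.

```python
import numpy as np

def apply_fmllr(frames, A, b):
    """frames: (T, d); A: (d, d); b: (d,) -> speaker-normalized (T, d)."""
    return frames @ A.T + b

def dropout(h, rate, rng):
    """Inverted dropout on activations h; at test time, use h unchanged."""
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
h_drop = dropout(h, rate=0.5, rng=rng)
```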
Investigation of multilingual deep neural networks for spoken term detection
Pub Date: 2013-09-03 | DOI: 10.1109/ASRU.2013.6707719
K. Knill, M. Gales, S. Rath, P. Woodland, Chao Zhang, Shi-Xiong Zhang
The development of high-performance speech processing systems for low-resource languages is a challenging area. One approach to addressing the lack of resources is to make use of data from multiple languages. A popular direction in recent years is to use bottleneck features, or hybrid systems, trained on multilingual data for speech-to-text (STT) systems. This paper presents an investigation into the application of these multilingual approaches to spoken term detection. Experiments were run using the IARPA Babel limited language pack corpora (~10 hours/language), with four languages for initial multilingual system development and an additional held-out target language. STT gains achieved through using multilingual bottleneck features in a Tandem configuration are shown to also apply to keyword search (KWS). Further improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training-set languages. Adapted hybrid systems performed slightly worse on average than the adapted Tandem systems. A test with language-independent acoustic models on the target language showed that at least minimal retraining or adaptation of the acoustic models to the target language is currently needed to achieve reasonable performance.
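The sketch below shows the generic Tandem construction the abstract refers to: run frames through a (hypothetical) multilingual DNN, take the activations at the narrow bottleneck layer, and append them to conventional features before GMM-HMM training. Sigmoid hidden units are an assumption in the spirit of 2013-era systems; all names are illustrative.

```python
import numpy as np

def bottleneck_features(x, weights, biases, bn_layer):
    """Forward x (T, d) through sigmoid layers; return the activations at
    layer index `bn_layer` (the narrow bottleneck)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
        if i == bn_layer:
            return h
    return h

def tandem_features(plp, bn):
    """Concatenate conventional features (T, d1) with bottleneck (T, d2)."""
    return np.hstack([plp, bn])
```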
Modified splice and its extension to non-stereo data for noise robust speech recognition
Pub Date: 2013-07-15 | DOI: 10.1109/ASRU.2013.6707725
D. S. P. Kumar, N. Prasad, Vikas Joshi, S. Umesh
In this paper, we propose a modification to the training process of the popular SPLICE algorithm for noise robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets, where clean and noisy training utterances are required but not their stereo counterparts. Finally, an MLLR-based, computationally efficient run-time noise adaptation method in the SPLICE framework is proposed. The modified SPLICE shows an 8.6% absolute improvement over SPLICE on Test C of the Aurora-2 database, and 2.93% overall. The non-stereo method shows 10.37% and 6.93% absolute improvements over the Aurora-2 and Aurora-4 baseline models, respectively. Run-time adaptation shows a 9.89% absolute improvement in the modified framework compared to SPLICE on Test C, and 4.96% overall with respect to standard MLLR adaptation on HMMs.
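For context, a minimal sketch of the standard SPLICE mapping that the paper modifies: a noisy frame y is cleaned as a posterior-weighted sum of per-component affine corrections learned from stereo training data. The GMM posteriors and the (A_k, b_k) parameters come from training and are assumed given; names are illustrative.

```python
import numpy as np

def splice_enhance(y, posteriors, A, b):
    """y: (d,) noisy frame; posteriors: (K,) GMM component posteriors;
    A: (K, d, d) rotations; b: (K, d) biases -> cleaned frame (d,)."""
    corrected = np.einsum('kij,j->ki', A, y) + b   # (K, d), one per component
    return posteriors @ corrected                  # posterior-weighted sum
```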
NMF-based keyword learning from scarce data
DOI: 10.1109/ASRU.2013.6707762
B. Ons, J. Gemmeke, H. V. hamme
This research is situated in a project aimed at the development of a vocal user interface (VUI) that learns to understand its users, specifically persons with a speech impairment. The vocal interface adapts to the speech of the user by learning the vocabulary from interaction examples. Word learning is implemented through weakly supervised non-negative matrix factorization (NMF). The goal of this study is to investigate how we can improve word learning when the number of interaction examples is low. We demonstrate two approaches to training NMF models on scarce data: 1) training word models using smoothed training data, and 2) training word models that strictly correspond to the grounding information derived from a few interaction examples. We found that both approaches can substantially improve word learning from scarce training data.
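A minimal NMF sketch with Lee-Seung multiplicative updates (Euclidean objective) is given below. In the weakly supervised setting described above, V would stack grounding (word-label) rows on top of acoustic co-occurrence rows, so each column of W couples a label part with an acoustic part; this toy version omits that structure and all parameters are illustrative.

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Factorize non-negative V (m, n) as W (m, rank) @ H (rank, n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H
```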
A study of supervised intrinsic spectral analysis for TIMIT phone classification
DOI: 10.1109/ASRU.2013.6707739
Reza Sahraeian, Dirk Van Compernolle
Intrinsic Spectral Analysis (ISA) has been formulated within a manifold learning setting, allowing natural extensions to out-of-sample data together with feature reduction in a learning framework. In this paper, we propose two approaches to improve the performance of supervised ISA, and we examine the effect of applying a linear discriminant technique in the intrinsic subspace compared with the extrinsic one. In the interest of reducing complexity, we propose a preprocessing operation that finds a small subset of data points that is well representative of the manifold structure; this is accomplished by maximizing the quadratic Renyi entropy. Furthermore, we use class-based graphs, which not only simplify the problem but can also be helpful in a classification task. Experimental results for the phone classification task on the TIMIT dataset show that ISA features improve performance compared with traditional features, and that supervised discriminant techniques perform better in the ISA subspace than in conventional feature spaces.
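A greedy sketch of the subset-selection idea: under a Gaussian kernel, the quadratic Renyi entropy estimate is H2 = -log(mean_ij k(x_i, x_j)), so maximizing it amounts to minimizing the kernel mass within the chosen subset; each step below adds the point that contributes the least mass so far. This is a simple heuristic in the spirit of the paper, not its exact procedure, and sigma is an illustrative choice.

```python
import numpy as np

def greedy_renyi_subset(X, m, sigma=1.0):
    """Pick m points from X (n, d) that keep pairwise kernel mass low,
    i.e. keep the quadratic Renyi entropy estimate high."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    K = np.exp(-d2 / (2.0 * sigma ** 2))                  # Gaussian kernel
    chosen = [int(K.sum(axis=1).argmin())]                # most isolated point
    while len(chosen) < m:
        rest = [i for i in range(n) if i not in chosen]
        added_mass = [K[i, chosen].sum() for i in rest]
        chosen.append(rest[int(np.argmin(added_mass))])
    return chosen
```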