Topic n-gram count language model adaptation for speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424216
Md. Akmal Haidar, D. O'Shaughnessy
We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft or hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram over all topics equals its global count in the training set: the normalized topic weights of the n-gram are multiplied by the global n-gram count to form the topic n-gram count for each topic. In hard clustering, each n-gram is assigned to a single topic, which receives the maximum fraction of the global n-gram count; that topic is selected using the maximum topic weight for the n-gram. The topic n-gram count LMs are built from the respective topic n-gram counts and adapted using the topic weights of a development test set. The topic weights are obtained by averaging two confidence measures, the probability of a word given a topic and the probability of a topic given a word, over the words in an n-gram and over the development test set respectively. Our approaches outperform several traditional approaches on the WSJ corpus.
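As a rough illustration of the two count-assignment schemes (this is not the authors' code; the function and variable names are ours), the sketch below distributes a global n-gram count over topics given its topic weights:

```python
# Rough illustration of the soft/hard topic count assignment described above
# (not the authors' code; names and values are illustrative).
import numpy as np

def topic_ngram_counts(global_count, topic_weights, hard=False):
    """Distribute a global n-gram count over topics.

    global_count  -- count of the n-gram in the training set
    topic_weights -- topic weights of the n-gram (one per topic)
    hard          -- if True, only the maximum-weight topic receives a count
    """
    w = np.asarray(topic_weights, dtype=float)
    w = w / w.sum()                        # normalized topic weights
    if hard:
        counts = np.zeros_like(w)
        k = np.argmax(w)                   # topic with the maximum topic weight
        counts[k] = w[k] * global_count    # that topic keeps its (maximum) fraction
        return counts
    return w * global_count                # soft: every topic gets its fraction

# An n-gram seen 10 times, with three topics:
print(topic_ngram_counts(10, [0.2, 0.5, 0.3]))             # [2. 5. 3.]
print(topic_ngram_counts(10, [0.2, 0.5, 0.3], hard=True))  # [0. 5. 0.]
```

The soft counts always sum to the global count, while the hard variant, read literally from the description above, keeps only the maximum fraction for the selected topic.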
{"title":"Topic n-gram count language model adaptation for speech recognition","authors":"Md. Akmal Haidar, D. O'Shaughnessy","doi":"10.1109/SLT.2012.6424216","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424216","url":null,"abstract":"We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft and hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram for all topics is equal to the global count of that n-gram in the training set. Here, the normalized topic weights of the n-gram are multiplied by the global n-gram count to form the topic n-gram count for the respective topics. In hard clustering, each n-gram is assigned to a single topic with the maximum fraction of the global n-gram count for the corresponding topic. Here, the topic is selected using the maximum topic weight for the n-gram. The topic n-gram count LMs are created using the respective topic n-gram counts and adapted by using the topic weights of a development test set. We compute the average of the confidence measures: the probability of word given topic and the probability of topic given word. The average is taken over the words in the n-grams and the development test set to form the topic weights of the n-grams and the development test set respectively. Our approaches show better performance over some traditional approaches using the WSJ corpus.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130771343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical methods for varying the degree of articulation in new HMM-based voices
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424238
B. Picart, Thomas Drugman, T. Dutoit
This paper focuses on the automatic modification of the degree of articulation (hypo-/hyperarticulation) of an existing standard neutral voice in the framework of HMM-based speech synthesis. Starting from a source speaker for which neutral, hypoarticulated and hyperarticulated speech data are available, two sets of transformations are computed during the adaptation of the neutral speech synthesizer. These transformations are then applied to a new target speaker for which no hypo-/hyperarticulated recordings are available. Four statistical methods are investigated, differing in the speaking-style adaptation technique (MLLR vs. CMLLR) and in the speaking-style transposition approach (phonetic vs. acoustic correspondence) they use. This study focuses on the prosody model, although such techniques can be applied to any stream of parameters with suitable interpolability properties. Two subjective evaluations are performed to determine which statistical transformation method achieves the best segmental quality and reproduction of the articulation degree.
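For readers unfamiliar with (C)MLLR-style adaptation, the toy sketch below shows the basic operation involved, applying an affine transform to prosodic parameter vectors; the matrices here are invented placeholders, not the transformations estimated in the paper.

```python
# Minimal sketch of applying a (C)MLLR-style affine transform to prosodic
# parameter vectors, e.g. per-state [log-f0, duration] means.
# All values below are toy placeholders, not estimated transforms.
import numpy as np

def apply_affine(features, A, b):
    """Apply x -> A x + b to each row of `features`."""
    return features @ A.T + b

neutral = np.array([[5.0, 0.12],
                    [5.2, 0.10]])      # toy neutral prosody vectors
A = np.array([[1.05, 0.0],
              [0.0,  0.85]])           # toy "hyperarticulation-like" transform
b = np.array([0.10, 0.02])

print(apply_affine(neutral, A, b))     # transformed prosody vectors
```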
{"title":"Statistical methods for varying the degree of articulation in new HMM-based voices","authors":"B. Picart, Thomas Drugman, T. Dutoit","doi":"10.1109/SLT.2012.6424238","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424238","url":null,"abstract":"This paper focuses on the automatic modification of the degree of articulation (hypo/hyperarticulation) of an existing standard neutral voice in the framework of HMM-based speech synthesis. Starting from a source speaker for which neutral, hypo and hyperarticulated speech data are available, two sets of transformations are computed during the adaptation of the neutral speech synthesizer. These transformations are then applied to a new target speaker for which no hypo/hyperarticulated recordings are available. Four statistical methods are investigated, differing in the speaking style adaptation technique (MLLR vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. This study focuses on the prosody model although such techniques can be applied to any stream of parameters exhibiting suited interpolability properties. Two subjective evaluations are performed in order to determine which statistical transformation method achieves the better segmental quality and reproduction of the articulation degree.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133893657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic classification of unequal lexical stress patterns using machine learning algorithms
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424255
M. Shahin, B. Ahmed, K. Ballard
Technology-based speech therapy systems are severely handicapped by the absence of accurate prosodic event identification algorithms. This paper introduces an automatic method for the classification of strong-weak (SW) and weak-strong (WS) stress patterns in children's speech with an American English accent, for use in the assessment of speech dysprosody. We investigate the ability of two sets of features, used to train classifiers, to identify the variation in lexical stress between two consecutive syllables. The first set consists of traditional features derived from measurements of pitch, intensity and duration, whereas the second set consists of energies of different filter banks. Three classifiers were used in the experiments: an Artificial Neural Network (ANN) with a single hidden layer, a Support Vector Machine (SVM) with both linear and Gaussian kernels, and Maximum Entropy modeling (MaxEnt). The best results were obtained using the ANN classifier and a combination of the two feature sets. The system correctly classified 94% of the SW stress patterns and 76% of the WS stress patterns.
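A minimal sketch of this kind of pairwise stress classification, assuming synthetic difference features between consecutive syllables and a single-hidden-layer network (the data, dimensions, and thresholds are invented; the paper's actual features and corpus are not reproduced):

```python
# Illustrative sketch (not the authors' exact setup): classify SW vs. WS stress
# from differences in pitch, intensity and duration between consecutive syllables,
# using a single-hidden-layer neural network.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_pair_features(n, strong_first):
    # toy features: [delta_pitch, delta_intensity, delta_duration] between syllables
    sign = 1.0 if strong_first else -1.0
    return sign * np.abs(rng.normal([20.0, 6.0, 0.08], [5.0, 2.0, 0.03], size=(n, 3)))

X = np.vstack([make_pair_features(200, True), make_pair_features(200, False)])
y = np.array([1] * 200 + [0] * 200)        # 1 = strong-weak, 0 = weak-strong

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```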
{"title":"Automatic classification of unequal lexical stress patterns using machine learning algorithms","authors":"M. Shahin, B. Ahmed, K. Ballard","doi":"10.1109/SLT.2012.6424255","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424255","url":null,"abstract":"Technology based speech therapy systems are severely handicapped due to the absence of accurate prosodic event identification algorithms. This paper introduces an automatic method for the classification of strong-weak (SW) and weak-strong (WS) stress patterns in children speech with American English accent, for use in the assessment of the speech dysprosody. We investigate the ability of two sets of features used to train classifiers to identify the variation in lexical stress between two consecutive syllables. The first set consists of traditional features derived from measurements of pitch, intensity and duration, whereas the second set consists of energies of different filter banks. Three different classifiers were used in the experiments: an Artificial Neural Network (ANN) classifier with a single hidden layer, Support Vector Machine (SVM) classifier with both linear and Gaussian kernels and the Maximum Entropy modeling (MaxEnt). these features. Best results were obtained using an ANN classifier and a combination of the two sets of features. The system correctly classified 94% of the SW stress patterns and 76% of the WS stress patterns.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132354438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining multiple translation systems for Spoken Language Understanding portability
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424221
Fernando García, L. Hurtado, E. Segarra, E. Arnal, G. Riccardi
We are interested in the problem of learning Spoken Language Understanding (SLU) models for multiple target languages. Learning such models requires annotated corpora, and porting to different languages would require corpora with parallel text translations and semantic annotations. In this paper we investigate how to learn an SLU model in a target language starting from no target-language text and no semantic annotation. Our proposed algorithm exploits the diversity (in performance and coverage) of multiple translation systems to transfer statistically stable word-to-concept mappings for the Romance language pair French-Spanish. Each translation system performs differently at the lexical level (in terms of BLEU), and the best performance on the semantic task is obtained by combining the systems at different stages of the portability methodology. We have evaluated the portability algorithms on the French MEDIA corpus, using French as the source language and Spanish as the target language. The experiments show the effectiveness of the proposed methods with respect to the source-language SLU baseline.
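One simple way to picture the "statistically stable mapping" idea is a voting scheme over per-system word-to-concept mappings; the sketch below is a hypothetical illustration with made-up translations and an assumed voting threshold, not the paper's exact criterion.

```python
# Hedged illustration: keep only word-to-concept mappings that are stable across
# several translation systems. The tiny "translations" and the voting threshold
# are assumptions for illustration only.
from collections import Counter

# Hypothetical per-system mappings: target (Spanish) word -> concept
system_mappings = [
    {"precio": "COST", "habitación": "ROOM", "hotel": "HOTEL"},
    {"precio": "COST", "habitación": "ROOM", "hotel": "LOCATION"},
    {"precio": "COST", "cuarto": "ROOM",     "hotel": "HOTEL"},
]

def stable_mappings(mappings, min_votes=2):
    votes = Counter()
    for m in mappings:
        for word, concept in m.items():
            votes[(word, concept)] += 1
    return {w: c for (w, c), n in votes.items() if n >= min_votes}

print(stable_mappings(system_mappings))
# {'precio': 'COST', 'habitación': 'ROOM', 'hotel': 'HOTEL'}
```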
{"title":"Combining multiple translation systems for Spoken Language Understanding portability","authors":"Fernando García, L. Hurtado, E. Segarra, E. Arnal, G. Riccardi","doi":"10.1109/SLT.2012.6424221","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424221","url":null,"abstract":"We are interested in the problem of learning Spoken Language Understanding (SLU) models for multiple target languages. Learning such models requires annotated corpora, and porting to different languages would require corpora with parallel text translation and semantic annotations. In this paper we investigate how to learn a SLU model in a target language starting from no target text and no semantic annotation. Our proposed algorithm is based on the idea of exploiting the diversity (with regard to performance and coverage) of multiple translation systems to transfer statistically stable word-to-concept mappings in the case of the romance language pair, French and Spanish. Each translation system performs differently at the lexical level (wrt BLEU). The best translation system performances for the semantic task are gained from their combination at different stages of the portability methodology. We have evaluated the portability algorithms on the French MEDIA corpus, using French as the source language and Spanish as the target language. The experiments show the effectiveness of the proposed methods with respect to the source language SLU baseline.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114692441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint language models for automatic speech recognition and understanding
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424222
Ali Orkan Bayer, G. Riccardi
Language models (LMs) are one of the main knowledge sources used by automatic speech recognition (ASR) and Spoken Language Understanding (SLU) systems. In ASR systems they are optimized to decode words from speech for a transcription task; in SLU systems they are optimized to map words into concept constructs or interpretation representations. Performance is generally optimized independently for ASR and SLU models, in terms of word accuracy and concept accuracy respectively. However, the best word accuracy does not always yield the best understanding performance. In this paper we investigate how LMs originally trained to maximize word accuracy can be parametrized to account for speech understanding constraints and maximize concept accuracy. An incremental reduction in concept error rate is observed when an LM is trained on word-to-concept mappings. We show how to optimize the joint transcription and understanding task performance in the lexical-semantic relation space.
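As a hedged illustration of a language model defined over word-to-concept mappings (not the paper's actual model), one can estimate a simple n-gram model over joint word_concept tokens:

```python
# Simple sketch: a bigram model over joint (word, concept) tokens.
# The corpus, concept labels and smoothing are invented for illustration.
from collections import Counter

tagged_corpus = [
    [("show", "REQUEST"), ("flights", "FLIGHT"), ("to", "DEST"), ("boston", "DEST")],
    [("show", "REQUEST"), ("fares", "FARE"), ("to", "DEST"), ("denver", "DEST")],
]

unigrams, bigrams = Counter(), Counter()
for sent in tagged_corpus:
    toks = ["<s>"] + [f"{w}_{c}" for w, c in sent]
    unigrams.update(toks)
    bigrams.update(zip(toks[:-1], toks[1:]))

def bigram_prob(prev, cur, alpha=0.1):
    # add-alpha smoothed conditional probability P(cur | prev)
    v = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * v)

print(bigram_prob("show_REQUEST", "flights_FLIGHT"))
```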
{"title":"Joint language models for automatic speech recognition and understanding","authors":"Ali Orkan Bayer, G. Riccardi","doi":"10.1109/SLT.2012.6424222","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424222","url":null,"abstract":"Language models (LMs) are one of the main knowledge sources used by automatic speech recognition (ASR) and Spoken Language Understanding (SLU) systems. In ASR systems they are optimized to decode words from speech for a transcription task. In SLU systems they are optimized to map words into concept constructs or interpretation representations. Performance optimization is generally designed independently for ASR and SLU models in terms of word accuracy and concept accuracy respectively. However, the best word accuracy performance does not always yield the best understanding performance. In this paper we investigate how LMs originally trained to maximize word accuracy can be parametrized to account for speech understanding constraints and maximize concept accuracy. Incremental reduction in concept error rate is observed when a LM is trained on word-to-concept mappings. We show how to optimize the joint transcription and understanding task performance in the lexical-semantic relation space.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121327277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Syllable-based prosodic analysis of Amharic read speech
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424232
O. Jokisch, Y. Gebremedhin, R. Hoffmann
Amharic, the official language of Ethiopia, is an under-resourced language. Analyzing a new corpus of Amharic read speech, this contribution surveys syllable-based prosodic variation in f0, duration and intensity in order to develop suitable prosody models for speech synthesis and recognition. The article starts with a brief description of the Amharic script, the prosodic analysis methods, an accentuation experiment using resynthesis, and a perceptual test. The main part summarizes stress-related analysis results for f0, duration and intensity and their interrelations. The quantitative variation in Amharic is comparable with the range found in well-studied languages. The observed modifications in syllable duration and mean f0 proved to be relevant for stress perception, as demonstrated in the perceptual test with resynthesis stimuli.
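The syllable-level measurements discussed here (mean f0, duration, mean intensity) can be sketched as follows; the frame-level tracks and syllable boundaries below are toy values, not data from the Amharic corpus.

```python
# Sketch of per-syllable prosodic measurements: mean f0, duration, mean intensity,
# given frame-level tracks and syllable boundaries (all values are toy placeholders).
import numpy as np

frame_shift = 0.010                                   # 10 ms frames
f0 = np.array([0, 0, 110, 118, 121, 119, 0, 0, 95, 98, 102, 0])         # Hz, 0 = unvoiced
intensity = np.array([52, 55, 66, 70, 71, 69, 58, 54, 63, 65, 66, 57])  # dB
syllables = [("se", 0.02, 0.06), ("lam", 0.06, 0.12)]  # (label, start_s, end_s)

for label, start, end in syllables:
    i, j = int(start / frame_shift), int(end / frame_shift)
    voiced = f0[i:j][f0[i:j] > 0]                      # ignore unvoiced frames for f0
    mean_f0 = voiced.mean() if voiced.size else float("nan")
    print(label, "dur=%.3fs" % (end - start),
          "mean_f0=%.1fHz" % mean_f0,
          "mean_int=%.1fdB" % intensity[i:j].mean())
```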
{"title":"Syllable-based prosodic analysis of Amharic read speech","authors":"O. Jokisch, Y. Gebremedhin, R. Hoffmann","doi":"10.1109/SLT.2012.6424232","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424232","url":null,"abstract":"Amharic is the official language of Ethiopia and belongs to the under-resourced languages. Analyzing a new corpus of Amharic read speech, this contribution surveys syllable-based prosodic variations in f0, duration and intensity to develop suitable prosody models for speech synthesis and recognition. The article starts with a brief description of the Amharic script, the prosodic analysis methods, an accentuation experiment using resynthesis and a perceptual test. The main part summarizes stress-related analysis results for f0, duration and intensity and their interrelations. The quantitative variations of Amharic are comparable with the range in well-examined languages. The observed modifications in syllable duration and mean f0 proved to be relevant for stress perception as demonstrated in the perceptual test with resynthesis stimuli.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124615247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424213
Yosuke Kashiwagi, Masayuki Suzuki, N. Minematsu, K. Hirose
Multimodal speech recognition is a promising approach to noise-robust automatic speech recognition (ASR) and is currently attracting the attention of many researchers. Multimodal ASR uses not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, how best to integrate them is still under discussion. The weights of the audio and visual features should be set according to the noise characteristics and level: in general, audio features should receive larger weights when the noise level is low and visual features when it is high, but how can this weighting be controlled? In this paper, we propose a feature-integration method based on piecewise linear transformation. In contrast to other feature integration methods, our method can change the weighting appropriately depending on the state of the observed noisy feature, which carries information on both the uttered phonemes and the environmental noise. Experiments on noisy speech recognition are conducted on the CENSREC-1-AV task, and an average word error reduction rate of around 24% is achieved compared to a decision fusion method.
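A minimal sketch of the piecewise idea, assuming the region is chosen by the nearest centroid of the noisy audio feature and each region has its own linear transform of the concatenated audio-visual vector (all matrices are toy values; the paper's estimation procedure is not reproduced):

```python
# Sketch of piecewise linear feature integration: the transform applied to the
# concatenated audio-visual vector depends on which region (cluster) the observed
# noisy audio feature falls into. All parameters below are toy values.
import numpy as np

rng = np.random.default_rng(1)

n_regions, audio_dim, visual_dim, out_dim = 3, 4, 2, 3
centroids = rng.normal(size=(n_regions, audio_dim))          # region centroids
A = rng.normal(scale=0.3, size=(n_regions, out_dim, audio_dim + visual_dim))
b = rng.normal(scale=0.1, size=(n_regions, out_dim))

def integrate(audio, visual):
    region = np.argmin(np.linalg.norm(centroids - audio, axis=1))  # pick nearest region
    av = np.concatenate([audio, visual])
    return A[region] @ av + b[region]                              # region-specific transform

audio_frame = rng.normal(size=audio_dim)
visual_frame = rng.normal(size=visual_dim)
print(integrate(audio_frame, visual_frame))
```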
{"title":"Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition","authors":"Yosuke Kashiwagi, Masayuki Suzuki, N. Minematsu, K. Hirose","doi":"10.1109/SLT.2012.6424213","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424213","url":null,"abstract":"Multimodal speech recognition is a promising approach to realize noise robust automatic speech recognition (ASR), and is currently gathering the attention of many researchers. Multimodal ASR utilizes not only audio features, which are sensitive to background noises, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, there are still continuing discussions on how the vest integration of audio and visual features is realized. Weights of audio and visual features should be decided according to the noise features and levels: in general, larger weights to visual features when the noise level is low and vice versa, but how it can be controlled? In this paper, we propose a method based on piecewise linear transformation in feature integration. In contrast to other feature integration methods, our proposed method can appropriately change the weight depending on a state of an observed noisy feature, which has information both on uttered phonemes and environmental noise. Experiments on noisy speech recognition are conducted following to CENSREC-1-AV, and word error reduction rate around 24% is realized in average as compared to a decision fusion method.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116622854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The FAU Video Lecture Browser system
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424256
K. Riedhammer, Martin Gropp, E. Nöth
A growing number of universities and other educational institutions provide recordings of lectures and seminars as an additional resource for students. In contrast to educational films that are scripted, directed and often shot by film professionals, these plain recordings are typically not post-processed in an editorial sense. Thus, the videos often contain long periods of inactivity or silence, unnecessary repetitions, or corrections of prior mistakes. This paper describes the FAU Video Lecture Browser, a web-based platform for the interactive assessment of video lectures that helps close the gap between a plain recording and a useful e-learning resource by displaying automatically extracted and ranked key phrases on an augmented timeline based on stream graphs. In a pilot study, users of the interface were able to complete a topic localization task about 29% faster than users provided with the video only, while achieving about the same accuracy. The user interactions can be logged on the server to collect data for evaluating the quality of the phrases and rankings and for training systems that produce customized phrase rankings.
{"title":"The FAU Video Lecture Browser system","authors":"K. Riedhammer, Martin Gropp, E. Nöth","doi":"10.1109/SLT.2012.6424256","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424256","url":null,"abstract":"A growing number of universities and other educational institutions provide recordings of lectures and seminars as an additional resource to the students. In contrast to educational films that are scripted, directed and often shot by film professionals, these plain recordings are typically not post-processed in an editorial sense. Thus, the videos often contain longer periods of inactivity or silence, unnecessary repetitions, or corrections of prior mistakes. This paper describes the FAU Video Lecture Browser system, a web-based platform for the interactive assessment of video lectures, that helps to close the gap between a plain recording and a useful e-learning resource by displaying automatically extracted and ranked key phrases on an augmented time line based on stream graphs. In a pilot study, users of the interface were able to complete a topic localization task about 29 % faster than users provided with the video only while achieving about the same accuracy. The user interactions can be logged on the server to collect data to evaluate the quality of the phrases and rankings, and to train systems that produce customized phrase rankings.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126697955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424229
Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute
This paper presents research on perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16 female and 19 male German text-to-speech (TTS) voices and asked listeners to judge the perceptual distances between them in a sorting task. Through a subsequent multidimensional scaling algorithm, we extracted three dimensions. Via expert listening and a comparison with ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions, and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions from previous multidimensional analyses. Moreover, the results are analyzed by type of TTS system. The identified dimensions will be used in the future to build a dimension-based quality predictor for synthetic speech.
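The multidimensional scaling step can be sketched with scikit-learn; the random dissimilarity matrix below merely stands in for the listener-derived distances.

```python
# Sketch of the multidimensional scaling step: embed stimuli into a 3-dimensional
# perceptual space from a symmetric dissimilarity matrix (random toy data here).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_stimuli = 10                                   # 57 in the study; 10 here for brevity
d = rng.random((n_stimuli, n_stimuli))
dissim = (d + d.T) / 2                           # symmetrize
np.fill_diagonal(dissim, 0.0)

mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)               # one 3-D point per stimulus
print(coords.shape)                              # (10, 3)
```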
{"title":"What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems","authors":"Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute","doi":"10.1109/SLT.2012.6424229","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424229","url":null,"abstract":"This paper presents research on perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16/19 female/male German text-to-speech systems (TTS) and asked listeners to judge the perceptual distances between them in a sorting task. Through a subsequent multidimensional scaling algorithm, we extracted three dimensions. Via expert listening and a comparison to ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions from previous multidimensional analyses. Moreover, the results are analyzed depending on the type of TTS system. The identified dimensions will be used in the future to build a dimension-based quality predictor for synthetic speech.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127900811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimization of the DET curve in speaker verification
2012 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2012-12-01 | DOI: 10.1109/SLT.2012.6424243
Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern
Speaker verification systems are, in essence, statistical pattern detectors that trade off false rejections against false acceptances; any operating point characterized by a specific tradeoff between the two may be chosen. Training paradigms in speaker verification systems, however, either learn the classifier parameters without actually considering this tradeoff, or optimize them for a particular operating point exemplified by the ratio of positive and negative training instances supplied. In this paper we investigate training paradigms that explicitly consider the tradeoff between false rejections and false acceptances by minimizing the area under the detection error tradeoff (DET) curve. To optimize the parameters, we explicitly minimize a mathematical characterization of the area under the DET curve through generalized probabilistic descent. Experiments on the NIST 2008 database show that, for clean signals, the proposed optimization approach is at least as effective as conventional learning; on noisy data, the verification performance obtained with the proposed approach is considerably better than that obtained with conventional learning methods.
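To illustrate the idea of optimizing the whole error tradeoff rather than a single operating point, the sketch below minimizes a sigmoid-smoothed pairwise ranking loss (a proxy for the area under a DET-style curve) for a toy linear scorer by plain gradient descent; the paper's generalized probabilistic descent formulation and actual features are not reproduced.

```python
# Sketch: minimize a smoothed pairwise impostor-vs-target ranking error (an
# AUC-style proxy for the DET area) for a toy linear scorer. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
targets   = rng.normal( 1.0, 1.0, size=(100, 2))   # toy target-trial features
impostors = rng.normal(-1.0, 1.0, size=(100, 2))   # toy impostor-trial features
w = np.zeros(2)
tau, lr = 1.0, 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    s_t, s_i = targets @ w, impostors @ w
    z = (s_i[:, None] - s_t[None, :]) / tau          # pairwise impostor-minus-target scores
    sig = sigmoid(z)
    loss = sig.mean()                                # smoothed fraction of mis-ranked pairs
    grad_z = sig * (1 - sig) / tau
    grad_w = (grad_z[:, :, None] *
              (impostors[:, None, :] - targets[None, :, :])).mean(axis=(0, 1))
    w -= lr * grad_w                                 # gradient step on the smoothed objective

print("smoothed pairwise error:", loss)
```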
{"title":"Optimization of the DET curve in speaker verification","authors":"Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern","doi":"10.1109/SLT.2012.6424243","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424243","url":null,"abstract":"Speaker verification systems are, in essence, statistical pattern detectors which can trade off false rejections for false acceptances. Any operating point characterized by a specific tradeoff between false rejections and false acceptances may be chosen. Training paradigms in speaker verification systems however either learn the parameters of the classifier employed without actually considering this tradeoff, or optimize the parameters for a particular operating point exemplified by the ratio of positive and negative training instances supplied. In this paper we investigate the optimization of training paradigms to explicitly consider the tradeoff between false rejections and false acceptances, by minimizing the area under the curve of the detection error tradeoff curve. To optimize the parameters, we explicitly minimize a mathematical characterization of the area under the detection error tradeoff curve, through generalized probabilistic descent. Experiments on the NIST 2008 database show that for clean signals the proposed optimization approach is at least as effective as conventional learning. On noisy data, verification performance obtained with the proposed approach is considerably better than that obtained with conventional learning methods.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127161673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}