Two extensions to ensemble speaker and speaking environment modeling for robust automatic speech recognition
Yu Tsao, Chin-Hui Lee
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430087
Recently, an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing unknown testing environments was studied for robust speech recognition. Each environment is modeled by a super-vector consisting of the entire set of mean vectors from all Gaussian densities of a set of HMMs for that environment. The super-vector for a new testing environment is then obtained by an affine transformation on the ensemble super-vectors. In this paper, we propose a minimum classification error training procedure to obtain discriminative ensemble elements, and a super-vector clustering technique to achieve refined ensemble structures. We test these two extensions to ESSEM on Aurora2. In a per-utterance unsupervised adaptation mode, with these two extensions we achieved an average WER of 4.99% over the 0 dB to 20 dB conditions, compared with a 5.51% WER obtained with the ML-trained gender-dependent baseline. To our knowledge, this represents the best result reported in the literature on the Aurora2 connected digit recognition task.
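A minimal sketch of the ESSEM combination step described above, assuming hypothetical names and a simple affine form (a linear combination of the ensemble super-vectors plus a bias); in practice the weights would be estimated per test utterance, e.g. by ML or the proposed MCE training:

```python
import numpy as np

def essem_supervector(ensemble, weights, bias):
    """Combine ensemble environment super-vectors into one for a new,
    unseen testing environment via an affine transformation.

    ensemble : (E, D) array, one D-dim super-vector per known environment
    weights  : (E,) combination weights (estimated from the test utterance)
    bias     : (D,) bias vector of the affine mapping
    """
    return ensemble.T @ weights + bias

# Toy usage: 4 known environments, super-vectors of dimension 6.
rng = np.random.default_rng(0)
ensemble = rng.normal(size=(4, 6))
weights = np.array([0.4, 0.3, 0.2, 0.1])  # would be estimated, not fixed
bias = np.zeros(6)
print(essem_supervector(ensemble, weights, bias).shape)  # (6,)
```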
{"title":"Two extensions to ensemble speaker and speaking environment modeling for robust automatic speech recognition","authors":"Yu Tsao, Chin-Hui Lee","doi":"10.1109/ASRU.2007.4430087","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430087","url":null,"abstract":"Recently an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing unknown testing environments was studied for robust speech recognition. Each environment is modeled by a super-vector consisting of the entire set of mean vectors from all Gaussian densities of a set of HMMs for a particular environment. The super-vector for a new testing environment is then obtained by an affine transformation on the ensemble super-vectors. In this paper, we propose a minimum classification error training procedure to obtain discriminative ensemble elements, and a super-vector clustering technique to achieve refined ensemble structures. We test these two extentions to ESSEM on Aurora2. In a per-utterance unsupervised adaptation mode we achieved an average WER of 4.99% from OdB to 20 dB conditions with these two extentions when compared with a 5.51% WER obtained with the ML-trained gender-dependent baseline. To our knowledge this represents the best result reported in the literature on the Aurora2 connected digit recognition task.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117125164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A fast-match approach for robust, faster than real-time speaker diarization
Yan Huang, Oriol Vinyals, G. Friedland, Christian A. Müller, Nikki Mirghafori, Chuck Wooters
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430196
During the past few years, speaker diarization has achieved satisfying accuracy in terms of the diarization error rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative-clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via the Bayesian information criterion (BIC). Two strategies, based on the pitch correlogram and on an unscented-transform-based approximation of the KL divergence, are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. The new system using the KL-divergence fast-match strategy performs only 14% of the BIC comparisons needed in the baseline system and speeds up the system by 41% without affecting the DER. The result is a robust, faster-than-real-time speaker diarization system.
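The fast-match idea above can be sketched as follows, with all names hypothetical: rank every cluster pair by a cheap distance (standing in for the pitch-correlogram or unscented-transform KL approximation) and run the expensive BIC test only on the top fraction of pairs:

```python
import itertools

def fast_match_merge_candidate(clusters, cheap_dist, bic_score, keep_frac=0.14):
    """Select the next pair of clusters to merge.

    clusters   : list of cluster models
    cheap_dist : f(a, b) -> float; small means 'likely same speaker'
    bic_score  : f(a, b) -> float; positive means 'merge is justified'
    keep_frac  : fraction of pairs passed to BIC (0.14 mirrors the 14%
                 of comparisons reported above)
    """
    pairs = sorted(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cheap_dist(clusters[ij[0]], clusters[ij[1]]))
    shortlist = pairs[:max(1, int(len(pairs) * keep_frac))]
    scored = [(bic_score(clusters[i], clusters[j]), i, j) for i, j in shortlist]
    best = max(scored, default=None)
    if best is None or best[0] <= 0:
        return None  # no candidate passes BIC: stop merging
    return best[1], best[2]  # indices of the pair to merge
```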
{"title":"A fast-match approach for robust, faster than real-time speaker diarization","authors":"Yan Huang, Oriol Vinyals, G. Friedland, Christian A. Müller, Nikki Mirghafori, Chuck Wooters","doi":"10.1109/ASRU.2007.4430196","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430196","url":null,"abstract":"During the past few years, speaker diarization has achieved satisfying accuracy in terms of speaker Diarization Error Rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via Bayesian Information Criterion (BIC). Two strategies based on the pitch-correlogram and the unscented-trans-form based approximation of KL-divergence are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. The new system using KL-divergence fast-match strategy only performs 14% of total BIC comparisons needed in the baseline system, speeds up the system by 41% without affecting the speaker Diarization Error Rate (DER). The result is a robust and faster than real-time speaker diarization system.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"26 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132352790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Automatic lexical pronunciations generation and update
Ghinwa F. Choueiter, S. Seneff, James R. Glass
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430113
Most automatic speech recognizers use a dictionary that maps words to one or more canonical pronunciations. Such entries are typically hand-written by lexical experts. In this research, we investigate a new approach for automatically generating lexical pronunciations using a linguistically motivated subword model, and for refining the pronunciations with spoken examples. The approach is evaluated on an isolated word recognition task with a 2k lexicon of restaurant and street names. A letter-to-sound model is first used to generate seed baseforms for the lexicon. Spoken utterances of words in the lexicon are then presented to a subword recognizer, and the top hypotheses are used to update the lexical baseforms. The spelling of each word is also used to constrain the subword search space and generate spelling-constrained baseforms. The results obtained are quite encouraging and indicate that our approach can successfully learn valid pronunciations of new words.
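A hedged sketch of the update loop described above, with an assumed data layout: letter-to-sound seed baseforms are augmented with the most frequent subword-recognizer hypotheses observed for each word:

```python
from collections import Counter

def update_baseforms(lexicon, nbest_by_word, top_k=2):
    """Refine letter-to-sound seed baseforms with spoken examples.

    lexicon       : dict word -> list of baseforms (each a list of subwords)
    nbest_by_word : dict word -> list of hypothesized subword sequences,
                    pooled over that word's spoken utterances
    top_k         : how many frequent hypotheses to add per word
    """
    for word, hyps in nbest_by_word.items():
        counts = Counter(tuple(h) for h in hyps)
        for pron, _ in counts.most_common(top_k):
            if list(pron) not in lexicon.setdefault(word, []):
                lexicon[word].append(list(pron))
    return lexicon
```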
{"title":"Automatic lexical pronunciations generation and update","authors":"Ghinwa F. Choueiter, S. Seneff, James R. Glass","doi":"10.1109/ASRU.2007.4430113","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430113","url":null,"abstract":"Most automatic speech recognizers use a dictionary that maps words to one or more canonical pronunciations. Such entries are typically hand-written by lexical experts. In this research, we investigate a new approach for automatically generating lexical pronunciations using a linguistically motivated subword model, and refining the pronunciations with spoken examples. The approach is evaluated on an isolated word recognition task with a 2 k lexicon of restaurant and street names. A letter-to-sound model is first used to generate seed baseforms for the lexicon. Then spoken utterances of words in the lexicon are presented to a subword recognizer and the top hypotheses are used to update the lexical base-forms. The spelling of each word is also used to constrain the subword search space and generate spelling-constrained baseforms. The results obtained are quite encouraging and indicate that our approach can be successfully used to learn valid pronunciations of new words.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129859588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Semantic translation error rate for evaluating translation systems
Krishna Subramanian, D. Stallard, R. Prasad, S. Saleem, P. Natarajan
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430144
In this paper, we introduce a new metric, which we call the semantic translation error rate, or STER, for evaluating the performance of machine translation systems. STER is based on the previously published translation error rate (TER) (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) metrics. Specifically, STER extends TER in two ways: first, by incorporating the word equivalence measures (WordNet and Porter stemming) standardly used by METEOR, and second, by disallowing alignments of concept words to non-concept words (i.e., stop words). We show how these features make STER alignments better suited for human-driven analysis than standard TER. We also present experimental results showing that STER correlates better with human judgments than TER. Finally, we compare STER to METEOR, and illustrate that METEOR scores computed using STER alignments have statistical properties similar to those of METEOR scores computed using METEOR alignments.
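The word-equivalence test that STER borrows from METEOR (exact match, Porter stems, WordNet synonyms) plus the concept/stop-word constraint can be sketched with NLTK; the stop list below is illustrative, and the WordNet data must be installed separately (nltk.download('wordnet')):

```python
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}  # illustrative
_stem = PorterStemmer().stem

def synonyms(word):
    """All WordNet lemma names for any sense of the word."""
    return {l.name().lower() for s in wordnet.synsets(word) for l in s.lemmas()}

def ster_alignable(ref_word, hyp_word):
    """True if the two words may be aligned under STER's rules."""
    ref_word, hyp_word = ref_word.lower(), hyp_word.lower()
    # Concept words may not align to stop words (and vice versa).
    if (ref_word in STOP_WORDS) != (hyp_word in STOP_WORDS):
        return False
    return (ref_word == hyp_word
            or _stem(ref_word) == _stem(hyp_word)
            or hyp_word in synonyms(ref_word))
```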
{"title":"Semantic translation error rate for evaluating translation systems","authors":"Krishna Subramanian, D. Stallard, R. Prasad, S. Saleem, P. Natarajan","doi":"10.1109/ASRU.2007.4430144","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430144","url":null,"abstract":"In this paper, we introduce a new metric which we call the semantic translation error rate, or STER, for evaluating the performance of machine translation systems. STER is based on the previously published translation error rate (TER) (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) metrics. Specifically, STER extends TER in two ways: first, by incorporating word equivalence measures (WordNet and Porter stemming) standardly used by METEOR, and second, by disallowing alignments of concept words to non-concept words (aka stop words). We show how these features make STER alignments better suited for human-driven analysis than standard TER. We also present experimental results that show that STER is better correlated to human judgments than TER. Finally, we compare STER to METEOR, and illustrate that METEOR scores computed using the STER alignments have similar statistical properties to METEOR scores computed using METEOR alignments.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130480239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards robust automatic evaluation of pathologic telephone speech
K. Riedhammer, G. Stemmer, T. Haderlein, M. Schuster, F. Rosanowski, E. Nöth, A. Maier
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430200
For many aspects of speech therapy, an objective evaluation of the intelligibility of a patient's speech is needed. We investigate the evaluation of speech intelligibility by means of automatic speech recognition. Previous studies have shown that measures like word accuracy are consistent with human experts' ratings. To ease the patient's burden, it is highly desirable to conduct the assessment over the telephone. However, the telephone channel degrades the quality of the speech signal, which negatively affects the results. To reduce these inaccuracies, we propose a combination of two speech recognizers. Experiments on two sets of pathological speech show that the combination yields consistent improvements in the correlation between the automatic evaluation and the ratings of human experts. Furthermore, the approach reduces the maximum error of the intelligibility measure by 10% and 25%, respectively.
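As a rough illustration only (the abstract does not specify the fusion rule, so plain averaging is an assumption, as are all names here), combining two recognizers' per-speaker word accuracies and measuring agreement with expert ratings might look like:

```python
import numpy as np

def fused_correlation(wa_recognizer_a, wa_recognizer_b, expert_ratings):
    """Average per-speaker word accuracies from two recognizers and return
    the Pearson correlation of the fused score with expert ratings."""
    fused = (np.asarray(wa_recognizer_a) + np.asarray(wa_recognizer_b)) / 2.0
    return np.corrcoef(fused, np.asarray(expert_ratings))[0, 1]

# Toy usage with made-up numbers for three speakers.
print(fused_correlation([0.62, 0.48, 0.81], [0.58, 0.52, 0.77], [3.1, 2.4, 4.2]))
```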
{"title":"Towards robust automatic evaluation of pathologic telephone speech","authors":"K. Riedhammer, G. Stemmer, T. Haderlein, M. Schuster, F. Rosanowski, E. Nöth, A. Maier","doi":"10.1109/ASRU.2007.4430200","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430200","url":null,"abstract":"For many aspects of speech therapy an objective evaluation of the intelligibility of a patient's speech is needed. We investigate the evaluation of the intelligibility of speech by means of automatic speech recognition. Previous studies have shown that measures like word accuracy are consistent with human experts' ratings. To ease the patient's burden, it is highly desirable to conduct the assessment via phone. However, the telephone channel influences the quality of the speech signal which negatively affects the results. To reduce inaccuracies, we propose a combination of two speech recognizers. Experiments on two sets of pathological speech show that the combination results in consistent improvements in the correlation between the automatic evaluation and the ratings by human experts. Furthermore, the approach leads to reductions of 10% and 25% of the maximum error of the intelligibility measure.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"269 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133156696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Phonological feature based variable frame rate scheme for improved speech recognition
A. Sangwan, J. Hansen
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430177
In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high-level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing cannot accurately reflect the dynamics of continuous speech. The proposed VFR scheme, in contrast, adapts the temporal representation of the speech signal by tying the framing strategy to the detected phone-class sequence. The phone classes are detected and segmented using appropriately trained phonological features (PFs). In this manner, the proposed scheme can track the evolution of speech due to the underlying phonetic content, and exploit the non-uniform information flow rate of speech through a variable framing strategy. The new VFR scheme is applied to automatic speech recognition on the TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results, with relative reductions in word error rate (WER) of 24% and 8% on the TIMIT and NTIMIT tasks, respectively.
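A minimal sketch of class-dependent framing as described above; the phone classes and frame-shift values are illustrative assumptions, not the paper's tuned settings:

```python
# Fast-changing classes get a short frame shift, steady ones a long shift.
FRAME_SHIFT_MS = {"vowel": 15, "nasal": 12, "fricative": 10,
                  "stop": 5, "silence": 20}

def variable_frame_times(segments, default_shift=10):
    """Generate frame start times tied to a detected phone-class sequence.

    segments : list of (phone_class, start_ms, end_ms) from the HLS stage
    """
    times = []
    for phone_class, start, end in segments:
        shift = FRAME_SHIFT_MS.get(phone_class, default_shift)
        t = start
        while t < end:
            times.append(t)
            t += shift
    return times

# A stop consonant gets dense framing, the vowel a sparser one.
print(variable_frame_times([("stop", 0, 30), ("vowel", 30, 90)]))
```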
{"title":"Phonological feature based variable frame rate scheme for improved speech recognition","authors":"A. Sangwan, J. Hansen","doi":"10.1109/ASRU.2007.4430177","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430177","url":null,"abstract":"In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing is not capable of accurately reflecting the dynamics of continuous speech. On the other hand, the proposed VFR scheme adapts the temporal representation of the speech signal by tying the framing strategy with the detected phone class sequence. The phone classes are detected and segmented by using appropriately trained phonological features (PFs). In this manner, the proposed scheme is capable of tracking the evolution of speech due to the underlying phonetic content, and exploiting the non-uniform information flow-rate of speech by using a variable framing strategy. The new VFR scheme is applied to automatic speech recognition of TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results with relative reductions of 24% and 8% in WER (word error rate) for TIMIT and NTIMIT tasks, respectively.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126632039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Deriving salient learners' mispronunciations from cross-language phonological comparisons
H. Meng, Y. Lo, Lan Wang, W. Lau
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430152
This work aims to derive salient mispronunciations made by Chinese (L1 Cantonese) learners of English (L2 American English) in order to support the design of pedagogical and remedial instruction. Our approach is grounded in the theory of language transfer and involves a systematic phonological comparison between the two languages to predict phonetic confusions that may lead to mispronunciations. We collect a corpus of speech recordings from 21 Cantonese learners of English. We develop an automatic speech recognizer by training cross-word triphone models on the TIMIT corpus. We also develop an "extended" pronunciation lexicon that incorporates the predicted phonetic confusions to generate additional, erroneous pronunciation variants for each word. The extended pronunciation lexicon is used to produce a confusion network in recognition of the English speech recordings of the Cantonese learners. We refer to the statistics of the erroneous recognition outputs to derive salient mispronunciations that substantiate the predictions of the phonological comparison.
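The "extended" lexicon construction can be sketched as follows; the confusion rules shown are illustrative stand-ins for the ones the paper derives from the Cantonese-English phonological comparison:

```python
from itertools import product

# Each canonical phone may surface as itself or as a predicted confusion
# (e.g. /r/ -> /w/, th-stopping); these example rules are assumptions.
CONFUSIONS = {"r": {"r", "w"}, "th": {"th", "d", "f"}, "v": {"v", "w", "f"}}

def extend_baseform(baseform, max_variants=50):
    """Expand one canonical baseform (a list of phones) into the canonical
    pronunciation plus predicted erroneous variants."""
    options = [sorted(CONFUSIONS.get(p, {p})) for p in baseform]
    variants = [list(v) for v in product(*options)]
    return variants[:max_variants]

# 'three' /th r iy/ -> 6 predicted pronunciation variants.
print(extend_baseform(["th", "r", "iy"]))
```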
{"title":"Deriving salient learners’ mispronunciations from cross-language phonological comparisons","authors":"H. Meng, Y. Lo, Lan Wang, W. Lau","doi":"10.1109/ASRU.2007.4430152","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430152","url":null,"abstract":"This work aims to derive salient mispronunciations made by Chinese (L1 being Cantonese) learners of English (L2 being American English) in order to support the design of pedagogical and remedial instructions. Our approach is grounded on the theory of language transfer and involves systematic phonological comparison between two languages to predict possible phonetic confusions that may lead to mispronunciations. We collect a corpus of speech recordings from some 21 Cantonese learners of English. We develop an automatic speech recognizer by training cross-word triphone models based on the TIMIT corpus. We also develop an \"extended\" pronunciation lexicon that incorporates the predicted phonetic confusions to generate additional, erroneous pronunciation variants for each word. The extended pronunciation lexicon is used to produce a confusion network in recognition of the English speech recordings of Cantonese learners. We refer to the statistics of the erroneous recognition outputs to derive salient mispronunciations that stipulates the predictions based on phonological comparison.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115396379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Combining statistical models with symbolic grammar in parsing
Junichi Tsujii
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430140
There are two streams of research in computational linguistics and natural language processing: the empiricist and rationalist traditions. Theories and computational techniques in these two streams have been developed separately and are different in nature. Although the two traditions have been considered irreconcilable and have often been antagonistic toward each other, I take issue with this view and claim that these two research streams, despite or because of their differences, can be complementary and should be combined into a unified methodology. In this talk I will show that there have been interesting developments in this direction of integration, and I will discuss some of the recent results and their implications for engineering applications.
{"title":"Combining statistical models with symbolic grammar in parsing","authors":"Junichi Tsujii","doi":"10.1109/ASRU.2007.4430140","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430140","url":null,"abstract":"There are two streams of research in computational linguistics and natural language processing, the empiricist and rationalist traditions. Theories and computational techniques in these two streams have been developed separately and different in nature. Although the two traditions have been considered irreconcilable and have often been antagonistic toward each other, I have contention with this assertion, and thus claim that these two research streams in linguistics, despite or due to their differences, can be complementary to each other and should be combined into a unified methodology. I will demonstrate in my talk that there have been interesting developments in this direction of integration, and would like to discuss some of the recent results with their implications on engineering application.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116797752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The LIMSI QAst systems: Comparison between human and automatic rules generation for question-answering on speech transcriptions
S. Rosset, Olivier Galibert, G. Adda, Eric Bilinski
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430188
In this paper, we present two different question-answering systems for speech transcripts. Both systems are based on a complete, multi-level analysis of both queries and documents. The first system uses handcrafted rules for the selection of small text fragments (snippets) and for answer extraction. The second replaces the handcrafting with an automatically generated research descriptor; a score based on this descriptor is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research-descriptor elements and on a number of secondary factors. The preliminary results obtained on QAst (QA on speech transcripts) development data are promising, ranging from 72% correct answers at first rank on manually transcribed meeting data to 94% on manually transcribed lecture data.
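A hedged sketch of proximity-based candidate scoring as described above (the inverse-distance weighting is an assumption, and the secondary factors are omitted): a candidate scores higher when descriptor elements occur close to it in the snippet:

```python
def proximity_score(tokens, cand_idx, descriptor_terms):
    """Score the candidate answer at tokens[cand_idx] by summing inverse
    distances to every occurrence of a research-descriptor term."""
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in descriptor_terms and i != cand_idx:
            score += 1.0 / abs(i - cand_idx)
    return score

tokens = "the meeting was chaired by alice in geneva".split()
print(proximity_score(tokens, tokens.index("alice"),
                      {"meeting", "chaired"}))  # 1/4 + 1/2 = 0.75
```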
{"title":"The LIMSI QAst systems: Comparison between human and automatic rules generation for question-answering on speech transcriptions","authors":"S. Rosset, Olivier Galibert, G. Adda, Eric Bilinski","doi":"10.1109/ASRU.2007.4430188","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430188","url":null,"abstract":"In this paper, we present two different question-answering systems on speech transcripts. These two systems are based on a complete and multi-level analysis of both queries and documents. The first system uses handcrafted rules for small text fragments (snippet) selection and answer extraction. The second one replaces the handcrafting with an automatically generated research descriptor. A score based on those descriptors is used to select documents and snippets. The extraction and scoring of candidate answers is based on proximity measurements within the research descriptor elements and a number of secondary factors. The preliminary results obtained on QAst (QA on speech transcripts) development data are promising ranged from 72% correct answer at 1 st rank on manually transcribed meeting data to 94% on manually transcribed lecture data.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121858378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient use of overlap information in speaker diarization
Scott Otterson, Mari Ostendorf
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430194
Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear whether overlaps are problematic for speaker clustering and/or whether errors could be addressed by assigning multiple labels in overlap regions. In this paper, we examine these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of these problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, the use of cross-correlation features together with MFCCs reduces the performance gap due to overlaps, so there is little gain from removing overlapped regions before clustering.
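The simple labeling strategy described above can be sketched as follows, assuming oracle overlap detection as in the paper's experiments and a hypothetical segment layout:

```python
def assign_overlap_labels(segments, overlaps):
    """Give each overlap region the speaker labels of its neighbors.

    segments : time-sorted list of (start, end, speaker) single-speaker
               segments produced by diarization
    overlaps : list of (start, end) overlap regions, assumed perfectly
               detected (oracle), as in the experiments above
    Returns a list of (start, end, {speakers}) for the overlap regions.
    """
    labeled = []
    for o_start, o_end in overlaps:
        speakers = set()
        before = [s for s in segments if s[1] <= o_start]
        after = [s for s in segments if s[0] >= o_end]
        if before:
            speakers.add(before[-1][2])  # speaker of the preceding segment
        if after:
            speakers.add(after[0][2])    # speaker of the following segment
        labeled.append((o_start, o_end, speakers))
    return labeled

segs = [(0.0, 4.2, "spk1"), (4.8, 9.0, "spk2")]
print(assign_overlap_labels(segs, [(4.2, 4.8)]))  # both neighbors' labels
```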
{"title":"Efficient use of overlap information in speaker diarization","authors":"Scott Otterson, Mari Ostendorf","doi":"10.1109/ASRU.2007.4430194","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430194","url":null,"abstract":"Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear if overlaps are problematic for speaker clustering and/or if errors could be addressed by assigning multiple labels in overlap regions. In this paper, we look at these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of these problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could potentially improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, the use of cross-correlation features with MFCC's reduces the performance gap due to overlaps, so that there is little gain from removing overlapped regions before clustering.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123828210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}