Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs
Javier Darna-Sequeiros, D. Toledano
Pub Date: 2018-11-21. DOI: 10.21437/iberspeech.2018-14
This paper focuses on the audio event detection problem, in particular on Google Audio Set, a database published in 2017 whose size and breadth are unprecedented for this problem. To explore the possibilities of this dataset, several classifiers based on different types of deep neural networks were designed, implemented and evaluated to assess the impact of factors such as the network architecture, the number of layers and the encoding of the data on the performance of the models. Of all the classifiers tested, the LSTM network showed the best results, with a mean average precision of 0.26652 and a mean recall of 0.30698. This result is particularly relevant since we use the embeddings provided by Google as input to the DNNs, which are sequences of at most 10 feature vectors and therefore limit the sequence-modelling capabilities of LSTMs.
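As an illustration of the kind of classifier described above, the following is a minimal sketch of an LSTM model over Audio Set clip embeddings (sequences of at most 10 feature vectors of 128 dimensions, with 527 output labels). The layer sizes, the multi-label loss and the output dimensionality are assumptions for the example, not the authors' exact configuration.

```python
# Hedged sketch: a multi-label LSTM classifier over Google Audio Set embeddings
# (sequences of up to 10 frames of 128-dim features). Hidden size and the
# number of classes are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

class AudioSetLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden=256, n_classes=527):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, <=10, 128)
        _, (h, _) = self.lstm(x)     # final hidden state summarises the clip
        return self.head(h[-1])      # class logits

model = AudioSetLSTM()
logits = model(torch.randn(4, 10, 128))                # toy batch of 4 clips
probs = torch.sigmoid(logits)                          # per-class probabilities
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(4, 527))  # multi-label loss
```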
Phonetic Variability Influence on Short Utterances in Speaker Verification
I. Viñals, A. Ortega, A. Miguel, Eduardo Lleida Solano
Pub Date: 2018-11-21. DOI: 10.21437/IberSPEECH.2018-2
This work presents an analysis of i-vectors for speaker recognition with short utterances, along with methods to alleviate the performance loss these utterances cause. Our research reveals that this degradation is strongly influenced by the phonetic mismatch between enrollment and test utterances. However, this mismatch is not exploited in the standard i-vector PLDA framework. We propose a metric to measure this phonetic mismatch and a simple yet effective compensation for the standard i-vector PLDA speaker verification system. Our results, obtained on NIST SRE10 coreext-coreext female det. 5, show relative improvements of up to 6.65% for short utterances and up to 9.84% for long utterances.
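The paper's phonetic-mismatch metric is not reproduced here; as a purely hypothetical illustration, the sketch below compares the average phone posterior distributions of an enrollment and a test utterance with a symmetrised KL divergence. The function names and the choice of divergence are assumptions for the example only, not the proposed metric.

```python
# Hypothetical illustration only: quantify phonetic mismatch by comparing the
# phone occupation distributions (averaged frame-level phone posteriors) of
# two utterances. This is NOT the metric proposed in the paper.
import numpy as np

def phone_distribution(frame_posteriors):
    """frame_posteriors: (n_frames, n_phones) array of phone posteriors."""
    dist = frame_posteriors.mean(axis=0)
    return dist / dist.sum()

def phonetic_mismatch(enroll_post, test_post):
    """Symmetrised KL divergence between the two phone distributions."""
    p = phone_distribution(enroll_post) + 1e-10
    q = phone_distribution(test_post) + 1e-10
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
enroll = rng.random((500, 40))   # toy posteriors over 40 phones (long utterance)
test = rng.random((80, 40))      # short test utterance
print(phonetic_mismatch(enroll, test))
```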
Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription
Juan M. Perero-Codosero, Javier Antón-Martín, D. Merino, Eduardo López Gonzalo, L. A. H. Gómez
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-55
Deep Neural Networks (DNNs) are a fundamental part of current ASR. The state of the art is dominated by hybrid models in which the acoustic models (AM) are built with neural networks. However, there is increasing interest in developing end-to-end Deep Learning solutions, where a neural network is trained to predict character/grapheme or sub-word sequences that can be converted directly to words. Although several promising results have been reported for end-to-end ASR systems, it is still not clear whether they can unseat hybrid systems. In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech to Text Transcription Challenge. The hybrid ASR system is based on Kaldi, while Wav2Letter is used as the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23% for the hybrid system (lowercase format without punctuation). The major limitation of Wav2Letter has been its high computational demand during training (between 6 hours and 1 day per epoch, depending on the training set), which forced us to stop the training process to meet the Challenge deadline. We believe, however, that with more training time it will provide results competitive with the hybrid system.
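Both systems are compared via word error rate; the snippet below is a minimal sketch of the standard Levenshtein-based WER computation, not the Challenge's official scoring tool.

```python
# Minimal sketch of word error rate (WER): edit distance between the reference
# and hypothesis word sequences, normalised by the reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("el tiempo para hoy", "el tiempo por hoy"))  # 0.25
```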
Improving the Automatic Speech Recognition through the improvement of Language Models
A. Martín, C. García-Mateo, Laura Docío Fernández
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-8
Language models are one of the pillars on which the performance of automatic speech recognition systems is based. Statistical language models that use word sequence probabilities (n-grams) are the most common, although deep neural networks are also beginning to be applied here, made possible by increases in computing power and improvements in algorithms. In this paper, the impact that language models have on recognition results is addressed in two situations: 1) when they are adjusted to the working environment of the final application, and 2) when their complexity grows, either by increasing the order of the n-gram models or by applying deep neural networks. Specifically, an automatic speech recognition system with different language models is applied to audio recordings corresponding to three experimental frameworks: formal orality, newscast speech, and TED talks in Galician. Experimental results showed that improving the quality of the language models yields improvements in recognition performance.
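As a hedged illustration of how language model quality is usually quantified, the sketch below builds a tiny add-one-smoothed bigram model and computes its perplexity on held-out text; the toy sentences and the smoothing choice are assumptions, and the authors' actual n-gram and neural models are not reproduced.

```python
# Hedged sketch: a tiny add-one-smoothed bigram language model and its
# perplexity on held-out text, to illustrate how LM quality is measured.
import math
from collections import Counter

train = "o tempo vai mellorar o tempo vai empeorar".split()
heldout = "o tempo vai mellorar".split()

unigrams = Counter(train)
bigrams = Counter(zip(train[:-1], train[1:]))
V = len(unigrams)                                      # vocabulary size

def bigram_prob(w1, w2):
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)  # add-one smoothing

log_prob = sum(math.log(bigram_prob(w1, w2))
               for w1, w2 in zip(heldout[:-1], heldout[1:]))
perplexity = math.exp(-log_prob / (len(heldout) - 1))    # lower is better
print(perplexity)
```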
The GTM-UVIGO System for Audiovisual Diarization
Eduardo Ramos-Muguerza, Laura Docío Fernández, J. Alba-Castro
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-41
This paper explains in detail the audiovisual system deployed by the Multimedia Technologies Group (GTM) of the atlanTTic research center at the University of Vigo for the Albayzin Multimodal Diarization Challenge (MDC) organized at the IberSpeech 2018 conference. The system is characterized by the use of state-of-the-art face and speaker verification embeddings trained with publicly available Deep Neural Networks. Video and audio tracks are processed separately to obtain a matrix of confidence values for each time segment, which are finally fused to make joint decisions on the speaker diarization result.
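A minimal sketch of this kind of late fusion is given below: per-segment confidence matrices from the audio and video streams are combined with a weighted sum before assigning each segment to an identity. The weighted-sum rule and the weight value are illustrative assumptions, not the GTM-UVIGO system's exact fusion.

```python
# Hedged sketch: late fusion of per-segment confidence matrices coming from
# face and speaker verification systems. The weight is an assumption.
import numpy as np

def fuse_and_decide(audio_conf, video_conf, w_audio=0.6):
    """audio_conf, video_conf: (n_segments, n_identities) confidence matrices."""
    fused = w_audio * audio_conf + (1.0 - w_audio) * video_conf
    return fused.argmax(axis=1)          # identity label per time segment

rng = np.random.default_rng(1)
audio = rng.random((5, 3))               # 5 segments, 3 enrolled identities
video = rng.random((5, 3))
print(fuse_and_decide(audio, video))     # e.g. one label per segment
```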
Improving Transcription of Manuscripts with Multimodality and Interaction
Emilio Granell, C. Martínez-Hinarejos, Verónica Romero
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-20
State-of-the-art natural language recognition systems allow transcribers to speed up the transcription of audio, video or image documents. These systems provide transcribers with an initial draft transcription that can be corrected with less effort than transcribing the documents from scratch. However, even the drafts offered by the most advanced systems based on Deep Learning contain errors, so supervision of those drafts by a human transcriber is still necessary to obtain the correct transcription. This supervision can be eased by using interactive and assistive transcription systems, where the transcriber and the automatic system cooperate in the amending process. Moreover, the interactive system can combine different sources of information, such as text line images and the dictation of their textual contents, in order to improve performance. In this paper, the performance of a multimodal interactive and assistive transcription system is evaluated on a Spanish historical manuscript. Although the quality of the draft transcriptions provided by a Handwriting Text Recognition system based on Deep Learning is quite good, the proposed interactive and assistive approach yields an additional reduction of transcription effort. Moreover, this effort reduction increases when speech dictation is used through an Automatic Speech Recognition system, allowing for a faster transcription process.
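As a hedged illustration of a prefix-based interactive protocol, the toy function below counts the corrections a transcriber would make when fixing a draft word by word; unlike a real assistive system, it does not re-decode the suffix after each correction, and the example sentences are invented.

```python
# Toy sketch of measuring transcription effort under a prefix-interactive
# protocol: the user fixes the first wrong word, the validated prefix is
# locked, and the rest of the draft is kept (a real system would re-decode
# the suffix with the HTR/ASR models). Trailing extra draft words are ignored.
def count_corrections(draft, reference):
    hyp = draft.split()
    ref = reference.split()
    corrections = 0
    for i, word in enumerate(ref):
        if i >= len(hyp) or hyp[i] != word:
            corrections += 1          # user types/amends this word
            hyp[i:i + 1] = [word]     # lock the corrected prefix
    return corrections

print(count_corrections("en un lugar de la mancha",
                        "en un lugar de la Mancha"))   # 1 correction
```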
Speech and monophonic singing segmentation using pitch parameters
X. Sarasola, E. Navas, David Tavarez, Luis Serrano, I. Saratxaga
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-31
In this paper we present a novel method for automatic segmentation of speech and monophonic singing voice based on only two parameters derived from pitch: the proportion of voiced segments and the percentage of pitch frames labelled as a musical note. First, voice is located in the audio files using a GMM-HMM based VAD and the pitch is calculated. Using the pitch curve, automatic musical note labelling is performed by searching for sequences of stable values. Pitch features extracted from each voice island are then classified with Support Vector Machines. Our corpus consists of recordings of live sung poetry sessions in which the audio files contain both singing and speech. The proposed system has been compared with other speech/singing discrimination systems with good results.
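The sketch below illustrates the two pitch-derived features feeding an SVM classifier; the nearest-semitone tolerance used for note labelling is a simplified stand-in for the paper's stable-value sequence search, and the toy pitch tracks are invented for the example.

```python
# Hedged sketch: proportion of voiced frames and percentage of pitch frames
# close to a musical note, classified with an SVM. The 20-cent tolerance is a
# simplification of the paper's stable-value sequence search.
import numpy as np
from sklearn.svm import SVC

def pitch_features(f0, tol_cents=20):
    """f0: per-frame pitch in Hz, 0 for unvoiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    voiced_ratio = voiced.size / f0.size
    if voiced.size == 0:
        return np.array([0.0, 0.0])
    midi = 69 + 12 * np.log2(voiced / 440.0)         # pitch on the MIDI scale
    cents_off = 100 * np.abs(midi - np.round(midi))  # distance to nearest semitone
    note_ratio = float(np.mean(cents_off < tol_cents))
    return np.array([voiced_ratio, note_ratio])

# toy voice islands: sung-like (stable, in-tune pitch) vs speech-like (varying)
X = np.vstack([pitch_features([220, 220, 221, 0, 220, 220]),
               pitch_features([180, 195, 0, 0, 205, 170])])
y = np.array([1, 0])                                 # 1 = singing, 0 = speech
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([pitch_features([330, 329, 331, 330, 0, 330])]))  # -> [1]
```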
Influence of tense, modal and lax phonation on the three-dimensional finite element synthesis of vowel [A]
M. Freixes, M. Arnela, J. Socoró, Francesc Alías, O. Guasch
Pub Date: 2018-11-21. DOI: 10.21437/IberSPEECH.2018-28
One-dimensional articulatory speech models have long been used to generate synthetic voice. These models assume plane wave propagation within the vocal tract, which holds for frequencies up to ∼5 kHz. However, higher order modes also propagate beyond this limit, which may be relevant for producing a more natural voice. Such modes could be especially important for phonation types with significant high frequency energy (HFE) content. In this work, we study the influence of tense, modal and lax phonation on the synthesis of vowel [A] through 3D finite element modelling (FEM). The three phonation types are reproduced with an LF (Liljencrants-Fant) model controlled by the Rd glottal shape parameter. The onset of the higher order modes essentially depends on the vocal tract geometry. Two geometries are considered: a realistic vocal tract obtained from MRI and a simplified straight duct with varying circular cross-sections. Long-term average spectra are computed from the FEM-synthesised [A] vowels, extracting the overall sound pressure level and the HFE level in the 8 kHz octave band. Results indicate that higher order modes may be perceptually relevant for the tense and modal voice qualities, but not for lax phonation.
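As a hedged illustration of these measurements, the sketch below computes a long-term average spectrum of a toy harmonic signal and the relative energy level in the 8 kHz octave band (roughly 5.7 to 11.3 kHz); the signal is a stand-in, not the paper's LF-model/FEM output, and the levels are uncalibrated dB, not true SPL.

```python
# Hedged sketch: long-term average spectrum (LTAS) and high-frequency-energy
# level in the 8 kHz octave band, computed on a toy harmonic signal rather
# than the paper's FEM-synthesised vowel.
import numpy as np
from scipy.signal import welch

fs = 44100
t = np.arange(int(0.5 * fs)) / fs
f0 = 120.0                                            # toy fundamental frequency
signal = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 60))

freqs, psd = welch(signal, fs=fs, nperseg=4096)       # power spectral density

def band_level_db(freqs, psd, f_lo, f_hi):
    """Relative level (dB, uncalibrated) of the power in [f_lo, f_hi)."""
    band = (freqs >= f_lo) & (freqs < f_hi)
    power = np.sum(psd[band]) * (freqs[1] - freqs[0])
    return 10 * np.log10(power + 1e-20)

overall = band_level_db(freqs, psd, 0, fs / 2)
hfe_8k = band_level_db(freqs, psd, 8000 / np.sqrt(2), 8000 * np.sqrt(2))
print(overall, hfe_8k)
```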
Building an Open Source Automatic Speech Recognition System for Catalan
B. Külebi, A. Öktem
Pub Date: 2018-11-21. DOI: 10.21437/IBERSPEECH.2018-6
Catalan is recognized as the largest stateless language in Europe; hence it is well studied in the field of speech, and various large-vocabulary Automatic Speech Recognition (ASR) solutions exist for it. However, unlike many of the official languages of Europe, it has neither an open acoustic corpus sufficiently large for training ASR models nor openly accessible acoustic models for local task execution and personal use. In order to provide the necessary tools and expertise for resource-limited languages, in this work we discuss the development of a large speech corpus of broadcast media and the building of a Catalan ASR system using CMU Sphinx. The resulting models have a WER of 35.2% on a 4-hour test set of similar recordings and 31.95% on an external 4-hour multi-speaker test set. This rate is further decreased to 11.68% with a task-specific language model. The 240 hours of broadcast speech data and the resulting models are distributed openly for use.
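For readers unfamiliar with CMU Sphinx, the sketch below shows a minimal decoding call with the classic pocketsphinx Python bindings; all model and audio paths are hypothetical placeholders, and the exact API differs between pocketsphinx versions, so treat this as an assumption-laden example rather than the authors' recipe.

```python
# Hedged sketch using the classic pocketsphinx Python bindings; paths are
# placeholders (hypothetical), and newer pocketsphinx releases expose a
# different API, so adapt to the installed version.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/ca/acoustic')      # hypothetical Catalan AM
config.set_string('-lm', 'models/ca/lm.bin')         # hypothetical language model
config.set_string('-dict', 'models/ca/lexicon.dic')  # hypothetical pronunciation dict
decoder = Decoder(config)

decoder.start_utt()
with open('utterance.raw', 'rb') as f:               # 16 kHz, 16-bit mono PCM
    decoder.process_raw(f.read(), False, True)       # no_search=False, full_utt=True
decoder.end_utt()
print(decoder.hyp().hypstr if decoder.hyp() else '<no hypothesis>')
```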
The SRI International STAR-LAB System Description for IberSPEECH-RTVE 2018 Speaker Diarization Challenge
Diego Castán, Mitchell McLaren, Mahesh Kumar Nandwana
Pub Date: 2018-11-21. DOI: 10.21437/iberspeech.2018-42
This document describes the submissions of STAR-LAB (the Speech Technology and Research Laboratory at SRI International) to the open-set condition of the IberSPEECH-RTVE 2018 Speaker Diarization Challenge. The core components of the submissions include noise-robust speech activity detection, speaker embeddings with domain adaptation for initializing diarization, and Variational Bayes (VB) diarization using DNN bottleneck i-vector subspaces.
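A minimal sketch of using agglomerative clustering over per-segment speaker embeddings to obtain an initial labelling that a Variational Bayes stage could then refine is shown below; the linkage, threshold and toy embeddings are assumptions, not SRI's tuned configuration.

```python
# Hedged sketch: agglomerative clustering of per-segment speaker embeddings as
# an initialization for a downstream VB diarization refinement. Threshold and
# linkage are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# toy embeddings: 10 segments drawn around two "speaker" centroids
centroids = rng.normal(size=(2, 64))
embeddings = np.vstack([centroids[i % 2] + 0.05 * rng.normal(size=64)
                        for i in range(10)])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # length-normalise

ahc = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5,
                              linkage="average")
init_labels = ahc.fit_predict(embeddings)      # seed labels for VB diarization
print(init_labels)
```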