
IberSPEECH Conference: Latest Publications

Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs
Pub Date : 2018-11-21 DOI: 10.21437/iberspeech.2018-14
Javier Darna-Sequeiros, D. Toledano
This paper focuses on the audio event detection problem, in particular on Google Audio Set, a database published in 2017 whose size and breadth are unprecedented for this problem. In order to explore the possibilities of this dataset, several classifiers based on different types of deep neural networks were designed, implemented and evaluated to assess the impact of factors such as the network architecture, the number of layers and the encoding of the data on model performance. Of all the classifiers tested, the LSTM neural network showed the best results, with a mean average precision of 0.26652 and a mean recall of 0.30698. This result is particularly relevant because we use the embeddings provided by Google as input to the DNNs; these are sequences of at most 10 feature vectors, which limits the sequence modelling capabilities of LSTMs.
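As a rough illustration of the kind of classifier described above (not the authors' code), the sketch below builds a multi-label LSTM model over AudioSet-style clip embeddings; the 128-dimensional frames, 10-frame sequences, 527-class label space and hidden size are assumptions.

```python
# Hypothetical sketch: an LSTM classifier over AudioSet-style clip embeddings.
# Frame dimension (128), sequence length (10), label space (527) and hidden size
# are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AudioSetLSTM(nn.Module):
    def __init__(self, emb_dim=128, hidden=256, n_classes=527):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, <=10 frames, emb_dim)
        _, (h_n, _) = self.lstm(x)         # last hidden state summarises the clip
        return self.head(h_n.squeeze(0))   # multi-label logits: (batch, n_classes)

model = AudioSetLSTM()
clips = torch.randn(4, 10, 128)            # a toy batch of 4 ten-frame clips
labels = torch.randint(0, 2, (4, 527)).float()
loss = nn.BCEWithLogitsLoss()(model(clips), labels)   # multi-label training loss
loss.backward()
```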
Citations: 2
Phonetic Variability Influence on Short Utterances in Speaker Verification
Pub Date : 2018-11-21 DOI: 10.21437/IberSPEECH.2018-2
I. Viñals, A. Ortega, A. Miguel, EDUARDO LLEIDA SOLANO
This work presents an analysis of i-vectors for speaker recognition with short utterances, along with methods to alleviate the loss of performance these utterances imply. Our research reveals that this degradation is strongly influenced by the phonetic mismatch between enrollment and test utterances. However, this mismatch is not exploited in the standard i-vector PLDA framework. We propose a metric to measure this phonetic mismatch and a simple yet effective compensation for the standard i-vector PLDA speaker verification system. Our experiments, carried out on the NIST SRE10 coreext-coreext female det. 5 condition, show relative improvements of up to 6.65% for short utterances and up to 9.84% for long utterances.
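The abstract does not specify how the phonetic mismatch is measured; purely as a hypothetical illustration of what such a score could look like, the sketch below compares the phone-occupancy histograms of an enrollment and a test utterance with the Jensen-Shannon distance. The phone inventory size and the decoded phone sequences are placeholders.

```python
# Hypothetical illustration only: score phonetic mismatch as the Jensen-Shannon
# distance between phone-occupancy histograms of two utterances. This is not the
# metric proposed in the paper, which the abstract does not define.
import numpy as np
from scipy.spatial.distance import jensenshannon

def phone_histogram(phone_ids, n_phones):
    """Normalised count of each phone label in a decoded utterance."""
    counts = np.bincount(phone_ids, minlength=n_phones).astype(float)
    return counts / counts.sum()

n_phones = 40                                        # assumed phone inventory size
rng = np.random.default_rng(0)
enroll = phone_histogram(rng.integers(0, n_phones, 800), n_phones)   # long enrollment
test = phone_histogram(rng.integers(0, n_phones, 120), n_phones)     # short test utterance
print(f"phonetic mismatch: {jensenshannon(enroll, test):.3f}")       # 0 = identical content
```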
Citations: 2
Exploring Open-Source Deep Learning ASR for Speech-to-Text TV program transcription
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-55
Juan M. Perero-Codosero, Javier Antón-Martín, D. Merino, Eduardo López Gonzalo, L. A. H. Gómez
Deep Neural Networks (DNNs) are a fundamental part of current ASR. The state of the art is dominated by hybrid models in which the acoustic models (AM) are designed using neural networks. However, there is increasing interest in developing end-to-end Deep Learning solutions, where a neural network is trained to predict character/grapheme or sub-word sequences that can be converted directly to words. Though several promising results have been reported for end-to-end ASR systems, it is still not clear whether they are capable of unseating hybrid systems. In this contribution, we evaluate open-source state-of-the-art hybrid and end-to-end Deep Learning ASR under the IberSpeech-RTVE Speech to Text Transcription Challenge. The hybrid ASR is based on Kaldi, while Wav2Letter is the end-to-end framework. Experiments were carried out using 6 hours of the dev1 and dev2 partitions. The lowest WER on the reference TV show (LM-20171107) was 22.23% for the hybrid system (lowercase format without punctuation). The major limitation of Wav2Letter has been its high computational demand for training (between 6 hours and 1 day per epoch, depending on the training set), which forced us to stop the training process to meet the Challenge deadline. We believe that, given more training time, it will provide results competitive with the hybrid system.
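The WER figures above follow the usual Levenshtein-based definition; the minimal sketch below (not the Challenge's official scoring tool) computes it for a toy reference/hypothesis pair.

```python
# Minimal word error rate (WER): edit distance between word sequences divided by
# the number of reference words. Toy sentences only; not the official scorer.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("el tiempo de hoy en madrid", "el tiempo hoy en madrit"))  # ~0.33
```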
Citations: 5
Improving the Automatic Speech Recognition through the improvement of Language Models
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-8
A. Martín, C. García-Mateo, Laura Docío Fernández
Language models are one of the pillars on which the performance of automatic speech recognition systems is based. Statistical language models that use word sequence probabilities (n-grams) are the most common, although deep neural networks are now also beginning to be applied here, made possible by increases in computational power and improvements in algorithms. In this paper, the impact that language models have on recognition results is addressed in two situations: 1) when they are adjusted to the working environment of the final application, and 2) when their complexity grows, either by increasing the order of the n-gram models or by applying deep neural networks. Specifically, an automatic speech recognition system with different language models is applied to audio recordings corresponding to three experimental frameworks: formal orality, talk on newscasts, and TED talks in Galician. Experimental results showed that improving the quality of the language models yields improvements in recognition performance.
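As a minimal sketch of situation 1), adapting a language model to the target domain can be illustrated by linearly interpolating a background bigram model with one estimated on in-domain text; the toy corpora and the interpolation weight below are assumptions, not the authors' setup.

```python
# Assumed illustration: linear interpolation of a background bigram LM with a
# small in-domain one. Corpora are toy strings; the weight would normally be
# tuned on a development set.
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram probabilities P(w2 | w1) from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

background = bigram_probs("o tempo de hoxe en galicia e o tempo de onte".split())
in_domain = bigram_probs("o tempo de hoxe sera chuvioso en toda galicia".split())

lam = 0.7  # assumed interpolation weight towards the in-domain model
def adapted_p(w1, w2):
    return lam * in_domain.get((w1, w2), 0.0) + (1 - lam) * background.get((w1, w2), 0.0)

print(adapted_p("tempo", "de"))   # bigram probability under the adapted model
```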
Citations: 0
The GTM-UVIGO System for Audiovisual Diarization
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-41
Eduardo Ramos-Muguerza, Laura Docío Fernández, J. Alba-Castro
This paper explains in detail the audiovisual system deployed by the Multimedia Technologies Group (GTM) of the atlanTTic research center at the University of Vigo for the Albayzin Multimodal Diarization Challenge (MDC) organized at the Iberspeech 2018 conference. The system is characterized by the use of state-of-the-art face and speaker verification embeddings trained with publicly available Deep Neural Networks. Video and audio tracks are processed separately to obtain, for each time segment, a matrix of confidence values; these matrices are finally fused to make joint decisions on the speaker diarization result.
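A hypothetical sketch of the fusion step described above: each modality yields a matrix of per-segment confidences over the enrolled identities, and a weighted sum produces the joint decision. The matrices and fusion weights below are illustrative only, not the GTM-UVIGO values.

```python
# Illustrative late fusion of audio and video confidence matrices
# (rows = time segments, columns = enrolled identities). Values and weights
# are made up for the example.
import numpy as np

audio_conf = np.array([[0.9, 0.1, 0.0],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.4, 0.5]])
video_conf = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.0, 0.2, 0.8]])

w_audio, w_video = 0.6, 0.4                  # assumed fusion weights
fused = w_audio * audio_conf + w_video * video_conf
labels = fused.argmax(axis=1)                # joint identity decision per segment
print(labels)                                # -> [0 1 2]
```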
Citations: 5
Improving Transcription of Manuscripts with Multimodality and Interaction
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-20
Emilio Granell, C. Martínez-Hinarejos, Verónica Romero
State-of-the-art Natural Language Recognition systems allow transcribers to speed up the transcription of audio, video or image documents. These systems provide transcribers with an initial draft transcription that can be corrected with less effort than transcribing the documents from scratch. However, even the drafts offered by the most advanced systems based on Deep Learning contain errors, so supervision of those drafts by a human transcriber is still necessary to obtain the correct transcription. This supervision can be eased by using interactive and assistive transcription systems, where the transcriber and the automatic system cooperate in the amending process. Moreover, the interactive system can combine different sources of information, such as text line images and the dictation of their textual contents, in order to improve its performance. In this paper, the performance of a multimodal interactive and assistive transcription system is evaluated on a Spanish historical manuscript. Although the quality of the draft transcriptions provided by a Handwriting Text Recognition system based on Deep Learning is quite good, the proposed interactive and assistive approach achieves an additional reduction of transcription effort. Furthermore, this effort reduction increases when speech dictation through an Automatic Speech Recognition system is used, allowing for a faster transcription process.
Citations: 4
Speech and monophonic singing segmentation using pitch parameters
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-31
X. Sarasola, E. Navas, David Tavarez, Luis Serrano, I. Saratxaga
In this paper we present a novel method for automatic segmentation of speech and monophonic singing voice based on only two parameters derived from pitch: the proportion of voiced segments and the percentage of pitch frames labelled as a musical note. First, voice is located in the audio files using a GMM-HMM based VAD and pitch is calculated. Using the pitch curve, automatic musical note labelling is performed by searching for stable-value sequences. Then, pitch features extracted from each voice island are classified with Support Vector Machines. Our corpus consists of recordings of live sung poetry sessions, whose audio files contain both singing and speech. The proposed system has been compared with other speech/singing discrimination systems with good results.
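The sketch below illustrates the two pitch parameters named above, computed from an f0 contour in which NaN marks unvoiced frames. The note criterion used here (within ±10 cents of an equal-tempered note) is a simplified stand-in for the paper's stable-value sequence labelling, and the toy contours and SVM configuration are assumptions.

```python
# Simplified illustration of the two pitch features plus an SVM classifier.
# The +/-10 cent note criterion replaces the paper's stable-value sequence search.
import numpy as np
from sklearn.svm import SVC

def pitch_features(f0_hz):
    voiced = f0_hz[~np.isnan(f0_hz)]
    if voiced.size == 0:
        return np.array([0.0, 0.0])
    midi = 69 + 12 * np.log2(voiced / 440.0)        # pitch as fractional MIDI notes
    on_note = np.abs(midi - np.round(midi)) < 0.1   # within +/-10 cents of a note
    return np.array([voiced.size / f0_hz.size,      # proportion of voiced frames
                     on_note.mean()])               # fraction labelled as a note

speech_f0 = np.array([110, 123, 137, np.nan, 151, 142, np.nan, 128], dtype=float)
singing_f0 = np.array([220, 220, 220, 220, np.nan, 247, 247, 247], dtype=float)
X = np.vstack([pitch_features(speech_f0), pitch_features(singing_f0)])
clf = SVC(kernel="linear").fit(X, ["speech", "singing"])
print(clf.predict([pitch_features(singing_f0)]))    # -> ['singing']
```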
Citations: 1
Influence of tense, modal and lax phonation on the three-dimensional finite element synthesis of vowel [A]
Pub Date : 2018-11-21 DOI: 10.21437/IberSPEECH.2018-28
M. Freixes, M. Arnela, J. Socoró, Francesc Alías, O. Guasch
One-dimensional articulatory speech models have long been used to generate synthetic voice. These models assume plane wave propagation within the vocal tract, which holds for frequencies up to ∼5 kHz. However, higher order modes also propagate beyond this limit, which may be relevant to produce a more natural voice. Such modes could be especially important for phonation types with significant high-frequency energy (HFE) content. In this work, we study the influence of tense, modal and lax phonation on the synthesis of vowel [A] through 3D finite element modelling (FEM). The three phonation types are reproduced with an LF (Liljencrants-Fant) model controlled by the Rd glottal shape parameter. The onset of the higher order modes essentially depends on the vocal tract geometry. Two geometries are considered: a realistic vocal tract obtained from MRI and a simplified straight duct with varying circular cross-sections. Long-term average spectra are computed from the FEM-synthesised [A] vowels, extracting the overall sound pressure level and the HFE level in the 8 kHz octave band. Results indicate that higher order modes may be perceptually relevant for the tense and modal voice qualities, but not for the lax phonation.
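As an illustration of the evaluation step only (not the FEM pipeline itself), the sketch below estimates a long-term average spectrum with Welch's method and reports the energy in the 8 kHz octave band; the synthetic test signal and the unit reference level are placeholders.

```python
# Assumed illustration: long-term average spectrum via Welch's method, then the
# overall level and the level in the 8 kHz octave band (about 5.66-11.31 kHz).
# The toy signal stands in for a synthesised vowel; levels are relative to 1.0.
import numpy as np
from scipy.signal import welch

fs = 44100
t = np.arange(0, 2.0, 1 / fs)
signal = np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.randn(t.size)

freqs, psd = welch(signal, fs=fs, nperseg=4096)       # long-term average spectrum
df = freqs[1] - freqs[0]
band = (freqs >= 8000 / np.sqrt(2)) & (freqs < 8000 * np.sqrt(2))
overall_db = 10 * np.log10(np.sum(psd) * df)          # overall level
hfe_db = 10 * np.log10(np.sum(psd[band]) * df)        # high-frequency energy level
print(f"overall: {overall_db:.1f} dB, HFE (8 kHz octave band): {hfe_db:.1f} dB")
```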
Citations: 4
Building an Open Source Automatic Speech Recognition System for Catalan
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-6
B. Külebi, A. Öktem
Catalan is recognized as the largest stateless language in Europe; hence it is well studied in the field of speech, and various large-vocabulary Automatic Speech Recognition (ASR) solutions exist for it. However, unlike many of the official languages of Europe, it has neither an open acoustic corpus sufficiently large for training ASR models, nor openly accessible acoustic models for local task execution and personal use. In order to provide the necessary tools and expertise for resource-limited languages, in this work we discuss the development of a large speech corpus of broadcast media and the building of a Catalan ASR system using CMU Sphinx. The resulting models have a WER of 35.2% on a 4-hour test set of similar recordings and of 31.95% on an external 4-hour multi-speaker test set. This rate is further decreased to 11.68% with a task-specific language model. The 240 hours of broadcast speech data and the resulting models are distributed openly for use.
Citations: 8
The SRI International STAR-LAB System Description for IberSPEECH-RTVE 2018 Speaker Diarization Challenge
Pub Date : 2018-11-21 DOI: 10.21437/iberspeech.2018-42
Diego Castán, Mitchell McLaren, Mahesh Kumar Nandwana
This document describes the submissions of STAR-LAB (the Speech Technology and Research Laboratory at SRI International) to the open-set condition of the IberSPEECH-RTVE 2018 Speaker Diarization Challenge. The core components of the submissions included noise-robust speech activity detection, speaker embeddings for initializing diarization with domain adaptation, and Variational Bayes (VB) diarization using DNN bottleneck i-vector subspaces.
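As a hypothetical illustration of the initialization idea (not SRI's actual system), the sketch below clusters per-segment speaker embeddings agglomeratively to obtain an initial diarization that a VB resegmentation pass could then refine; the embedding values, distance threshold and linkage are assumptions.

```python
# Assumed illustration: agglomerative clustering of length-normalised speaker
# embeddings as a diarization initialisation. Synthetic embeddings for two
# speakers; threshold and linkage are arbitrary choices.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
spk_a = rng.normal(loc=+1.0, scale=0.3, size=(5, 64))   # 5 segments from speaker A
spk_b = rng.normal(loc=-1.0, scale=0.3, size=(5, 64))   # 5 segments from speaker B
embeddings = normalize(np.vstack([spk_a, spk_b]))        # unit-length embeddings

clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0,
                                    linkage="average")
initial_labels = clusterer.fit_predict(embeddings)       # initial speaker labels
print(initial_labels)                                    # two clusters of five segments
```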
Citations: 3