
IberSPEECH Conference: Latest Publications

Topic coherence analysis for the classification of Alzheimer's disease
Pub Date: 2018-11-21 | DOI: 10.21437/IberSPEECH.2018-59
A. Pompili, A. Abad, David Martins de Matos, I. Martins
Language impairment in Alzheimer's disease is characterized by a decline in the semantic and pragmatic levels of language processing that manifests from the early stages of the disease. While semantic deficits have been widely investigated using linguistic features, pragmatic deficits remain largely unexplored. In this work, we present an approach to automatically classify Alzheimer's disease using a set of pragmatic features extracted from a discourse production task. Following clinical practice, we use an image representing a closed domain as the discourse elicitation form. We then model the elicited speech as a graph that encodes a hierarchy of topics. To do so, the proposed method integrates several NLP techniques: syntactic parsing for segmenting sentences into clauses, coreference resolution for capturing dependencies among clauses, and word embeddings for identifying semantic relations among topics. According to the experimental results, pragmatic features provide promising results in distinguishing individuals with Alzheimer's disease, comparable to solutions based on other types of linguistic features.
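As a rough illustration of how pragmatic features can be derived from a topic graph, the sketch below links clauses whose embedding similarity exceeds a threshold and computes simple coherence statistics. It assumes a generic sentence-embedding function `embed`; the paper's actual pipeline (syntactic parsing, coreference resolution) is not reproduced here.

```python
# Minimal sketch (not the authors' implementation): build a clause-similarity
# graph and derive simple coherence features from it. `embed` is a stand-in
# for any sentence-embedding model returning a NumPy vector.
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def topic_graph_features(clauses, embed, threshold=0.6):
    """Return coherence features from a clause-similarity graph."""
    vecs = [embed(c) for c in clauses]
    n = len(vecs)
    # Edges connect pairs of clauses that are semantically close.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if cosine(vecs[i], vecs[j]) >= threshold]
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
    # Mean similarity of adjacent clauses approximates local topic coherence.
    local = (sum(cosine(vecs[i], vecs[i + 1]) for i in range(n - 1)) / (n - 1)
             if n > 1 else 0.0)
    return {"graph_density": density, "local_coherence": local}
```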
Citations: 6
End-to-End Multi-Level Dialog Act Recognition
Pub Date: 2018-11-21 | DOI: 10.21437/IBERSPEECH.2018-63
Eugénio Ribeiro, Ricardo Ribeiro, David Martins de Matos
The three-level dialog act annotation scheme of the DIHANA corpus poses a multi-level classification problem in which the bottom levels allow multiple or no labels for a single segment. We approach automatic dialog act recognition on the three levels using an end-to-end approach, in order to implicitly capture the relations between them. Our deep neural network classifier uses a combination of word- and character-based segment representation approaches, together with a summary of the dialog history and information concerning speaker changes. We show that it is important to specialize the generic segment representation in order to capture the most relevant information for each level. The summary of the dialog history, on the other hand, should combine information from the three levels to capture the dependencies between them. Furthermore, the labels generated for each level help in the prediction of those of the lower levels. Overall, we achieve results that surpass those of our previous approach based on the hierarchical combination of three independent per-level classifiers. The results even surpass those achieved on the simplified version of the problem addressed by previous studies, which neglected the multi-label nature of the bottom levels and considered only the label combinations present in the corpus.
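A minimal sketch of such an end-to-end multi-level setup, under assumed layer sizes and label counts (the paper's exact architecture is not reproduced): a shared segment encoder feeds three output heads, and each level's prediction is concatenated into the input of the levels below it, so the upper levels can help predict the lower ones.

```python
# Illustrative Keras model: one shared segment encoder, three per-level heads.
# Vocabulary size, dimensions and label counts are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, EMB, HID = 10000, 128, 256
N1, N2, N3 = 11, 10, 8  # labels per annotation level (placeholders)

tokens = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB, EMB, mask_zero=True)(tokens)
seg = layers.Bidirectional(layers.LSTM(HID))(x)  # generic segment encoding

# Top level: a single label per segment (softmax).
lvl1 = layers.Dense(N1, activation="softmax", name="level1")(seg)
# Bottom levels allow multiple or no labels (sigmoid), conditioned on the
# predictions of the levels above.
lvl2 = layers.Dense(N2, activation="sigmoid", name="level2")(
    layers.Concatenate()([seg, lvl1]))
lvl3 = layers.Dense(N3, activation="sigmoid", name="level3")(
    layers.Concatenate()([seg, lvl1, lvl2]))

model = tf.keras.Model(tokens, [lvl1, lvl2, lvl3])
model.compile(optimizer="adam",
              loss={"level1": "sparse_categorical_crossentropy",
                    "level2": "binary_crossentropy",
                    "level3": "binary_crossentropy"})
```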
Citations: 2
DNN-based Embeddings for Speaker Diarization in the AuDIaS-UAM System for the Albayzin 2018 IberSPEECH-RTVE Evaluation
Pub Date: 2018-11-21 | DOI: 10.21437/IBERSPEECH.2018-46
Alicia Lozano-Diez, Beltran Labrador, Diego de Benito-Gorrón, Pablo Ramirez, D. Toledano
This document describes the three systems submitted by the AuDIaS-UAM team to the Albayzin 2018 IberSPEECH-RTVE speaker diarization evaluation. Two of our systems (the primary and contrastive 1 submissions) are based on embeddings, fixed-length representations of a given audio segment obtained from a deep neural network (DNN) trained for speaker classification. The third system (contrastive 2) uses the classical i-vector as the representation of the audio segments. The resulting embeddings or i-vectors are then grouped using Agglomerative Hierarchical Clustering (AHC) to obtain the diarization labels. The new DNN-embedding approach to speaker diarization achieves remarkable performance on the Albayzin development dataset, similar to that of the well-known i-vector approach.
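The clustering stage can be sketched as follows; `embeddings` stands in for either the DNN embeddings or the i-vectors (one row per segment), and the distance threshold is a placeholder rather than the submitted configuration.

```python
# Sketch of AHC over per-segment speaker representations (not the submitted
# system). Each row of `embeddings` is one audio segment.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(embeddings, distance_threshold=0.5):
    """Return a speaker label per audio segment."""
    clusterer = AgglomerativeClustering(
        n_clusters=None,                        # let the threshold decide
        distance_threshold=distance_threshold,  # placeholder value
        metric="cosine",                        # `affinity` in scikit-learn < 1.2
        linkage="average",
    )
    return clusterer.fit_predict(np.asarray(embeddings))
```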
Citations: 3
Listening to Laryngectomees: A study of Intelligibility and Self-reported Listening Effort of Spanish Oesophageal Speech
Pub Date: 2018-11-21 | DOI: 10.21437/IberSPEECH.2018-23
Sneha Raman, I. Hernáez, E. Navas, Luis Serrano
Oesophageal speakers face a multitude of challenges, such as difficulty in basic everyday communication and inability to interact with digital voice assistants. We aim to quantify the difficulty involved in understanding oesophageal speech (in human-human and human-machine interactions) by measuring intelligibility and listening effort. We conducted a web-based listening test to collect these metrics. Participants were asked to transcribe the sentences and then rate them for listening effort on a 5-point Likert scale. Intelligibility, calculated as Word Error Rate (WER), showed a significant correlation with user-rated effort. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. Listeners familiar with oesophageal speech had no advantage over non-familiar listeners in correctly understanding oesophageal speech. However, they reported less effort in listening to oesophageal speech than non-familiar listeners did. Additionally, we calculated speaker-wise mean WERs, which were significantly lower than those of an automatic speech recognition system.
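For reference, WER as used here is the word-level edit distance between reference and transcription, normalized by the reference length; a minimal implementation:

```python
# Word Error Rate via dynamic-programming edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and
    # the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. wer("el perro come", "el gato come") == 1/3 (one substitution).
```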
Citations: 6
Bilingual Prosodic Dataset Compilation for Spoken Language Translation
Pub Date: 2018-11-21 | DOI: 10.21437/IBERSPEECH.2018-5
A. Öktem, M. Farrús, A. Bonafonte
Communication presented at IberSpeech 2018, held 21-23 November 2018 in Barcelona.
Citations: 12
Exploring Advances in Real-time MRI for Speech Production Studies of European Portuguese
Pub Date: 2018-11-21 | DOI: 10.21437/iberspeech.2018-29
Conceição Cunha, Samuel S. Silva, A. Teixeira, C. Oliveira, Paula Martins, Arun A. Joseph, J. Frahm
{"title":"Exploring Advances in Real-time MRI for Speech Production Studies of European Portuguese","authors":"Conceição Cunha, Samuel S. Silva, A. Teixeira, C. Oliveira, Paula Martins, Arun A. Joseph, J. Frahm","doi":"10.21437/iberspeech.2018-29","DOIUrl":"https://doi.org/10.21437/iberspeech.2018-29","url":null,"abstract":"","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Sign Language Gesture Classification using Neural Networks
Pub Date: 2018-11-21 | DOI: 10.21437/iberspeech.2018-27
Zuzanna Parcheta, C. Martínez-Hinarejos
Recent studies have demonstrated the power of neural networks in different fields of artificial intelligence. In most fields, such as machine translation or speech recognition, neural networks outperform previously used methods (Hidden Markov Models with Gaussian Mixtures, Statistical Machine Translation, etc.). In this paper, the efficiency of the LeNet convolutional neural network for isolated-word sign language recognition is demonstrated. As a preprocessing step, we apply several techniques to obtain inputs of the same dimension from the gesture information. The performance of these preprocessing techniques is evaluated on a Spanish Sign Language dataset. These approaches outperform previously obtained results based on Hidden Markov Models.
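For orientation, a LeNet-style classifier in Keras might look like the sketch below; the input shape and class count are placeholders, not the paper's configuration.

```python
# Illustrative LeNet-style CNN for fixed-size gesture inputs.
import tensorflow as tf
from tensorflow.keras import layers

def build_lenet(input_shape=(32, 32, 1), num_classes=10):  # placeholders
    return tf.keras.Sequential([
        layers.Conv2D(6, 5, activation="tanh", input_shape=input_shape),
        layers.AveragePooling2D(),
        layers.Conv2D(16, 5, activation="tanh"),
        layers.AveragePooling2D(),
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),
        layers.Dense(84, activation="tanh"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```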
Citations: 1
Exploring E2E speech recognition systems for new languages
Pub Date: 2018-11-21 | DOI: 10.21437/IBERSPEECH.2018-22
Conrad Bernath, Aitor Álvarez, Haritz Arzelus, C. D. Martínez
Over the last few years, advances in both machine learning algorithms and computer hardware have led to significant improvements in speech recognition technology, mainly through the use of Deep Learning paradigms. As amply demonstrated in different studies, Deep Neural Networks (DNNs) have already outperformed traditional Gaussian Mixture Models (GMMs) at acoustic modeling in combination with Hidden Markov Models (HMMs). More recently, new attempts have focused on building end-to-end (E2E) speech recognition architectures, especially in resource-rich languages such as English and Chinese, with the aim of surpassing the performance of LSTM-HMM and more conventional systems. The aim of this work is, first, to present the different techniques that have been applied to enhance state-of-the-art E2E systems for American English using publicly available datasets. Secondly, we describe the construction of E2E systems for Spanish and Basque, and explain the strategies applied to overcome the limited availability of training data, especially for Basque as a low-resource language. In the evaluation phase, the three E2E systems are also compared with LSTM-HMM based recognition engines built and tested on the same datasets.
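One common E2E recipe in this family is a recurrent encoder trained with CTC loss over character outputs; the sketch below is illustrative, with assumed feature and layer sizes, and is not one of the systems described above.

```python
# Schematic CTC-based E2E recognizer: filterbank frames in, per-frame
# character logits out. Sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

N_MELS, N_CHARS = 80, 30  # features per frame; character inventory size

feats = layers.Input(shape=(None, N_MELS))  # (time, features)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(feats)
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
logits = layers.Dense(N_CHARS + 1)(x)       # +1 for the CTC blank symbol
model = tf.keras.Model(feats, logits)

# Training minimizes tf.nn.ctc_loss(labels, logits, label_length,
# logit_length, logits_time_major=False, blank_index=N_CHARS) over
# (audio, transcript) pairs.
```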
Citations: 2
Towards expressive prosody generation in TTS for reading aloud applications
Pub Date: 2018-11-21 | DOI: 10.21437/IBERSPEECH.2018-9
Mónica Domínguez, Alicia Burga, M. Farrús, Leo Wanner
Communication presented at IberSpeech 2018, held in Barcelona, 21-23 November 2018.
Citations: 4
UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge
Pub Date: 2018-11-21 | DOI: 10.21437/iberspeech.2018-40
Miquel Àngel India Massana, Itziar Sagastiberri, Ponç Palau, E. Sayrol, J. Morros, J. Hernando
This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and the image signals independently. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both the positive and negative distances. A sliding window is then used to compare speech segments with the enrollment speaker targets using the cosine distance between the embeddings. To detect identities from the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a Deep Neural Network based on the ResNet-34 architecture, trained with a metric-learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features of each frame of that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
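The triplet objective with the variance regularization described above could be written, as an illustration rather than the authors' code, as:

```python
# Triplet loss plus a term penalizing the variance of positive and negative
# distances; `margin` and `lam` are placeholder hyperparameters.
import tensorflow as tf

def regularized_triplet_loss(anchor, positive, negative, margin=0.2, lam=0.1):
    """anchor/positive/negative: (batch, dim) embeddings."""
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=1)
    triplet = tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))
    # Regularization term: keep both distance distributions compact.
    var_reg = tf.math.reduce_variance(d_pos) + tf.math.reduce_variance(d_neg)
    return triplet + lam * var_reg
```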
Citations: 2