Pub Date: 2018-11-21  DOI: 10.21437/IberSPEECH.2018-59
Topic coherence analysis for the classification of Alzheimer's disease
A. Pompili, A. Abad, David Martins de Matos, I. Martins
Language impairment in Alzheimer’s disease is characterized by a decline in the semantic and pragmatic levels of language processing that manifests from the early stages of the disease. While semantic deficits have been widely investigated using linguistic features, pragmatic deficits remain largely unexplored. In this work, we present an approach to automatically classify Alzheimer’s disease using a set of pragmatic features extracted from a discourse production task. Following clinical practice, we use an image representing a closed domain as the elicitation form for the discourse. We then model the elicited speech as a graph that encodes a hierarchy of topics. To do so, the proposed method relies on the integration of several NLP techniques: syntactic parsing for segmenting sentences into clauses, coreference resolution for capturing dependencies among clauses, and word embeddings for identifying semantic relations among topics. According to the experimental results, pragmatic features provide promising results in distinguishing individuals with Alzheimer’s disease, comparable to solutions based on other types of linguistic features.
Pub Date: 2018-11-21  DOI: 10.21437/IBERSPEECH.2018-63
End-to-End Multi-Level Dialog Act Recognition
Eugénio Ribeiro, Ricardo Ribeiro, David Martins de Matos
The three-level dialog act annotation scheme of the DIHANA corpus poses a multi-level classification problem in which the bottom levels allow multiple or no labels for a single segment. We approach automatic dialog act recognition on the three levels using an end-to-end approach, in order to implicitly capture relations between them. Our deep neural network classifier uses a combination of word- and character-based segment representation approaches, together with a summary of the dialog history and information concerning speaker changes. We show that it is important to specialize the generic segment representation in order to capture the most relevant information for each level. On the other hand, the summary of the dialog history should combine information from the three levels to capture dependencies between them. Furthermore, the labels generated for each level help in the prediction of those of the lower levels. Overall, we achieve results which surpass those of our previous approach using the hierarchical combination of three independent per-level classifiers. The results even surpass those achieved on the simplified version of the problem addressed by previous studies, which neglected the multi-label nature of the bottom levels and only considered the label combinations present in the corpus.
Pub Date: 2018-11-21  DOI: 10.21437/IBERSPEECH.2018-46
DNN-based Embeddings for Speaker Diarization in the AuDIaS-UAM System for the Albayzin 2018 IberSPEECH-RTVE Evaluation
Alicia Lozano-Diez, Beltran Labrador, Diego de Benito-Gorrón, Pablo Ramirez, D. Toledano
This document describes the three systems submitted by the AuDIaS-UAM team to the Albayzin 2018 IberSPEECH-RTVE speaker diarization evaluation. Two of our systems (the primary and contrastive 1 submissions) are based on embeddings, fixed-length representations of a given audio segment obtained from a deep neural network (DNN) trained for speaker classification. The third system (contrastive 2) uses the classical i-vector as the representation of the audio segments. The resulting embeddings or i-vectors are then grouped using Agglomerative Hierarchical Clustering (AHC) in order to obtain the diarization labels. The new DNN-embedding approach for speaker diarization obtains remarkable performance on the Albayzin development dataset, similar to that achieved with the well-known i-vector approach.
Pub Date: 2018-11-21  DOI: 10.21437/IberSPEECH.2018-23
Listening to Laryngectomees: A study of Intelligibility and Self-reported Listening Effort of Spanish Oesophageal Speech
Sneha Raman, I. Hernáez, E. Navas, Luis Serrano
Oesophageal speakers face a multitude of challenges, such as difficulty in basic everyday communication and inability to interact with digital voice assistants. We aim to quantify the difficulty involved in understanding oesophageal speech (in human-human and human-machine interactions) by measuring intelligibility and listening effort. We conducted a web-based listening test to collect these metrics. Participants were asked to transcribe the sentences and then rate them for listening effort on a 5-point Likert scale. Intelligibility, calculated as Word Error Rate (WER), showed a significant correlation with user-rated effort. Speaker type (healthy or oesophageal) had a major effect on intelligibility and effort. Listeners familiar with oesophageal speech had no advantage over non-familiar listeners in correctly understanding oesophageal speech; however, they reported less effort when listening to oesophageal speech than non-familiar listeners did. Additionally, we calculated speaker-wise mean WERs, which were significantly lower than those obtained by an automatic speech recognition system.
Pub Date: 2018-11-21  DOI: 10.21437/IBERSPEECH.2018-5
Bilingual Prosodic Dataset Compilation for Spoken Language Translation
A. Öktem, M. Farrús, A. Bonafonte
Communication presented at: IberSpeech 2018, held from 21 to 23 November 2018 in Barcelona.
Pub Date: 2018-11-21  DOI: 10.21437/iberspeech.2018-29
Exploring Advances in Real-time MRI for Speech Production Studies of European Portuguese
Conceição Cunha, Samuel S. Silva, A. Teixeira, C. Oliveira, Paula Martins, Arun A. Joseph, J. Frahm
Pub Date: 2018-11-21  DOI: 10.21437/iberspeech.2018-27
Sign Language Gesture Classification using Neural Networks
Zuzanna Parcheta, C. Martínez-Hinarejos
Recent studies have demonstrated the power of neural networks in different fields of artificial intelligence. In most fields, such as machine translation or speech recognition, neural networks outperform previously used methods (Hidden Markov Models with Gaussian Mixtures, Statistical Machine Translation, etc.). In this paper, we demonstrate the effectiveness of the LeNet convolutional neural network for isolated-word sign language recognition. As a preprocessing step, we apply several techniques to map the gesture information onto an input of fixed dimension. The performance of these preprocessing techniques is evaluated on a Spanish Sign Language dataset. These approaches outperform previously reported results based on Hidden Markov Models.
Pub Date: 2018-11-21  DOI: 10.21437/IBERSPEECH.2018-22
Exploring E2E speech recognition systems for new languages
Conrad Bernath, Aitor Álvarez, Haritz Arzelus, C. D. Martínez
Over the last few years, advances in both machine learning algorithms and computer hardware have led to significant improvements in speech recognition technology, mainly through the use of Deep Learning paradigms. As has been amply demonstrated in different studies, Deep Neural Networks (DNNs) have already outperformed traditional Gaussian Mixture Models (GMMs) for acoustic modeling in combination with Hidden Markov Models (HMMs). More recently, new attempts have focused on building end-to-end (E2E) speech recognition architectures, especially in languages with many resources such as English and Chinese, with the aim of surpassing the performance of LSTM-HMM and more conventional systems. The aim of this work is, first, to present the different techniques that have been applied to enhance state-of-the-art E2E systems for American English using publicly available datasets. Secondly, we describe the construction of E2E systems for Spanish and Basque, and explain the strategies applied to overcome the limited availability of training data, especially for Basque as a low-resource language. In the evaluation phase, the three E2E systems are also compared with LSTM-HMM based recognition engines built and tested on the same datasets.
{"title":"Towards expressive prosody generation in TTS for reading aloud applications","authors":"Mónica Domínguez, Alicia Burga, M. Farrús, Leo Wanner","doi":"10.21437/IBERSPEECH.2018-9","DOIUrl":"https://doi.org/10.21437/IBERSPEECH.2018-9","url":null,"abstract":"Comunicacio presentada a: IberSpeech 2018, celebrat a Barcelona del 21 al 23 de novembre de 2018.","PeriodicalId":115963,"journal":{"name":"IberSPEECH Conference","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132556085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2018-11-21  DOI: 10.21437/iberspeech.2018-40
UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge
Miquel Àngel India Massana, Itziar Sagastiberri, Ponç Palau, E. Sayrol, J. Morros, J. Hernando
This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and the image signals separately. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities in the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained using a deep neural network based on the ResNet-34 architecture, trained with a metric-learning triplet loss (available from the dlib library). The feature vector of a track is obtained by averaging the features of each of its frames. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.