
Latest publications from the IberSPEECH Conference

Building a global dictionary for semantic technologies
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-60
Eszter Iklódi, Gábor Recski, Gábor Borbély, María José Castro Bleda
This paper proposes a novel method for finding linear mappings among word vectors for various languages. Compared to previous approaches, this method does not learn translation matrices between two specific languages, but between a given language and a shared, universal space. The system was trained in two different modes: first between two languages, and then with three languages at the same time. In the first case, two different training sets were used: Dinu’s English-Italian benchmark data [1], and English-Italian translation pairs extracted from the PanLex database [2]. In the second case only the PanLex database was used. With its best setting, the system performs significantly better on English-Italian than the baseline system of Mikolov et al. [3], and it provides performance comparable with the more sophisticated systems of Faruqui and Dyer [4] and Dinu et al. [1]. Exploiting the richness of the PanLex database, the proposed method makes it possible to learn linear mappings among an arbitrary number of languages.
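At its core, learning such a mapping is a linear least-squares problem. The sketch below fits a map from a "source-language" embedding space into a shared space on synthetic toy vectors; all names, dimensions, and data are illustrative, not the authors' actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d_src, d_shared, n_pairs = 4, 3, 50
W_true = rng.normal(size=(d_shared, d_src))    # ground-truth map (toy)
X = rng.normal(size=(n_pairs, d_src))          # source-language word vectors
Y = X @ W_true.T                               # their images in the shared space

# Solve min_W ||X W - Y||_F^2 by ordinary least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)  # shape (d_src, d_shared)

# On noiseless data the recovered map reproduces Y almost exactly.
err = np.abs(X @ W_hat - Y).max()
print(err < 1e-8)
```

In the multi-language setting described in the abstract, one such map per language would be trained into the same shared space, rather than one map per language pair.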
Citations: 1
ODESSA/PLUMCOT at Albayzin Multimodal Diarization Challenge 2018
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-39
Benjamin Maurice, H. Bredin, Ruiqing Yin, Jose Patino, H. Delgado, C. Barras, N. Evans, Camille Guinaudeau
This paper describes the ODESSA and PLUMCOT submissions to the Albayzin Multimodal Diarization Challenge 2018. Given a list of people to recognize (alongside image and short video samples of those people), the task consists of jointly answering the two questions “who speaks when?” and “who appears when?”. Both consortia submitted 3 runs (1 primary and 2 contrastive) based on the same underlying mono-modal neural technologies: neural speaker segmentation, neural speaker embeddings, neural face embeddings, and neural talking-face detection. Our submissions aim at showing that face clustering and recognition can (hopefully) help to improve speaker diarization.
Citations: 4
Experimental Framework Design for Sign Language Automatic Recognition
Pub Date: 2018-11-21 DOI: 10.21437/iberspeech.2018-16
D. T. Santiago, Ian Benderitter, C. García-Mateo
Automatic sign language recognition (ASLR) is quite a complex task, not only for the intrinsic difficulty of automatic video information retrieval, but also because almost every sign language (SL) can be considered an under-resourced language when it comes to language technology. Spanish sign language (SSL) is one of those under-resourced languages. Developing technology for SSL implies a number of technical challenges that must be tackled in a structured and sequential manner. In this paper, the problem of how to design an experimental framework for machine-learning-based ASLR is addressed. In our review of existing datasets, our main conclusion is that there is a need for high-quality data. We therefore propose some guidelines on how to conduct the acquisition and annotation of an SSL dataset. These guidelines were developed after conducting some preliminary ASLR experiments with small and limited subsets of existing datasets.
Citations: 4
Baseline Acoustic Models for Brazilian Portuguese Using Kaldi Tools
Pub Date: 2018-11-21 DOI: 10.21437/IberSPEECH.2018-17
Cassio T. Batista, Ana Larissa Dias, N. Neto
Kaldi has become a very popular toolkit for automatic speech recognition, showing considerable improvements through the combination of hidden Markov models (HMM) and deep neural networks (DNN). However, in spite of its great performance for some languages (e.g. English, Italian, Serbian, etc.), the resources for Brazilian Portuguese (BP) are still quite limited. This work describes what appears to be the first attempt to create Kaldi-based scripts and baseline acoustic models for BP. Experiments were carried out on dictation tasks, and a comparison with the CMU Sphinx toolkit in terms of word error rate (WER) was performed. Results seem promising: Kaldi achieved the lowest overall WER, 4.75%, with HMM-DNN, and outperformed CMU Sphinx even when using Gaussian mixture models only.
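The comparison above rests on word error rate. A minimal sketch of how WER is computed, as word-level Levenshtein distance divided by reference length; the example sentences are invented for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("o gato subiu no telhado", "o gato subiu telhado"))  # 1 deletion / 5 words = 0.2
```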
Citations: 6
JHU Diarization System Description
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-49
Zili Huang, Leibny Paola García-Perera, J. Villalba, Daniel Povey, N. Dehak
We present the JHU system for the Iberspeech-RTVE Speaker Diarization Evaluation. This assessment combines Spanish language and broadcast audio in the same recordings, conditions under which our system had not been tested before. To tackle this problem, the pipeline of our general system, developed entirely in Kaldi, includes acoustic feature extraction, speech activity detection (SAD), an embedding extractor, PLDA scoring, and a clustering stage. This pipeline was used for both the open and the closed conditions (described in the evaluation plan). All the proposed solutions use wide-band data (16 kHz) and MFCCs as their input. For the closed condition, the system trains a DNN SAD using the Albayzin2016 data. Due to the small amount of data available, i-vector embedding extraction was the only approach explored for this task. The PLDA training utilizes Albayzin data, followed by Agglomerative Hierarchical Clustering (AHC) to obtain the speaker segmentation. The open condition employs the DNN SAD obtained in the closed condition. Four types of embeddings were extracted: x-vector-basic, x-vector-factored, i-vector-basic and BNF-i-vector. The x-vector-basic is a TDNN trained on augmented Voxceleb1 and Voxceleb2. The x-vector-factored is a factored TDNN (TDNN-F) trained on SRE12-micphn, MX6-micphn, VoxCeleb and SITW-dev-core. The i-vector-basic was trained on Voxceleb1 and Voxceleb2 data (no augmentation). The BNF-i-vector is a BNF-posterior i-vector trained with the same data as x-vector-factored. The PLDA training for the new scenario uses the Albayzin2016 data. The four systems were fused at the score level. Once again, AHC computed the final speaker segmentation. We tested our systems on the Albayzin2018 dev2 data and observed that the SAD is important for improving the results. Moreover, we noticed that x-vectors performed better than i-vectors, as already observed in previous experiments.
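The final clustering stage of such pipelines can be sketched as follows: a toy agglomerative hierarchical clustering over synthetic segment embeddings, with plain cosine similarity standing in for the PLDA scoring the system actually uses, and merging stopped at a similarity threshold (all values below are illustrative).

```python
import numpy as np

def ahc(embeddings: np.ndarray, threshold: float) -> list[int]:
    """Greedy agglomerative clustering; stop when best pair score < threshold."""
    clusters = [[i] for i in range(len(embeddings))]      # one cluster per segment
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def score(a, b):  # average pairwise cosine similarity between two clusters
        return np.mean([unit[i] @ unit[j] for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(score(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        best, p, q = max(pairs)
        if best < threshold:
            break
        clusters[p] = clusters[p] + clusters[q]           # merge q into p
        del clusters[q]

    labels = [0] * len(embeddings)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

# Two synthetic "speakers": segments 0-1 point one way, segments 2-3 another.
emb = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.95]])
labels = ahc(emb, threshold=0.8)
print(labels)  # segments 0,1 share one label; 2,3 share another
```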
Citations: 5
Intelligent Voice ASR system for Iberspeech 2018 Speech to Text Transcription Challenge
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-57
Nazim Dugan, C. Glackin, Gérard Chollet, Nigel Cannings
This paper describes the system developed by the Empathic team for the open-set condition of the Iberspeech 2018 Speech to Text Transcription Challenge. A DNN-HMM hybrid acoustic model is developed, with MFCCs and iVectors as input features, using the Kaldi framework. The provided ground-truth transcriptions for training and development are cleaned up using customized clean-up scripts and then realigned using a two-step alignment procedure based on word-lattice results from a previous ASR system. 261 hours of data are selected from the train and dev1 subsections of the provided data by applying a selection criterion to the utterance-level scoring results. The selected data are merged with the 91 hours of training data used to train the previous ASR system, with three-fold data augmentation by reverberation using a noise corpus on the total training data, resulting in a total of 1057 hours of final …
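The reverberation-based augmentation mentioned above amounts to convolving each clean waveform with a room impulse response. A minimal sketch with a synthetic, exponentially decaying impulse response; real setups draw impulse responses and noises from a corpus, so everything below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def reverberate(wave: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a waveform with an impulse response, keeping the original length."""
    wet = np.convolve(wave, rir)[: len(wave)]
    return wet / (np.max(np.abs(wet)) + 1e-9)   # renormalise the peak

# Toy "clean" signal (1 s at 16 kHz) and a decaying synthetic impulse response.
clean = rng.normal(size=16000)
rir = np.exp(-np.arange(800) / 200.0) * rng.normal(size=800)

augmented = reverberate(clean, rir)
print(augmented.shape == clean.shape)
```

Applying a few different impulse responses per utterance is what yields the multi-fold augmentation factor described in the abstract.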
Citations: 0
Advances on the Transcription of Historical Manuscripts based on Multimodality, Interactivity and Crowdsourcing
Pub Date: 2018-11-21 DOI: 10.4995/Thesis/10251/86137
Emilio Granell, C. Martínez-Hinarejos, Verónica Romero
Natural Language Processing (NLP) is an interdisciplinary research field of Computer Science, Linguistics, and Pattern Recognition that studies, among other topics, the use of human natural languages in Human-Computer Interaction (HCI). Most NLP research tasks can be applied to solving real-world problems. This is the case for natural language recognition and natural language translation, which can be used to build automatic systems for document transcription and document translation. Regarding digitalised handwritten text documents, transcription is used to obtain easy digital access to the contents, since simple image digitalisation only provides, in most cases, search by image and not by linguistic contents (keywords, expressions, syntactic or semantic categories). Transcription is even more important for historical manuscripts, since most of these documents are unique and the preservation of their contents is crucial for cultural and historical reasons. The transcription of historical manuscripts is usually done by paleographers, who are experts on ancient script and vocabulary. Recently, Handwritten Text Recognition (HTR) has become a common tool for assisting paleographers in their task, by providing a draft transcription that they may amend with more or less sophisticated methods. This draft transcription is useful when it presents an error rate low enough to make the amending process more comfortable than a complete transcription from scratch. Thus, obtaining a draft transcription with an acceptably low error rate is crucial to having this NLP technology incorporated into the transcription process. The work described in this thesis is focused on the improvement of the draft transcription offered by an HTR system, with the aim of reducing the effort made by paleographers to obtain the actual transcription of digitalised historical manuscripts.
This problem is faced from three different, but complementary, scenarios:
· Multimodality: The use of HTR systems allows paleographers to speed up the manual transcription process, since they are able to correct a draft transcription. Another alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition (ASR) system. When both sources (image and speech) are available, a multimodal combination is possible, and an iterative process can be used to refine the final hypothesis.
· Interactivity: The use of assistive technologies in the transcription process allows one to reduce the time and human effort required to obtain the actual transcription, given that the assistive system and the paleographer cooperate to generate a perfect transcription. Multimodal feedback can be used to provide the assistive system with additional sources of information, using signals that represent the whole same sequence of words to transcribe (e.g. a text image, and the speech of the dictation of the contents of this text image), or that represent just one word or character to correct (e.g. an on-line handwritten word).
· Crowdsourcing: Open distributed collaboration has become a powerful tool for massive transcription at a relatively low cost, since the supervision effort of paleographers can be dramatically reduced. The multimodal combination allows speech dictation of handwritten text lines to be used in a multimodal crowdsourcing platform, where collaborators can provide their speech with their own mobile devices instead of desktop or laptop computers, making it possible to recruit more collaborators.
Citations: 1
Speaker Recognition under Stress Conditions
Pub Date: 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-4
Esther Rituerto-González, A. Gallardo-Antolín, Carmen Peláez-Moreno
Speaker recognition systems exhibit a decrease in performance when the input speech is not recorded under optimal circumstances, for example when the user is under emotional or stress conditions. The objective of this paper is to measure the effects of stress on speech and ultimately try to mitigate its consequences for a speaker recognition task. In this paper, we develop a stress-robust speaker identification system using data selection and augmentation by means of the manipulation of the original speech utterances. Extensive experimentation has been carried out to assess the effectiveness of the proposed techniques. First, we conclude that the best performance is always obtained when naturally stressed samples are included in the training set; second, when these are not available, their substitution and augmentation with synthetically generated stress-like samples improves the performance of the system.
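The abstract does not spell out which manipulations of the utterances are applied. As an illustrative stand-in, the sketch below applies a simple speed perturbation by resampling, one common way of manipulating speech for augmentation; the waveform and parameter values are made up.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform so it plays `factor` times faster (factor > 1 shortens it)."""
    n_out = int(len(wave) / factor)
    old_t = np.arange(len(wave))
    new_t = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_t, old_t, wave)       # linear-interpolation resampling

# Toy waveform: a 5 Hz sine sampled at 8000 points over one second.
wave = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))
faster = speed_perturb(wave, 1.25)             # 25% faster -> shorter signal
print(len(faster))                             # 6400 samples
```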
Citations: 4
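The augmentation strategy the abstract describes — generating synthetic stress-like training samples by manipulating the original utterances — can be illustrated with a minimal numpy sketch. The varispeed resampling, the stretch factors, and the toy 16 kHz signal below are illustrative assumptions, not the paper's actual manipulation:

```python
import numpy as np

def varispeed(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform by `factor` (>1 = played back faster).

    Faster playback raises tempo and pitch together, a crude proxy
    for the raised pitch and speaking rate often heard in stressed
    speech (an assumption for illustration, not the paper's method).
    """
    n_out = int(len(signal) / factor)
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

def augment_with_stress_like(utterances, factors=(1.05, 1.1, 1.15)):
    """Return the original utterances plus synthetic stress-like copies."""
    augmented = list(utterances)
    for u in utterances:
        for f in factors:
            augmented.append(varispeed(u, f))
    return augmented

# toy example: one 1-second "utterance" at 16 kHz
rng = np.random.default_rng(0)
utts = [rng.standard_normal(16000)]
train_set = augment_with_stress_like(utts)
print(len(train_set))  # 1 original + 3 stress-like copies = 4
```

In the paper's setting, such synthetic copies would stand in for the naturally stressed samples when those are unavailable; the factors and the resampling itself are placeholders for whatever manipulation the authors actually applied.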
A Recurrent Neural Network Approach to Audio Segmentation for Broadcast Domain Data
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-19
Pablo Gimeno, I. Viñals, A. Ortega, A. Miguel, EDUARDO LLEIDA SOLANO
This paper presents a new approach to automatic audio segmentation based on recurrent neural networks. Our system takes advantage of the capability of Bidirectional Long Short-Term Memory (BLSTM) networks for modeling the temporal dynamics of the input signals. The DNN is complemented by a resegmentation module, gaining long-term stability by means of the tied-state concept in Hidden Markov Models. Furthermore, feature exploration has been performed to best represent the information in the input data. The acoustic features included are spectral log-filter-bank energies and musical features such as chroma. This new approach has been evaluated on the Albayzín 2010 audio segmentation evaluation dataset. The evaluation requires differentiating five audio conditions: music, speech, speech with music, speech with noise, and others. Competitive results were obtained, achieving a relative improvement of 15.75% over the best results found in the literature for this database.
Citations: 4
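The HMM resegmentation step the abstract mentions — stabilising BLSTM frame posteriors over time via tied states — behaves much like a Viterbi pass with a fixed penalty on class switches. A minimal numpy sketch, where the penalty value and the toy posteriors are assumptions rather than the paper's configuration:

```python
import numpy as np

CLASSES = ["music", "speech", "speech+music", "speech+noise", "other"]

def viterbi_smooth(log_post: np.ndarray, switch_penalty: float = 5.0) -> np.ndarray:
    """Smooth per-frame class scores with a simple Viterbi pass.

    log_post: (n_frames, n_classes) frame-level log-scores (e.g. from
    a BLSTM). A fixed cost on changing class between frames mimics the
    minimum-duration effect of tied HMM states.
    """
    n_frames, n_classes = log_post.shape
    # transition scores: 0 for staying in the same class, -penalty for switching
    trans = -switch_penalty * (1.0 - np.eye(n_classes))
    score = log_post[0].copy()
    back = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans                 # cand[i, j]: from i to j
        back[t] = np.argmax(cand, axis=0)             # best predecessor per class
        score = cand[back[t], np.arange(n_classes)] + log_post[t]
    # backtrack the best path
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# toy scores: noisy frames biased towards "speech" (class index 1)
rng = np.random.default_rng(0)
post = rng.normal(0.0, 1.0, size=(50, 5))
post[:, 1] += 2.0
labels = viterbi_smooth(post)
print(labels)
```

Raising `switch_penalty` trades responsiveness for stability: with a very large penalty the decoded path collapses to a single class, which is the long-term-stability behaviour the resegmentation module is after.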
The Intelligent Voice System for the IberSPEECH-RTVE 2018 Speaker Diarization Challenge
Pub Date : 2018-11-21 DOI: 10.21437/IBERSPEECH.2018-48
Abbas Khosravani, C. Glackin, Nazim Dugan, G. Chollet, Nigel Cannings
This paper describes the Intelligent Voice (IV) speaker diarization system for the IberSPEECH-RTVE 2018 speaker diarization challenge. We developed a new speaker diarization system built on the success of deep neural network based speaker embeddings in speaker verification systems. In contrast to acoustic features such as MFCCs, deep neural network embeddings are much better at discerning speaker identities, especially for speech acquired without constraints on recording equipment and environment. We perform spectral clustering on our proposed CNN-LSTM-based speaker embeddings to find homogeneous segments and generate a speaker log-likelihood for each frame. An HMM is then used to refine the speaker posterior probabilities by limiting the probability of switching between speakers when changing frames. We present results obtained on the development set (dev2) as well as the evaluation set …
Citations: 4
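The clustering stage described above — spectral clustering of segment-level speaker embeddings — can be sketched in plain numpy. The cosine affinity, normalised Laplacian, and deterministic k-means loop below are illustrative assumptions; the paper's exact pipeline (and its CNN-LSTM embeddings) may differ:

```python
import numpy as np

def spectral_cluster(embeddings: np.ndarray, n_speakers: int, n_iters: int = 20):
    """Cluster segment embeddings (n_segments, dim) into speakers.

    Cosine affinity -> normalised graph Laplacian -> k-means on the
    eigenvectors of the smallest eigenvalues.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)                    # non-negative cosine affinity
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))   # normalised Laplacian
    _, eigvec = np.linalg.eigh(L)                      # ascending eigenvalues
    V = eigvec[:, :n_speakers]
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
    # farthest-point initialisation keeps the sketch deterministic
    centers = [V[0]]
    for _ in range(1, n_speakers):
        dists = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iters):                           # Lloyd's k-means loop
        labels = np.argmin(((V[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels

# toy data: two well-separated "speakers" in a 3-d embedding space
rng = np.random.default_rng(1)
spk_a = rng.normal([5.0, 0.0, 0.0], 0.1, size=(10, 3))
spk_b = rng.normal([0.0, 5.0, 0.0], 0.1, size=(10, 3))
labels = spectral_cluster(np.vstack([spk_a, spk_b]), n_speakers=2)
print(labels)
```

In the full system, the per-segment cluster labels would then feed the frame-level speaker log-likelihoods that the HMM refinement stage smooths; that stage is analogous to Viterbi decoding with a penalty on speaker turns.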