
2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA): Latest Publications

Message of the Organizers
Akihiro Fujiwara, H. Irie, Y. Kakuda, Hiroyuki Sato
We are delighted to welcome you all to Cebu for the 22nd Conference of the Oriental COCOSDA (Oriental Chapter of the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques). This conference is technically co-sponsored by the IEEE Philippine section and organized by the Computing Society of the Philippines – Special Interest Group on Natural Language Processing, National University, and the University of San Carlos.
{"title":"Message of the Organizers","authors":"Akihiro Fujiwara, H. Irie, Y. Kakuda, Hiroyuki Sato","doi":"10.1109/o-cocosda46868.2019.9050341","DOIUrl":"https://doi.org/10.1109/o-cocosda46868.2019.9050341","url":null,"abstract":"We are delighted to welcome you all to Cebu for the 22nd Conference of the Oriental COCOSDA (Oriental Chapter of the International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques). This conference is technically co-sponsored by the IEEE Philippine section and organized by the Computing Society of the Philippines – Special Interest Group on Natural Language Processing, National University, and the University of San Carlos.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121195317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations
Chiranjeevi Yarra, Ritu Aggarwal, Avni Rajpal, P. Ghosh
With advancements in speech technology, demand for larger speech corpora is increasing, particularly for corpora from non-native English speakers. To cater to this demand in the Indian context, we present a database named Indic TIMIT, a phonetically rich Indian English speech corpus. It contains ~240 hours of speech recordings from 80 subjects, each of whom spoke the set of 2342 stimuli available in the TIMIT corpus. Further, the corpus contains phoneme transcriptions for a subset of the recordings, manually annotated by two linguists to reflect each speaker's pronunciation. In these respects, Indic TIMIT is unique among the existing corpora available in the Indian context. Along with Indic TIMIT, a lexicon named Indic English lexicon is provided, constructed by adding pronunciation variations specific to Indian speakers, obtained from their errors, to the existing word pronunciations in a native English lexicon. In this paper, the effectiveness of Indic TIMIT and the Indic English lexicon is shown in comparison with the data from TIMIT and with a lexicon augmented with all the word pronunciations from CMU, Beep and the lexicon available in the TIMIT corpus. Indic TIMIT and the Indic English lexicon could be useful for a number of potential applications in the Indian context, including automatic speech recognition, mispronunciation detection and diagnosis, native language identification, accent adaptation, accent conversion, voice conversion, speech synthesis, grapheme-to-phoneme conversion, automatic phoneme unit discovery and pronunciation error analysis.
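The lexicon construction described above amounts to merging error-derived pronunciation variants into a native-English dictionary. A minimal sketch of that idea, assuming a CMU-style plain-text format (word followed by space-separated phones); the file names are hypothetical and the released lexicon's actual format may differ:

```python
from collections import defaultdict

def load_lexicon(path):
    """Read a CMU-style lexicon into {word: set of phone-sequence strings}."""
    lexicon = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                lexicon[parts[0]].add(" ".join(parts[1:]))
    return lexicon

def augment(base, variants):
    """Merge observed pronunciation variants into the base lexicon."""
    for word, prons in variants.items():
        base[word] |= prons
    return base

# Hypothetical file names, for illustration only.
indic_lexicon = augment(load_lexicon("native_english.dict"),
                        load_lexicon("indic_mispronunciations.dict"))
```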
{"title":"Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations","authors":"Chiranjeevi Yarra, Ritu Aggarwal, Avni Rajpal, P. Ghosh","doi":"10.1109/O-COCOSDA46868.2019.9041230","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041230","url":null,"abstract":"With the advancements in the speech technology, demand for larger speech corpora is increasing particularly those from non-native English speakers. In order to cater to this demand under Indian context, we acquire a database named Indic TIMIT, a phonetically rich Indian English speech corpus. It contains ~240 hours of speech recordings from 80 subjects, in which, each subject has spoken a set of 2342 stimuli available in the TIMIT corpus. Further, the corpus also contains phoneme transcriptions for a sub-set of recordings, which are manually annotated by two linguists reflecting speaker's pronunciation. Considering these, Indic TIMIT is unique with respect to the existing corpora that are available in Indian context. Along with Indic TIMIT, a lexicon named Indic English lexicon is provided, which is constructed by incorporating pronunciation variations specific to Indians obtained from their errors to the existing word pronunciations in a native English lexicon. In this paper, the effectiveness of Indic TIMIT and Indic English lexicon is shown respectively in comparison with the data from TIMIT and a lexicon augmented with all the word pronunciations from CMU, Beep and the lexicon available in the TIMIT corpus. Indic TIMIT and Indic English lexicon could be useful for a number of potential applications in Indian context including automatic speech recognition, mispronunciation detection & diagnosis, native language identification, accent adaptation, accent conversion, voice conversion, speech synthesis, grapheme-to-phoneme conversion, automatic phoneme unit discovery and pronunciation error analysis.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123219966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Distributing and Sharing Resources for Automatic Speech Recognition Applications
Sila Chunwijitra, Surasak Boonkla, Vataya Chunwijitra, Nattapong Kurpukdee, P. Sertsi, S. Kasuriya
Deploying an automatic speech recognition (ASR) system in real-world scenarios raises many difficulties in two main areas: processing time and resource demands. These obstacles are major issues when deploying an ASR system. This paper proposes three approaches to deal with those problems: applying multithread processing to separate sub-processes, exploiting multiplexing and demultiplexing techniques on the network socket, and improving the distribution of the speech recognition engine for audio streaming. In our experiments, we evaluated these approaches with two types of speech input (audio files and audio streams). The results showed that our approaches use fewer resources (by sharing working memory) and also reduce processing time, with the real-time factor (RTF) reduced by approximately 15% compared with the baseline system.
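The real-time factor cited in the results is a standard deployment metric: processing time divided by audio duration. A minimal sketch of how it can be measured for a file-based recognizer, where `decode_fn` is a placeholder for whichever ASR engine is under test:

```python
import time
import wave

def real_time_factor(wav_path, decode_fn):
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    with wave.open(wav_path) as w:
        audio_seconds = w.getnframes() / w.getframerate()
    start = time.perf_counter()
    decode_fn(wav_path)  # run the recognizer under test on the file
    return (time.perf_counter() - start) / audio_seconds
```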
{"title":"Distributing and Sharing Resources for Automatic Speech Recognition Applications","authors":"Sila Chunwijitra, Surasak Boonkla, Vataya Chunwijitra, Nattapong Kurpukdee, P. Sertsi, S. Kasuriya","doi":"10.1109/O-COCOSDA46868.2019.9041201","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041201","url":null,"abstract":"Implementation of automatic speech recognition (ASR) system to the real scenarios has been discovered many difficulties in two main topics: processing time and resource demands. These obstructions are such big issues in deploying ASR system. This paper proposed three approaches to deal with those problems, which are applying multithread processing to separate sub-processes, exploiting multiplexing and demultiplexing technique to network socket, and improving the distribution of speech recognition engine in audio streaming. In the experiment, we evaluated our approaches with two types of speech input (audio files and audio streams). The results showed that our approaches are using fewer resources (sharing working memory) and also reduce the processing time since the real-time factor (RTF) is reduced by 15 % approximately comparing with the baseline system.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123459363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Motion detection of articulatory movement with paralinguistic information using real-time MRI movie
Takuya Asai, H. Kikuchi, K. Maekawa
The goal of this study is to establish an analytical method for articulatory movement from real-time magnetic resonance imaging (rtMRI) without estimating articulatory contours. We present the results of motion detection using a background subtraction method. Applying background subtraction to the rtMRI data of one speaker, motions were detected in the tongue, lips, and lower jaw, which are important articulators for speech production. Through experiments with movies of multiple speakers, we confirmed that the basic articulatory movements can be detected, and that some movements differed across speakers. Furthermore, we applied the proposed motion detection method to utterances aimed at transmitting paralinguistic information. As a result, some movements similar to those reported in previous research were observed.
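The abstract does not specify the exact background model, so the sketch below illustrates generic background subtraction with OpenCV's MOG2 subtractor applied to a (hypothetical) video export of an rtMRI movie; the nonzero pixels of each foreground mask give a crude per-frame motion score:

```python
import cv2

cap = cv2.VideoCapture("rtmri_movie.avi")  # hypothetical video export of rtMRI data
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

motion_per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                   # nonzero where pixels changed
    motion_per_frame.append(int((mask > 0).sum()))   # crude motion score per frame
cap.release()
```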
{"title":"Motion detection of articulatory movement with paralinguistic information using real-time MRI movie","authors":"Takuya Asai, H. Kikuchi, K. Maekawa","doi":"10.1109/O-COCOSDA46868.2019.9060850","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9060850","url":null,"abstract":"The goal of this study is to establish the analytical method of articulatory movement by real-time magnetic resonance images (rtMRI) without estimation of articulatory contours. We present the result of motion detection using a background subtraction method. As a result of applying the background subtraction method to the rtMRI data of one speaker, some motions were detected in tongue, lip, and lower jaw which are important places for speech generation. By the experiments with the movies of multiple speakers, we confirmed that it is possible to detect the motion of the basic articulatory movement, and some movements were different for each speaker. Furthermore, we adapted the proposed method for motion detection to the utterances aiming at the transmission of paralinguistic information. As a result, some similar movements to the previous research were observed.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127545550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Recognition and translation of code-switching speech utterances
Sahoko Nakayama, Takatomo Kano, Andros Tjandra, S. Sakti, Satoshi Nakamura
Code-switching (CS), a hallmark of worldwide bilingual communities, refers to a strategy adopted by bilinguals (or multilinguals) who mix two or more languages in a discourse, often with little change of interlocutor or topic. The units and locations of the switches may vary widely, from single-word switches to whole phrases (beyond the length of loanword units). Such phenomena pose challenges for spoken language technologies such as automatic speech recognition (ASR), since the systems need to be able to handle input in a multilingual setting. Several works have constructed CS ASR systems for many different language pairs, but the common aim of developing a CS ASR system is merely to transcribe CS speech utterances into CS text sentences within a single individual. In contrast, in this study, we address the situational context that arises during dialogs between CS and non-CS (monolingual) speakers, and support monolingual speakers who want to understand CS speakers. We construct a system that recognizes code-switching speech and translates it into monolingual text. We investigated several approaches, including a cascade of ASR and neural machine translation (NMT), a cascade of ASR and a deep bidirectional language model (BERT), an ASR system that directly outputs monolingual transcriptions from CS speech, and multi-task learning. Finally, we evaluate and discuss these four approaches on a Japanese-English CS to English monolingual task.
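As a sketch, the first cascade approach reduces to a two-stage pipeline; `asr` and `nmt` below are hypothetical stand-ins for the paper's trained models rather than a real API:

```python
def cascade_translate(audio, asr, nmt):
    """Cascade approach: CS speech -> CS text -> monolingual English text."""
    cs_text = asr.transcribe(audio)   # ASR output still mixes both languages
    return nmt.translate(cs_text)     # NMT maps the mixed text to one language
```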
{"title":"Recognition and translation of code-switching speech utterances","authors":"Sahoko Nakayama, Takatomo Kano, Andros Tjandra, S. Sakti, Satoshi Nakamura","doi":"10.1109/O-COCOSDA46868.2019.9060847","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9060847","url":null,"abstract":"Code-switching (CS), a hallmark of worldwide bilingual communities, refers to a strategy adopted by bilinguals (or multilinguals) who mix two or more languages in a discourse often with little change of interlocutor or topic. The units and the locations of the switches may vary widely from single-word switches to whole phrases (beyond the length of the loanword units). Such phenomena pose challenges for spoken language technologies, i.e., automatic speech recognition (ASR), since the systems need to be able to handle the input in a multilingual setting. Several works constructed a CS ASR on many different language pairs. But the common aim of developing a CS ASR is merely for transcribing CS-speech utterances into CS-text sentences within a single individual. In contrast, in this study, we address the situational context that happens during dialogs between CS and non-CS (monolingual) speakers and support monolingual speakers who want to understand CS speakers. We construct a system that recognizes and translates from codeswitching speech to monolingual text. We investigated several approaches, including a cascade of ASR and a neural machine translation (NMT), a cascade of ASR and a deep bidirectional language model (BERT), an ASR that directly outputs monolingual transcriptions from CS speech, and multi-task learning. Finally, we evaluate and discuss these four ways on a Japanese- English CS to English monolingual task.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126856618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
O-COCOSDA 2019 Thailand report October 2019
{"title":"O-COCOSDA 2019 Thailand report October 2019","authors":"","doi":"10.1109/o-cocosda46868.2019.9060839","DOIUrl":"https://doi.org/10.1109/o-cocosda46868.2019.9060839","url":null,"abstract":"","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133291213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LOTUS-BI: A Thai-English Code-mixing Speech Corpus
Sumonmas Thatphithakkul, Vataya Chunwijitra, P. Sertsi, P. Chootrakool, S. Kasuriya
Nowadays, English words mixed into Thai speech are commonly found in typical speaking styles. Consequently, to increase the performance of speech recognition systems, a Thai-English code-mixing speech corpus is required. This paper describes the design and construction of the LOTUS-BI corpus: a Thai-English code-mixing speech corpus intended as an essential speech database for training acoustic and language models, in order to obtain better speech recognition accuracy. The LOTUS-BI corpus contains 16.5 hours of speech from 4 speech tasks: interviews, talks, seminars, and meetings. So far, 11.5 hours of data from the interview, talk, and seminar tasks, acquired from the internet, have been transcribed and annotated, while the remaining 5 hours from the meeting task are still being transcribed. Therefore, only the 11.5 transcribed hours are analyzed in this paper. Furthermore, a pronunciation dictionary for the vocabulary in the LOTUS-BI corpus was created based on the Thai phoneme set. Statistical analysis of the LOTUS-BI corpus revealed that 37.96% of utterances are code-mixing, comprising 34.23% intra-sentential and 3.73% inter-sentential utterances. English vocabulary accounts for 29.04% of the total vocabulary in the corpus. Moreover, nouns make up 90% of all English vocabulary in the corpus, with the other grammatical categories accounting for the remaining 10%.
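Script information alone is enough to approximate the code-mixing counts reported above. A rough sketch, assuming one utterance per line, English tokens in Latin script, and Thai tokens in the Thai Unicode block (U+0E00-U+0E7F); separating intra- from inter-sentential switches would additionally require sentence segmentation:

```python
from collections import Counter

def token_lang(token):
    """Label a token Thai if it contains any Thai-block character, else English."""
    return "th" if any("\u0e00" <= ch <= "\u0e7f" for ch in token) else "en"

def classify_utterance(utterance):
    langs = {token_lang(t) for t in utterance.split()}
    if len(langs) > 1:
        return "code-mixing"
    return f"monolingual-{langs.pop()}" if langs else "empty"

# Toy input; the corpus transcripts themselves are not reproduced here.
utterances = ["วันนี้ประชุม meeting สองทุ่ม", "สวัสดีครับ"]
print(Counter(classify_utterance(u) for u in utterances))
```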
{"title":"LOTUS-BI: A Thai-English Code-mixing Speech Corpus","authors":"Sumonmas Thatphithakkul, Vataya Chunwijitra, P. Sertsi, P. Chootrakool, S. Kasuriya","doi":"10.1109/O-COCOSDA46868.2019.9041195","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041195","url":null,"abstract":"Nowadays, English words mixed in Thai speech are usually found in a typical speaking style. Consequently, to increase the performance of the speech recognition system, a Thai-English code-mixing speech corpus is required. This paper describes the design and construction of LOTUS-BI corpus: a Thai-English code-mixing speech corpus aimed to be the essential speech database for training acoustic model and language model in order to obtain the better speech recognition accuracy. LOTUS-BI corpus contains 16.5 speech hours from 4 speech tasks: interview, talk, seminar, and meeting. Now, 11.5 speech hours of data from the interview, talk, and seminar acquire from the internet have been transcribed and annotated. Whereas, the rest of 5 speech hours from meeting task has been transcribing. Therefore, only 11.5 speech hours of data were analyzed in this paper. Furthermore, the pronunciation dictionary of vocabularies from LOTUS-BI corpus is created based on Thai phoneme set. The statistical analysis of LOTUS-BI corpus revealed that there are 37.96% of code-mixing utterances, including 34.23% intra-sentential and 3.73% inter-sentential utterances. The occurrence of English vocabularies is 29.04% of the total vocabularies in the corpus. Besides, nouns are found in 90% of all English vocabularies in the corpus and 10% in the other grammatical categories.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124015834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A linguistic representation scheme for depression prediction - with a case study
Yuan Jia, Yuzhu Liang, T. Zhu
In this paper, we propose a representation scheme for modeling the linguistic and paralinguistic features (emotion and speech act features) of depression patients, based on which a diagnostic model is constructed. The model can be used to assist the identification of depression and to predict its degree. A case study with micro-blog data from a real depression patient and three non-patients is carried out to illustrate the discriminative power of the linguistic and paralinguistic features. The results demonstrate that the proposed representation scheme can not only distinguish the patient from non-patients but also distinguish different stages of the patient's condition.
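The abstract does not enumerate the feature inventory, so the following only illustrates the general shape of such a representation: per-post emotion and speech-act labels aggregated into normalized frequency features. The category lists are hypothetical placeholders, not the paper's scheme:

```python
from collections import Counter

# Hypothetical label inventories, for illustration only.
EMOTIONS = ["sadness", "anger", "fear", "joy", "neutral"]
SPEECH_ACTS = ["statement", "question", "complaint", "wish"]

def featurize(posts):
    """posts: list of (emotion, speech_act) label pairs, one per micro-blog post."""
    emo = Counter(e for e, _ in posts)
    act = Counter(a for _, a in posts)
    n = max(len(posts), 1)
    return [emo[e] / n for e in EMOTIONS] + [act[a] / n for a in SPEECH_ACTS]
```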
{"title":"A linguistic representation scheme for depression prediction - with a case study","authors":"Yuan Jia, Yuzhu Liang, T. Zhu","doi":"10.1109/O-COCOSDA46868.2019.9060849","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9060849","url":null,"abstract":"In this paper, we propose a representation scheme for modeling linguistic and paralinguistic features (emotion and speech act features) of depression patients, based on which a diagnostic model is constructed. The model can be used to assist the identification of depression and predict the degree of depression. A case study with the micro-blog data from a real depression patient and three non-patients, is carried out to illustrate the discriminative power of the linguistic and paralinguistic features. The results demonstrate the ability of the proposed representation scheme to not only distinguish the patient from non-patients but also distinguish different stages of the patient.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133549274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
X-vectors based Urdu Speaker Identification for short utterances
M. Farooq, F. Adeeba, S. Hussain
In the context of commercial applications, the robustness of a Speaker Identification (SI) system is adversely affected by short utterances. The performance of SI systems depends largely on the extracted feature sets. This paper investigates the effect of various feature extraction techniques on the performance of i-vector and x-vector based Urdu speaker identification models. The scope of this paper is restricted to text-independent speaker identification for short utterances (up to 4 seconds). SI systems demand large amounts of data covering sufficient inter-speaker and intra-speaker variability. The available Urdu speech corpus is used to measure the performance of various feature sets on SI systems. A minimum Equal Error Rate (EER) of 0.113% is achieved using x-vectors with the Linear Frequency Cepstral Coefficients (LFCC) feature set.
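EER is the operating point at which the false acceptance and false rejection rates are equal. A standard way to compute it from trial scores is sketched below; the toy labels and scores are placeholders, not the paper's data:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 = target speaker, 0 = impostor; scores: model similarity scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FPR and FNR cross
    return (fpr[idx] + fnr[idx]) / 2

# Toy trials; the paper reports EER as a percentage (0.113%).
print(equal_error_rate([1, 1, 0, 0], [0.9, 0.8, 0.4, 0.1]) * 100)
```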
{"title":"X-vectors based Urdu Speaker Identification for short utterances","authors":"M. Farooq, F. Adeeba, S. Hussain","doi":"10.1109/O-COCOSDA46868.2019.9041237","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041237","url":null,"abstract":"In context of commercial applications, robustness of a Speaker Identification (SI) system is adversely effected by short utterances. Performance of SI systems fairly depends upon extracted feature sets. This paper investigates the effect of various feature extraction techniques on performance of i-vectors and x-vectors based Urdu speakers' identification models. The scope of this paper is restricted to text independent speaker identification for short utterances (up to 4 seconds). SI systems demand for a large data covering sufficient inter-speaker and intra-speaker variability. Available Urdu speech corpus is used to measure performance of various feature sets on SI systems. A minimum percentage Equal Error Rate (%EER) of 0.113 is achieved using x-vectors with Linear Frequency Cepstral Coefficients (LFCCs) feature set.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122086893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
A New Corpus of Elderly Japanese Speech for Acoustic Modeling, and a Preliminary Investigation of Dialect-Dependent Speech Recognition
Meiko Fukuda, Ryota Nishimura, H. Nishizaki, Y. Iribe, N. Kitaoka
We have constructed a new speech corpus consisting of utterances from 221 elderly Japanese people (average age: 79.2) with the aim of improving the accuracy of automatic speech recognition (ASR) for the elderly. ASR is a beneficial modality for people with impaired vision or limited hand movement, including the elderly. However, speech recognition systems using standard recognition models, especially acoustic models, have been unable to achieve satisfactory performance for elderly users. Thus, creating more accurate acoustic models of elderly speech is essential for improving speech recognition for the elderly. Using our new corpus, which includes speech from elderly people living in three regions of Japan, we conducted speech recognition experiments using a variety of DNN-HMM acoustic models. As training data for our acoustic models, we examined whether a standard adult Japanese speech corpus (JNAS), an elderly speech corpus (S-JNAS) or a spontaneous speech corpus (CSJ) was most suitable, and whether adaptation to the dialect of each region improved recognition results. We adapted each of our three acoustic models to all of our speech data, and then re-adapted them using speech from each region. Without adaptation, the best recognition results were obtained with the S-JNAS-trained acoustic models (total corpus: 21.85% word error rate). However, after adapting our acoustic models to our entire corpus, the CSJ-trained models achieved the lowest WERs (entire corpus: 17.42%). Moreover, after re-adaptation to each regional dialect, the CSJ-trained acoustic model adapted to regional speech data showed a tendency toward improved recognition rates. We plan to collect more utterances from all over Japan, so that our corpus can be used as a key resource for elderly speech recognition in Japanese. We also hope to achieve further improvements in recognition performance for elderly speech.
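The WER figures quoted above follow the standard definition: word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. A self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```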
{"title":"A New Corpus of Elderly Japanese Speech for Acoustic Modeling, and a Preliminary Investigation of Dialect-Dependent Speech Recognition","authors":"Meiko Fukuda, Ryota Nishimura, H. Nishizaki, Y. Iribe, N. Kitaoka","doi":"10.1109/O-COCOSDA46868.2019.9041216","DOIUrl":"https://doi.org/10.1109/O-COCOSDA46868.2019.9041216","url":null,"abstract":"We have constructed a new speech data corpus consisting of the utterances of 221 elderly Japanese people (average age: 79.2) with the aim of improving the accuracy of automatic speech recognition (ASR) for the elderly. ASR is a beneficial modality for people with impaired vision or limited hand movement, including the elderly. However, speech recognition systems using standard recognition models, especially acoustic models, have been unable to achieve satisfactory performance for the elderly. Thus, creating more accurate acoustic models of the speech of elderly users is essential for improving speech recognition for the elderly. Using our new corpus, which includes the speech of elderly people living in three regions of Japan, we conducted speech recognition experiments using a variety of DNN-HNN acoustic models. As training data for our acoustic models, we examined whether a standard adult Japanese speech corpus (JNAS), an elderly speech corpus (S-JNAS) or a spontaneous speech corpus (CSJ) was most suitable, and whether or not adaptation to the dialect of each region improved recognition results. We adapted each of our three acoustic models to all of our speech data, and then re-adapt them using speech from each region. Without adaptation, the best recognition results were obtained when using the S-JNAS trained acoustic models (total corpus: 21.85% Word Error Rate). However, after adaptation of our acoustic models to our entire corpus, the CSJ trained models achieved the lowest WERs (entire corpus: 17.42%). Moreover, after readaptation to each regional dialect, the CSJ trained acoustic model with adaptation to regional speech data showed tendencies of improved recognition rates. We plan to collect more utterances from all over Japan, so that our corpus can be used as a key resource for elderly speech recognition in Japanese. We also hope to achieve further improvement in recognition performance for elderly speech.","PeriodicalId":263209,"journal":{"name":"2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124303078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7