2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA) — Latest Publications
Using Taigi Dramas with Mandarin Chinese Subtitles to Improve Taigi Speech Recognition
Pin-Yuan Chen, Chia-Hua Wu, Hung-Shin Lee, Shao-Kang Tsao, M. Ko, Hsin-Min Wang
An obvious problem with automatic speech recognition (ASR) for Taigi is that the amount of training data is far from enough to build a practical ASR system. Collecting speech data with reliable transcripts for training the acoustic model (AM) is feasible but expensive. Moreover, text data for language model (LM) training is extremely scarce and difficult to collect, because Taigi is a spoken language rather than a commonly used written language. Interestingly, the subtitles of Taigi dramas in Taiwan have long been written in Chinese characters for Mandarin. Since a large number of Taigi drama episodes with Mandarin Chinese subtitles are available on YouTube, we propose a method to augment the training data for the AM and LM of Taigi ASR. The idea is to use an initial Taigi ASR system to convert a Mandarin Chinese subtitle into the most likely Taigi word sequence by referring to the speech. Experimental results show that our ASR system can be remarkably improved by such training data augmentation.
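The conversion step described in the abstract — turning a Mandarin subtitle into the most likely Taigi word sequence with the help of an initial ASR system — can be caricatured as scoring the ASR's n-best hypotheses against the subtitle. A minimal sketch; the `man2tai` word mapping, the romanized toy words, and the use of sequence similarity as the score are illustrative assumptions, not the paper's actual method:

```python
from difflib import SequenceMatcher

def pick_pseudo_label(nbest, subtitle, man2tai):
    """Pick the n-best Taigi hypothesis most consistent with the Mandarin
    subtitle, translated word-by-word where the (toy) mapping covers it."""
    reference = [man2tai.get(w, w) for w in subtitle]
    return max(nbest, key=lambda hyp: SequenceMatcher(None, hyp, reference).ratio())

# Toy romanized placeholder words (illustrative only).
nbest = [["goa", "ai", "li"], ["goa", "be", "li"]]   # n-best from the initial ASR
subtitle = ["我", "愛", "你"]                          # Mandarin subtitle words
man2tai = {"我": "goa", "愛": "ai", "你": "li"}        # hypothetical Mandarin→Taigi lexicon
best = pick_pseudo_label(nbest, subtitle, man2tai)    # → ['goa', 'ai', 'li']
```

The selected hypothesis would then serve as a pseudo-transcript for AM training and as text for LM training.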
DOI: 10.1109/O-COCOSDA50338.2020.9295005 · Published: 2020-11-05
Citations: 2
An Analysis of Acoustic Features in Reading Speech from Chinese Patients with Depression
Yuan Jia, Yuzhu Liang, T. Zhu
This paper investigates the acoustic features of depression patients in terms of voice quality and formants, from the perspective of experimental phonetics. The analysis of voice quality based on large samples shows that jitter, shimmer, and HNR can distinguish patients with different degrees of depression, while F0, the standard deviation of F0, and HNR can distinguish depression patients from non-patients. These features indicate that the voice of patients tends to be hoarse and rough, with a lower pitch falling into a narrower range. The analysis of formants shows that depression patients tend to centralize monophthongs and simplify diphthongs, reflected in a lower opening degree and slower movement of the tongue. Moreover, the patients tend to show lower spectral energy than healthy people. Finally, our analysis suggests that these acoustic features can be used as objective markers for the recognition of depression.
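Some of the voice-quality measures named above (F0 mean, F0 standard deviation, a jitter-like perturbation measure) are straightforward to compute from a frame-level F0 track. A minimal numpy sketch, assuming zeros mark unvoiced frames; the jitter formula here is a rough frame-based proxy, not the exact period-based clinical definition:

```python
import numpy as np

def f0_stats(f0):
    """F0 mean/SD plus a local jitter proxy: mean absolute frame-to-frame
    F0 change, normalized by mean F0 over voiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]                       # drop unvoiced (zero) frames
    jitter = np.mean(np.abs(np.diff(voiced))) / np.mean(voiced)
    return {"f0_mean": voiced.mean(), "f0_sd": voiced.std(), "jitter_local": jitter}

track = [120, 122, 119, 0, 121, 118, 123]     # Hz per frame, 0 = unvoiced
stats = f0_stats(track)
```

Shimmer and HNR would require the waveform itself (amplitude peaks and a harmonicity estimate), so they are omitted here.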
DOI: 10.1109/O-COCOSDA50338.2020.9295039 · Published: 2020-11-05
Citations: 0
Question Answering based University Chatbot using Sequence to Sequence Model
Naing Naing Khin, K. Soe
Educational chatbots have great potential to help students, teachers, and education staff, providing inquirers with useful information in the education sector. Neural chatbots are more scalable and popular than earlier rule-based chatbots. A Recurrent Neural Network based Sequence to Sequence (Seq2Seq) model can be used to create chatbots; Seq2Seq is well suited to conversational modeling over sequences, especially in question answering systems. In this paper, we explore communication through a neural network chatbot using a Sequence to Sequence model with an attention mechanism, based on an RNN encoder-decoder model. The chatbot is intended for use in the university education sector, answering frequently asked questions about the university and related information. It is the first Myanmar-language university chatbot using a neural network model and achieves a BLEU score of 0.41.
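The attention mechanism this chatbot builds on weights encoder states by their similarity to the current decoder state and takes their weighted sum as a context vector. A minimal numpy sketch of dot-product attention with toy dimensions and values; the paper's actual network is not reproduced here:

```python
import numpy as np

def dot_attention(query, enc_states):
    """Dot-product attention: softmax over query/encoder-state similarities,
    then a weighted sum of encoder states as the context vector."""
    scores = enc_states @ query                  # one score per time step, shape (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over time steps
    context = weights @ enc_states               # weighted sum, shape (H,)
    return weights, context

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 time steps, hidden size 2
q = np.array([1.0, 0.0])                               # toy decoder state
w, ctx = dot_attention(q, enc)
```

In the full Seq2Seq model the context vector is concatenated with the decoder state before predicting the next output token.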
DOI: 10.1109/O-COCOSDA50338.2020.9295021 · Published: 2020-11-05
Citations: 9
Collection and Analyses of Exemplary Speech Data to Establish Easy-to-Understand Speech Synthesis for Japanese Elderly Adults
Hideharu Nakajima, Y. Aono
This paper describes a newly developed Japanese speech database intended to identify the features of speech and speaking styles that elderly adults actually find easy to understand, toward establishing speech synthesis for elderly adults. The database has two characteristics: i) its sentences are largely taken from newsletters, going beyond just the content that elderly adults tend to know; ii) the sentences are spoken by exemplary speakers selected through an audition process from the perspective of what elderly adults actually find easy to understand. The paper describes the design of the database and the basic characteristics measured by applying conventional theories. Finally, it indicates directions for extending the conventional theories to establish an easy-to-understand speech synthesis method for elderly adults.
DOI: 10.1109/O-COCOSDA50338.2020.9295000 · Published: 2020-11-05
Citations: 3
A Front-End Technique for Automatic Noisy Speech Recognition
Hay Mar Soe Naing, Risanuri Hidayat, Rudy Hartanto, Y. Miyanaga
Sounds in a real environment rarely occur in isolation; they form complex mixtures and usually happen concurrently. Auditory masking refers to the perceptual interaction between sound components. This paper proposes modeling the effect of simultaneous masking in the Mel frequency cepstral coefficient (MFCC) front-end, effectively improving the performance of the resulting system. Moreover, Gammatone frequency integration is presented to warp the energy spectrum, providing gradually decaying weights and compensating for the loss of spectral correlation. Experiments are carried out on the Aurora-2 database, and frame-level cross-entropy-based deep neural network (DNN-HMM) training is used to build the acoustic model. With models trained on multi-condition speech data, the accuracy of our proposed feature extraction method reaches 98.14% at 10 dB, 94.40% at 5 dB, 81.67% at 0 dB, and 51.5% at −5 dB.
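The simultaneous-masking idea — a strong band partially masking its neighbors before cepstral analysis — can be illustrated with a toy spreading rule over filterbank energies. A minimal sketch assuming a one-directional spread toward higher bands with an illustrative decay factor; this is a simplified stand-in, not the paper's masking model:

```python
import numpy as np

def apply_simultaneous_masking(band_energy, decay=0.6):
    """Replace each band energy with the max of itself and the previous
    band's energy attenuated by `decay` (illustrative value, not from the
    paper), so strong bands raise the effective floor of their neighbors."""
    masked = band_energy.astype(float).copy()
    for b in range(1, len(masked)):
        masked[b] = max(masked[b], masked[b - 1] * decay)
    return masked

bands = np.array([10.0, 1.0, 0.5, 8.0, 0.2])   # toy filterbank energies
masked = apply_simultaneous_masking(bands)      # weak bands next to strong ones are lifted
```

In an MFCC pipeline this step would sit between the filterbank and the log/DCT stages.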
DOI: 10.1109/O-COCOSDA50338.2020.9295006 · Published: 2020-11-05
Citations: 1
Blind Phone Segmentation Using Contrast Function
Dac-Thang Hoang, Van-Thuy Mai, Tung-Lam Phi
Phone segmentation is the process of detecting the boundaries between phones in a spoken utterance. In this paper, phone boundaries are detected without knowledge of the speech content. Contrast, a concept from image processing, is investigated for phone segmentation. The speech signal is first transformed into the frequency domain; band energy is then extracted and treated as luminance in an image. A contrast function of a frequency band is defined on the band energy, and peaks on the contrast curve indicate phone boundaries. The boundaries detected in eight bands are combined using a probability mass function. Experiments on the TIMIT corpus are promising, and the method also yields good results on a Vietnamese corpus.
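The pipeline above (band energy → contrast function → peak picking) can be sketched with a Michelson-style contrast, one plausible reading of the paper's contrast function; the exact definition, window size, and the eight-band combination step are not reproduced here:

```python
import numpy as np
from scipy.signal import find_peaks

def band_contrast(energy, win=2):
    """Michelson-style contrast (max-min)/(max+min) of band energy in a
    sliding window; high where the energy changes sharply between frames."""
    e = np.asarray(energy, dtype=float)
    c = np.zeros_like(e)
    for t in range(len(e)):
        seg = e[max(0, t - win): t + win + 1]
        hi, lo = seg.max(), seg.min()
        c[t] = (hi - lo) / (hi + lo + 1e-12)
    return c

# Toy band energy with a sharp transition around frame 5 (a boundary candidate).
energy = np.array([1, 1, 1, 1, 1, 9, 9, 9, 9, 9], dtype=float)
contrast = band_contrast(energy)
boundaries, _ = find_peaks(contrast)   # peak frames = candidate phone boundaries
```

In the full method, candidate boundaries from eight bands would then be merged via the probability mass function.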
DOI: 10.1109/O-COCOSDA50338.2020.9295035 · Published: 2020-11-05
Citations: 0
Acoustic modeling for Thai-English code-switching speech
Vataya Chunwijitra, Sumonmas Thatphithakkul, P. Chootrakool, S. Kasuriya
Due to globalization, mixing Thai and English has become common in everyday conversations in Thailand, even when talking with Thai natives. Consequently, Thai automatic speech recognition systems deployed in multilingual communities must be able to handle Thai-English code-switching. One of the main challenges in building such a system is selecting a phone set for Thai-English pairs, where a mother-tongue-like accent interferes with English pronunciation. This paper shows evidence that an acoustic model with a Thai phoneme set improves recognition performance for Thai-English code-mixed speech. The baseline system for comparison was built with a merged phoneme set for Thai and English, in which the phones were simply combined. The experimental results show that the word error rate on monolingual Thai phones is reduced by 4.5%.
DOI: 10.1109/O-COCOSDA50338.2020.9295026 · Published: 2020-11-05
Citations: 0
Intonation Patterns of Wh-questions by EFL Learners from Jinan Dialectal Region
Ai-jun Li, Chunru Qu, Na Zhi
The study investigates Jinan EFL learners' intonation patterns in Wh-questions from a phonological perspective. When the nuclear accent is located at the beginning of a sentence, both American speakers and Jinan EFL learners adopt a falling intonation pattern. When the nuclear accent is in the middle of or at the end of a sentence, Americans apply a high tone (H*), while Jinan learners adopt a high tone (H*) or a low tone (L*). The final boundary tone of Wh-questions produced by both Americans and Jinan learners ends with L%. Jinan learners tend to accent the wh-word, no matter where the nuclear accent is, using H* or L+H*. The patterns they use for other pitch accents also vary.
DOI: 10.1109/O-COCOSDA50338.2020.9295038 · Published: 2020-11-05
Citations: 0
Country report (Korea)
This article consists only of a collection of slides from the author's conference presentation.
DOI: 10.1109/o-cocosda50338.2020.9295030 · Published: 2020-11-05
Citations: 2
Improving Valence Prediction in Dimensional Speech Emotion Recognition Using Linguistic Information
Bagus Tris Atmaja, M. Akagi
In dimensional emotion recognition, a model comprising valence, arousal, and dominance is widely used. Current research in dimensional speech emotion recognition has shown that the performance of valence prediction is lower than that of arousal and dominance. This paper presents an approach to tackle this problem: improving the low valence-prediction score by utilizing linguistic information. Our approach fuses acoustic features with linguistic features obtained by converting words to vectors. The results doubled the performance of valence prediction in both single-task learning with a single output (predicting valence only) and multitask learning with multiple outputs (predicting valence, arousal, and dominance). A proper combination of acoustic and linguistic features not only improved valence prediction but also improved arousal and dominance predictions in multitask learning.
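The fusion described — acoustic features combined with word-vector linguistic features — can be sketched as simple early fusion by concatenation. Toy dimensions throughout; the paper's actual networks and feature extractors are not shown:

```python
import numpy as np

def fuse_features(acoustic, word_vectors):
    """Early fusion: concatenate an utterance-level acoustic vector with the
    mean of the utterance's word embeddings (the word→vector step stands in
    for the paper's linguistic features)."""
    linguistic = np.mean(word_vectors, axis=0)   # pool word vectors over the utterance
    return np.concatenate([acoustic, linguistic])

acoustic = np.array([0.2, 0.5, 0.1])             # e.g. pooled acoustic statistics
words = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy 2-d word embeddings
fused = fuse_features(acoustic, words)           # 5-d fused feature vector
```

The fused vector would then feed a regressor with one output (valence) or three outputs (valence, arousal, dominance) for the multitask setting.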
DOI: 10.1109/O-COCOSDA50338.2020.9295032 · Published: 2020-11-05
Citations: 6