A method of phonemic annotation for Chinese dialects based on a deep learning model with adaptive temporal attention and a feature disentangling structure
Authors: Bowen Jiang, Qianhui Dong, Guojin Liu
Journal: Computer Speech and Language, volume 86, Article 101624
Publication date: 2024-02-05
DOI: 10.1016/j.csl.2024.101624
URL: https://www.sciencedirect.com/science/article/pii/S088523082400007X
Citation count: 0
Abstract
Phonemic annotation labels a speech fragment with phonemic symbols. Because the phonetic features of speech vary greatly across languages and their dialects, phonemic symbols are an important means of describing and recording the phonetic system of a language, so an automatic and effective method for this task is valuable. In this paper, we first establish a Chinese dataset in which each datum consists of an original speech signal and the corresponding manually annotated phonemic characters. We then propose a deep learning model for automatic phonemic annotation of speech fragments spoken in diverse Chinese dialects. The overall structure of the model is a many-to-many deep bi-directional gated recurrent unit (GRU) network, and an adaptive temporal attention mechanism connects the encoder and decoder modules to adaptively prevent the loss of features. In addition, a feature disentangling structure based on a generative adversarial network (GAN) attenuates the interference that unrelated tone features in the original speech signal cause in the phonemic annotation task, further improving annotation performance. Extensive experiments on this dataset verify the advantage of our model and the proposed strategies.
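The abstract does not specify how the adaptive temporal attention is computed, so the following is only a generic illustration of temporal attention between an encoder and a decoder, not the authors' mechanism. It assumes dot-product scoring, and the names `temporal_attention`, `encoder_states`, and `decoder_state` are hypothetical. For one decoder step, each encoder time step is scored against the decoder state, the scores are normalized with a softmax, and the weighted sum of encoder states forms the context vector passed to the decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(encoder_states, decoder_state):
    """Return (weights, context) for a single decoder step.

    encoder_states: list of T vectors, one per input time step.
    decoder_state: one vector with the same dimension.
    Dot-product scoring is an assumption of this sketch; the paper's
    adaptive scoring function is not described in the abstract.
    """
    # One relevance score per encoder time step.
    scores = [
        sum(h_d * s_d for h_d, s_d in zip(h, decoder_state))
        for h in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of encoder states.
    dim = len(decoder_state)
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy input: three encoder time steps with two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = temporal_attention(encoder_states, decoder_state)
```

In this toy input the second and third time steps align equally well with the decoder state, so they receive equal weight and the first receives less. In a trained model the states would come from the bi-directional GRU encoder, and the scoring would be learned rather than fixed.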
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.