Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.

IF 2.1 · CAS Zone 2 (Physics and Astronomy) · JCR Q2 (Acoustics) · Journal of the Acoustical Society of America · Pub Date: 2024-11-01 · DOI: 10.1121/10.0034359
Yi-Fen Liu, Xiang-Li Lu
J. Acoust. Soc. Am. 156(5), 3353-3372 (2024).
Citations: 0

Abstract

Many speech characteristics, including conventional acoustic features such as mel-frequency cepstral coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, have been used as inputs to deep neural networks or successfully fine-tuned with a connectionist temporal classification objective for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information on syllables and words. The key components of the model framework are the contour encoder and the rhythm encoder, together with the cross-attention between contours and rhythms established in the interaction encoder. By using contextual tonal contours as a reference and consolidating the contour representations with rhythmic information derived from duration variations, TNet-Full achieves absolute performance improvements of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%) over a simple baseline transformer, TNet-base. The relative improvements are 34.2% and 12.1%, respectively. Consistent with how humans perceive tones, contour abstractions of tones can be derived only from F0 sequences, and tone recognition improves when the temporal organization of syllables is stable and predictable rather than fluctuating, as it is in conversation.
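The reported relative gains follow directly from the absolute numbers (relative gain = absolute gain divided by the baseline score), and the cross-attention between contour and rhythm representations described above can be illustrated with a minimal scaled dot-product sketch. This is not the authors' implementation: the dimensions, the `cross_attention` helper, and the random stand-in embeddings are assumptions for demonstration only.

```python
import numpy as np

# Sanity check of the reported gains: relative = absolute gain / baseline.
for base, full in [(71.4, 95.8), (52.1, 58.4)]:
    rel = (full - base) / base * 100
    print(f"{base} -> {full}: +{full - base:.1f} pts absolute, {rel:.1f}% relative")

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # (T, S) similarity matrix
    return softmax(scores, axis=-1) @ values  # (T, d) fused representation

rng = np.random.default_rng(0)
T, S, d = 6, 4, 8                    # contour frames, rhythm tokens, model dim
contour = rng.normal(size=(T, d))    # stand-in for embedded F0 contour frames
rhythm = rng.normal(size=(S, d))     # stand-in for embedded duration features

# Contour queries attend over rhythm keys/values, as in an interaction encoder.
fused = cross_attention(contour, rhythm, rhythm)
assert fused.shape == (T, d)
```

In the full model, this single-head sketch would presumably be replaced by multi-head attention with learned query/key/value projections inside a transformer stack; the point here is only the direction of attention, with contour frames querying rhythm tokens.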

Source journal: CiteScore 4.60 · Self-citation rate 16.70% · Articles per year 1433 · Review time 4.7 months
About the journal: Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound, and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music, and noise; psychology and physiology of hearing; engineering acoustics and transduction; and bioacoustics, including animal bioacoustics.