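Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.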

Journal of the Acoustical Society of America | IF 2.1 | Zone 2, Physics and Astrophysics | JCR Q2, Acoustics | Pub Date: 2024-11-01 | DOI: 10.1121/10.0034359
Yi-Fen Liu, Xiang-Li Lu
Volume 156, issue 5, pages 3353-3372 | Publication type: Journal Article
Citations: 0

Abstract

Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.

Many speech characteristics, including conventional acoustic features such as mel frequency cepstrum coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, are used in a deep neural network or successfully fine-tuned with a connectionist temporal classification for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information on syllables and words. Key components of the model framework are the contour encoder and rhythm encoder, as well as the cross-attention between contours and rhythms established in the interaction encoder. Using contextual tonal contours as a reference, as well as rhythmic information derived from duration variations to consolidate more on contour representations for tone recognition, TNet-Full achieves absolute performance improvements of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%) when compared to a naive, simple baseline transformer, TNet-base. The relative improvements are 34.2% and 12.1%. As humans perceive tones, contour abstractions of tones can only be derived from F0 sequences, and tone recognition would be improved if syllable temporal organization was stable and predictable instead of fluctuating as seen in conversations.
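
The abstract quotes both absolute and relative improvements, which are easy to confuse. The short Python sketch below recomputes both from the four percentages given above. The function name `improvements` and the dictionary `reported` are illustrative names introduced here, not anything from the paper, and the sketch assumes the relative figure is the absolute gain divided by the TNet-base score.

```python
from typing import Tuple


def improvements(baseline: float, improved: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    # Assumption: "relative improvement" means the gain as a share of the baseline.
    relative = absolute / baseline * 100.0
    return absolute, relative


# Performance figures (%) quoted in the abstract: (TNet-base, TNet-Full).
reported = {
    "read speech": (71.4, 95.8),
    "conversational speech": (52.1, 58.4),
}

for condition, (base, full) in reported.items():
    absolute, relative = improvements(base, full)
    print(f"{condition}: +{absolute:.1f} points absolute, +{relative:.1f}% relative")
```

Rounded to one decimal place, this gives 24.4 points and 34.2% for read speech, and 6.3 points and 12.1% for conversational speech, matching the figures in the abstract.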

Source journal
CiteScore: 4.60
Self-citation rate: 16.70%
Articles published: 1433
Review time: 4.7 months
About the journal: Since 1929 The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music and noise; psychology and physiology of hearing; engineering acoustics, transduction; bioacoustics, animal bioacoustics.