{"title":"通过转换器从 F0 序列和持续时间变化中学习和巩固音调的上下文轮廓表征。","authors":"Yi-Fen Liu, Xiang-Li Lu","doi":"10.1121/10.0034359","DOIUrl":null,"url":null,"abstract":"<p><p>Many speech characteristics, including conventional acoustic features such as mel frequency cepstrum coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, are used in a deep neural network or successfully fine-tuned with a connectionist temporal classification for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information on syllables and words. Key components of the model framework are the contour encoder and rhythm encoder, as well as the cross-attention between contours and rhythms established in the interaction encoder. Using contextual tonal contours as a reference, as well as rhythmic information derived from duration variations to consolidate more on contour representations for tone recognition, TNet-Full achieves absolute performance improvements of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%) when compared to a naive, simple baseline transformer, TNet-base. The relative improvements are 34.2% and 12.1%. As humans perceive tones, contour abstractions of tones can only be derived from F0 sequences, and tone recognition would be improved if syllable temporal organization was stable and predictable instead of fluctuating as seen in conversations.</p>","PeriodicalId":17168,"journal":{"name":"Journal of the Acoustical Society of America","volume":"156 5","pages":"3353-3372"},"PeriodicalIF":2.1000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.\",\"authors\":\"Yi-Fen Liu, Xiang-Li Lu\",\"doi\":\"10.1121/10.0034359\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Many speech characteristics, including conventional acoustic features such as mel frequency cepstrum coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, are used in a deep neural network or successfully fine-tuned with a connectionist temporal classification for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information on syllables and words. Key components of the model framework are the contour encoder and rhythm encoder, as well as the cross-attention between contours and rhythms established in the interaction encoder. Using contextual tonal contours as a reference, as well as rhythmic information derived from duration variations to consolidate more on contour representations for tone recognition, TNet-Full achieves absolute performance improvements of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%) when compared to a naive, simple baseline transformer, TNet-base. The relative improvements are 34.2% and 12.1%. 
As humans perceive tones, contour abstractions of tones can only be derived from F0 sequences, and tone recognition would be improved if syllable temporal organization was stable and predictable instead of fluctuating as seen in conversations.</p>\",\"PeriodicalId\":17168,\"journal\":{\"name\":\"Journal of the Acoustical Society of America\",\"volume\":\"156 5\",\"pages\":\"3353-3372\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the Acoustical Society of America\",\"FirstCategoryId\":\"101\",\"ListUrlMain\":\"https://doi.org/10.1121/10.0034359\",\"RegionNum\":2,\"RegionCategory\":\"物理与天体物理\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the Acoustical Society of America","FirstCategoryId":"101","ListUrlMain":"https://doi.org/10.1121/10.0034359","RegionNum":2,"RegionCategory":"物理与天体物理","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Learning and consolidating the contextualized contour representations of tones from F0 sequences and durational variations via transformers.
Many speech characteristics, including conventional acoustic features such as mel-frequency cepstral coefficients and mel-spectrograms, as well as pre-trained contextualized acoustic representations such as wav2vec2.0, have been used in deep neural networks or fine-tuned with connectionist temporal classification for Mandarin tone classification. In this study, the authors propose a transformer-based tone classification architecture, TNet-Full, which uses estimated fundamental frequency (F0) values and aligned boundary information for syllables and words. The key components of the framework are a contour encoder, a rhythm encoder, and an interaction encoder that establishes cross-attention between contours and rhythms. By using contextual tonal contours as a reference, together with rhythmic information derived from duration variations, to consolidate the contour representations for tone recognition, TNet-Full achieves absolute performance improvements over a naive baseline transformer, TNet-base, of 24.4% for read speech (from 71.4% to 95.8%) and 6.3% for conversational speech (from 52.1% to 58.4%); the corresponding relative improvements are 34.2% (24.4/71.4) and 12.1% (6.3/52.1). As in human tone perception, contour abstractions of tones can be derived only from F0 sequences, and tone recognition improves when the temporal organization of syllables is stable and predictable rather than fluctuating as it does in conversation.
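The abstract names three components, a contour encoder, a rhythm encoder, and an interaction encoder with cross-attention between contour and rhythm representations, but gives no implementation detail. The PyTorch sketch below is therefore only a hedged illustration of how such a design could be wired together; the class name ToneClassifierSketch, all dimensions, the three toy duration features, the five-way tone output, and the use of nn.MultiheadAttention for the contour-rhythm cross-attention are assumptions, not the authors' published TNet-Full.

```python
# Hypothetical sketch of a TNet-Full-style tone classifier.
# Shapes, hyperparameters, and module choices are illustrative assumptions,
# not the architecture described in the paper.
import torch
import torch.nn as nn


class ToneClassifierSketch(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_tones=5):
        super().__init__()
        # Contour encoder: self-attention over per-frame F0 values
        # (a 1-D F0 estimate projected to d_model).
        self.f0_proj = nn.Linear(1, d_model)
        contour_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.contour_encoder = nn.TransformerEncoder(contour_layer, n_layers)

        # Rhythm encoder: self-attention over per-syllable duration features
        # derived from aligned syllable/word boundaries (3 toy features here).
        self.dur_proj = nn.Linear(3, d_model)
        rhythm_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.rhythm_encoder = nn.TransformerEncoder(rhythm_layer, n_layers)

        # Interaction encoder: cross-attention in which contour representations
        # query the rhythm representations.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Five-way output as an assumed label set (four Mandarin tones + neutral).
        self.classifier = nn.Linear(d_model, n_tones)

    def forward(self, f0_seq, dur_feats):
        # f0_seq:    (batch, frames, 1)    estimated F0 per frame
        # dur_feats: (batch, syllables, 3) duration-derived rhythm features
        contour = self.contour_encoder(self.f0_proj(f0_seq))
        rhythm = self.rhythm_encoder(self.dur_proj(dur_feats))
        fused, _ = self.cross_attn(query=contour, key=rhythm, value=rhythm)
        # Pool over time and predict one tone label per input segment.
        return self.classifier(fused.mean(dim=1))


if __name__ == "__main__":
    model = ToneClassifierSketch()
    f0 = torch.randn(2, 60, 1)    # two segments, 60 F0 frames each
    dur = torch.randn(2, 8, 3)    # two segments, 8 syllables each
    print(model(f0, dur).shape)   # torch.Size([2, 5])
```

Whatever the actual implementation, the design choice the abstract emphasizes is that rhythm (duration) information serves only to consolidate the contour representations, which remain the primary evidence for tone identity.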
About the journal:
Since 1929, The Journal of the Acoustical Society of America has been the leading source of theoretical and experimental research results in the broad interdisciplinary study of sound. Subject coverage includes: linear and nonlinear acoustics; aeroacoustics, underwater sound, and acoustical oceanography; ultrasonics and quantum acoustics; architectural and structural acoustics and vibration; speech, music, and noise; psychology and physiology of hearing; engineering acoustics and transduction; bioacoustics and animal bioacoustics.