Formant diphone parameter extraction utilising a labelled single-speaker database

R. Mannell
5th International Conference on Spoken Language Processing (ICSLP 1998)
Published: 1998-11-30
DOI: 10.21437/ICSLP.1998-36 (https://doi.org/10.21437/ICSLP.1998-36)
Citations: 23

Abstract

This paper examines a method for formant parameter extraction from a labelled single-speaker database for use in a formant-parameter diphone-concatenation speech synthesis system. The procedure commences with an initial formant analysis of the labelled database, which is then used to obtain formant (F1-F5) probability spaces for each phoneme. These probability spaces guide a more careful speaker-specific extraction of formant frequencies. An analysis-by-synthesis procedure is then used to provide best-matching formant intensity and bandwidth parameters. The great majority of the parameters so extracted produce speech which is highly intelligible and which has a voice quality close to that of the original speaker.

Synthesis techniques based upon LPC-parameter or waveform concatenation are much less vulnerable to the effects of poorly extracted parameters. The formant model is, however, more straightforwardly related to the source-filter model and thus to speech production. Whilst it is true that overlap-add concatenation of waveform-based diphones can easily model a voice with quite high fidelity, new voices and voice qualities require the recording of new speakers (or the same speaker utilising a different voice quality) and the extraction of a new diphone database. Such systems can be used to examine the effects of intonation and rhythm on voice quality or vocal affect, but formant-based systems can much more readily examine the effect of frequency-domain modifications on voice quality. Such modifications might include formant frequency shifting, bandwidth modification, modification of relative formant intensities and spectral slope variation. It is even possible, if the synthesiser design allows it, to experiment with the insertion of additional poles and zeroes into the spectrum, such as might occur when modelling the "singer's formant" for certain styles of singing voice.

Such research requires a parallel formant synthesiser with a great deal of flexibility of control. Further, and most importantly, it requires a diphone database that is extremely accurate. Formant errors must be minor and few in number, and this should be achieved without excessive hand correction. Formant tracks should display, as far as possible, pole continuity across fricatives, stops and affricates. Extracted intensities and
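
The idea of letting per-phoneme formant probability spaces guide a second, more careful extraction pass can be sketched in a few lines. The text above does not specify how the probability spaces are represented, so this sketch assumes, purely for illustration, an independent Gaussian per formant per phoneme, and it uses three formants rather than F1-F5 to keep the example short. The function names `fit_spaces` and `pick_formants` and all frequency values are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations

def fit_spaces(initial_tracks):
    """Build a per-phoneme 'probability space' from a first-pass analysis.

    initial_tracks: {phoneme: [[F1, F2, F3], ...]} in Hz.
    Returns {phoneme: [(mean, std), ...]}, one pair per formant.
    Assumption: each formant is modelled as an independent Gaussian.
    """
    spaces = {}
    for phoneme, frames in initial_tracks.items():
        stats = []
        for k in range(len(frames[0])):
            values = [frame[k] for frame in frames]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            # Floor the standard deviation so the score never divides by zero.
            stats.append((mean, max(math.sqrt(var), 1.0)))
        spaces[phoneme] = stats
    return spaces

def log_score(assignment, stats):
    """Gaussian log-likelihood of an assignment, up to a constant."""
    return sum(-0.5 * ((f - m) / s) ** 2 for f, (m, s) in zip(assignment, stats))

def pick_formants(candidates, stats):
    """Pick the ascending subset of candidate peaks that best fits stats.

    Requires at least len(stats) candidates; otherwise max() raises ValueError.
    """
    subsets = combinations(sorted(candidates), len(stats))
    return list(max(subsets, key=lambda a: log_score(a, stats)))

# Made-up first-pass formant values (Hz) for one phoneme label.
initial = {"i": [[280, 2250, 2900], [300, 2300, 3000], [320, 2350, 3100]]}
spaces = fit_spaces(initial)

# Candidate spectral peaks for one frame; 1200 Hz stands in for a spurious peak.
candidates = [310, 1200, 2280, 3050]
print(pick_formants(candidates, spaces["i"]))  # prints [310, 2280, 3050]
```

Scoring whole ascending subsets, rather than choosing each formant independently, keeps the formants ordered and stops one peak from being assigned to two formants. A fuller version would also need a continuity term between neighbouring frames, which this sketch omits.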