Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing

Giridhar Pamisetty, S. Kodukula
{"title":"自动配音中使用语音内外韵律对齐的字幕合成","authors":"Giridhar Pamisetty, S. Kodukula","doi":"10.1109/NCC55593.2022.9806799","DOIUrl":null,"url":null,"abstract":"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing\",\"authors\":\"Giridhar Pamisetty, S. Kodukula\",\"doi\":\"10.1109/NCC55593.2022.9806799\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. 
The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.\",\"PeriodicalId\":403870,\"journal\":{\"name\":\"2022 National Conference on Communications (NCC)\",\"volume\":\"66 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 National Conference on Communications (NCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NCC55593.2022.9806799\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806799","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Automatic dubbing, or machine dubbing, is the process of replacing the speech in a source video with speech in the desired language, synthesized using a text-to-speech (TTS) system. The synthesized speech should align with the events in the source video to give a realistic experience. Most existing prosodic alignment methods operate on the already synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach to prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level does not degrade the naturalness of the synthesized speech. We use both inter- and intra-utterance alignment in the prosodic alignment process. Performing alignment at the feature level requires control over phoneme durations, so that the synthesized speech can be synchronized with the source speech. We therefore use the Prosody-TTS system, which allows the phoneme durations and the fundamental frequency (f0) to be controlled during synthesis. A subjective evaluation of the translated audiovisual content (lecture videos) yielded a mean opinion score (MOS) of 4.104, indicating the effectiveness of the proposed prosodic alignment process.
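The abstract does not spell out the alignment algorithm itself, but the core idea of inter-utterance alignment at the feature level can be illustrated with a small sketch: rescale the phoneme durations predicted by the TTS front end so that each synthesized utterance occupies the same time span as its corresponding source-speech segment, before waveform generation. This is only an illustrative sketch under that assumption, not the authors' exact method; the function and parameter names below (e.g. `align_utterance_durations`, `target_duration`) are hypothetical and do not correspond to the Prosody-TTS interface.

```python
# Illustrative sketch (not the paper's exact algorithm): inter-utterance
# prosodic alignment by rescaling per-phoneme durations at the feature level
# so the synthesized utterance fits the source segment's time slot.
from typing import List


def align_utterance_durations(predicted_durations: List[float],
                              target_duration: float,
                              min_phoneme_duration: float = 0.02) -> List[float]:
    """Scale per-phoneme durations (seconds) so their sum matches the
    duration of the corresponding source-speech segment."""
    total = sum(predicted_durations)
    if total <= 0:
        raise ValueError("Predicted durations must sum to a positive value.")
    scale = target_duration / total
    # Clamp so no phoneme becomes unrealistically short after scaling.
    scaled = [max(d * scale, min_phoneme_duration) for d in predicted_durations]
    # Renormalize after clamping so the total still matches the target.
    scaled_total = sum(scaled)
    return [d * (target_duration / scaled_total) for d in scaled]


if __name__ == "__main__":
    # Example: a 1.8 s source segment; durations predicted by the TTS front end.
    durations = [0.12, 0.08, 0.15, 0.10, 0.20, 0.09, 0.11]
    aligned = align_utterance_durations(durations, target_duration=1.8)
    print([round(d, 3) for d in aligned], "sum =", round(sum(aligned), 3))
```

Intra-utterance alignment would additionally redistribute durations within an utterance (for example, around pauses or emphasized words) rather than applying one uniform scale, but the specific scheme used in the paper is not detailed in the abstract.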