Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing

Giridhar Pamisetty, S. Kodukula
{"title":"自动配音中使用语音内外韵律对齐的字幕合成","authors":"Giridhar Pamisetty, S. Kodukula","doi":"10.1109/NCC55593.2022.9806799","DOIUrl":null,"url":null,"abstract":"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Subtitle Synthesis using Inter and Intra utterance Prosodic Alignment for Automatic Dubbing\",\"authors\":\"Giridhar Pamisetty, S. Kodukula\",\"doi\":\"10.1109/NCC55593.2022.9806799\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic dubbing or machine dubbing is the process of replacing the speech in the source video with the desired language speech, which is synthesized using a text-to-speech synthesis (TTS) system. The synthesized speech should align with the events in the source video to have a realistic experience. Most of the existing prosodic alignment processes operate on the synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach for the prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level will not degrade the naturalness of the synthesized speech. We use both inter and intra utterance alignment in the prosodic alignment process. We should have control over the duration of the phonemes to perform alignment at the feature level to achieve synchronization between the synthesized and the source speech. So, we use the Prosody-TTS system to synthesize the speech, which has the provision to control the duration of phonemes and fundamental frequency (f0) during the synthesis. 
The subjective evaluation of the translated audiovisual content (lecture videos) resulted in a mean opinion score (MOS) of 4.104 that indicates the effectiveness of the proposed prosodic alignment process.\",\"PeriodicalId\":403870,\"journal\":{\"name\":\"2022 National Conference on Communications (NCC)\",\"volume\":\"66 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 National Conference on Communications (NCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/NCC55593.2022.9806799\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806799","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Automatic dubbing, or machine dubbing, is the process of replacing the speech in a source video with speech in the desired language, synthesized using a text-to-speech (TTS) system. The synthesized speech should align with the events in the source video to give a realistic experience. Most existing prosodic alignment methods operate on the already synthesized speech by controlling the speaking rate. In this paper, we propose subtitle synthesis, a unified approach to prosodic alignment that operates at the feature level. Modifying the prosodic parameters at the feature level does not degrade the naturalness of the synthesized speech. We use both inter- and intra-utterance alignment in the prosodic alignment process. Performing alignment at the feature level requires control over phoneme durations, so that the synthesized speech can be synchronized with the source speech. We therefore use the Prosody-TTS system, which allows the phoneme durations and the fundamental frequency (f0) to be controlled during synthesis. A subjective evaluation of the translated audiovisual content (lecture videos) yielded a mean opinion score (MOS) of 4.104, indicating the effectiveness of the proposed prosodic alignment process.
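The abstract does not spell out the alignment algorithm itself, but the core idea of inter-utterance alignment at the feature level can be illustrated with a small sketch: rescale the phoneme durations predicted by the TTS front end so that each synthesized utterance occupies the same time span as its corresponding source-speech segment, before waveform generation. This is only an illustrative sketch under that assumption, not the authors' exact method; the function and parameter names below (e.g. `align_utterance_durations`, `target_duration`) are hypothetical and do not correspond to the Prosody-TTS interface.

```python
# Illustrative sketch (not the paper's exact algorithm): inter-utterance
# prosodic alignment by rescaling per-phoneme durations at the feature level
# so the synthesized utterance fits the source segment's time slot.
from typing import List


def align_utterance_durations(predicted_durations: List[float],
                              target_duration: float,
                              min_phoneme_duration: float = 0.02) -> List[float]:
    """Scale per-phoneme durations (seconds) so their sum matches the
    duration of the corresponding source-speech segment."""
    total = sum(predicted_durations)
    if total <= 0:
        raise ValueError("Predicted durations must sum to a positive value.")
    scale = target_duration / total
    # Clamp so no phoneme becomes unrealistically short after scaling.
    scaled = [max(d * scale, min_phoneme_duration) for d in predicted_durations]
    # Renormalize after clamping so the total still matches the target.
    scaled_total = sum(scaled)
    return [d * (target_duration / scaled_total) for d in scaled]


if __name__ == "__main__":
    # Example: a 1.8 s source segment; durations predicted by the TTS front end.
    durations = [0.12, 0.08, 0.15, 0.10, 0.20, 0.09, 0.11]
    aligned = align_utterance_durations(durations, target_duration=1.8)
    print([round(d, 3) for d in aligned], "sum =", round(sum(aligned), 3))
```

Intra-utterance alignment would additionally redistribute durations within an utterance (for example, around pauses or emphasized words) rather than applying one uniform scale, but the specific scheme used in the paper is not detailed in the abstract.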