Authors: Yu-Yu Hsiao, Ming-Hsuan Wu, Kuan-Yu Tsai, J. Hung
Venue: 2022 8th International Conference on Applied System Innovation (ICASI)
Published: 2022-04-22
DOI: 10.1109/ICASI55125.2022.9774458 (https://doi.org/10.1109/ICASI55125.2022.9774458)
The preliminary study of improving the DPTNet speech enhancement system by adjusting its encoder and loss function
This study analyzes the well-known speech enhancement (SE) method, the Dual-Path Transformer Network (DPTNet), and revises its arrangement to obtain better performance. DPTNet consists of three parts: an encoder, a separation layer, and a decoder. The encoder creates features from the input speech signal. The separation layer consists mainly of two improved Transformers that perform mask-based separation of speech and noise on the encoded features. Finally, the decoder reconstructs the speech signal from the masked features. We modify DPTNet in two ways. First, we concatenate time- and frequency-domain features and feed them into a bottleneck block to create a compact feature representation. Second, we test several widely used loss functions at the decoder output and find that the hybrid loss used in another deep SE network, DEMUCS, performs best. With these changes, the revised system achieves 2.85 in PESQ and 0.945 in STOI on the VoiceBank-DEMAND test set, where PESQ and STOI measure speech quality and intelligibility, respectively.
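The abstract does not spell out the hybrid loss, so the following is only a rough sketch of what such a loss might look like. It assumes the DEMUCS-style form: an L1 loss on the waveform plus a multi-resolution STFT loss (spectral convergence plus log-magnitude distance). The function names, the two STFT resolutions, the Hann window, and the weight `stft_weight` are illustrative assumptions, not values taken from the paper. A naive DFT is used to keep the example dependency-free; a real training setup would use a framework's STFT instead.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Magnitude spectrogram from a Hann-windowed naive DFT (slow, illustration only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def stft_loss(estimate, target, frame_len, hop, eps=1e-8):
    """Spectral convergence plus mean log-magnitude distance at one resolution."""
    est = [v for frame in stft_magnitude(estimate, frame_len, hop) for v in frame]
    ref = [v for frame in stft_magnitude(target, frame_len, hop) for v in frame]
    # Spectral convergence: norm of the magnitude error relative to the target norm.
    num = math.sqrt(sum((r - e) ** 2 for e, r in zip(est, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref)) + eps
    # Log-magnitude term: mean absolute difference of log magnitudes.
    log_mag = sum(abs(math.log(r + eps) - math.log(e + eps))
                  for e, r in zip(est, ref)) / len(ref)
    return num / den + log_mag


def hybrid_loss(estimate, target, resolutions=((64, 16), (128, 32)), stft_weight=1.0):
    """Assumed hybrid loss: waveform L1 plus averaged multi-resolution STFT loss.

    Signals must be at least as long as the largest frame length in `resolutions`.
    """
    l1 = sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)
    spectral = sum(stft_loss(estimate, target, n, h) for n, h in resolutions) / len(resolutions)
    return l1 + stft_weight * spectral


# Toy usage: a clean tone and a copy with a small interfering tone added.
clean = [math.sin(2 * math.pi * 5 * n / 256) for n in range(256)]
noisy = [c + 0.1 * math.sin(2 * math.pi * 40 * n / 256) for n, c in enumerate(clean)]
print(hybrid_loss(noisy, clean))
```

Combining a time-domain term with spectral terms at several resolutions penalizes both sample-level error and time-frequency mismatch, which is one plausible reason such a loss could suit an encoder that mixes time- and frequency-domain features.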