Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10309  Pages: 4342-4346
An Alignment Method Leveraging Articulatory Features for Mispronunciation Detection and Diagnosis in L2 English
Qi Chen, Binghuai Lin, Yanlu Xie
Mispronunciation Detection and Diagnosis (MD&D) technology detects mispronunciations and provides feedback. Most MD&D systems are based on phoneme recognition, yet few studies have exploited the predefined reference text that is given to second-language (L2) learners while they practice pronunciation. In this paper, we propose a novel alignment method based on linguistic knowledge of articulatory manner and place to align the phone sequence of the reference text with the L2 learner's speech. Given the alignment, we concatenate the corresponding phoneme embedding with the acoustic features of each speech frame as model input. This makes reasonable use of the reference text as extra input: experimental results show that the model implicitly learns valid information from the reference text while avoiding misleading information that would cause false acceptances (FA). In addition, the method incorporates articulatory features, which helps the model recognize phonemes. Evaluated on the L2-ARCTIC dataset, our approach improves the F1-score over the state-of-the-art system by 4.9% relative.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-284  Pages: 4651-4655
Investigation on the Band Importance of Phase-aware Speech Enhancement
Z. Zhang, D. Williamson, Yi Shen
Many existing phase-aware speech enhancement algorithms treat the phase at all spectral frequencies as equally important to perceptual quality and intelligibility. Although improvements over phase-insensitive approaches are observed on both objective and subjective measures, it is not clear whether phase information is equally important across the frequency spectrum. In this paper, we investigate the importance of estimating phase in different spectral regions by conducting a pairwise listening study to determine whether phase enhancement can be limited to certain frequency bands. Our experimental results suggest that estimating phase at the lower-frequency bands matters most for speech quality in normal-hearing (NH) listeners. We further propose a hybrid deep-learning framework that adopts two sub-networks to handle phase differently across the spectrum. The proposed hybrid network significantly improves compatibility with low-resource platforms while achieving performance superior to the original phase-aware speech enhancement approaches.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10821  Pages: 3508-3512
The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction
J. Barker, M. Akeroyd, T. Cox, J. Culling, J. Firth, S. Graetzer, Holly Griffiths, Lara Harris, G. Naylor, Zuzanna Podwinska, Eszter Porter, R. V. Muñoz
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-741  Pages: 4157-4161
Audio-Visual Scene Classification Based on Multi-modal Graph Fusion
Hancheng Lei, Ning-qiang Chen
The Audio-Visual Scene Classification (AVSC) task aims to classify scenes through joint analysis of the audio and video modalities. Most existing AVSC models are based on feature-level or decision-level fusion, which raises two problems: i) because the feature distributions of the two modalities differ substantially, directly concatenating them in feature-level fusion may not perform well; ii) decision-level fusion cannot fully exploit the common and complementary properties of the features and the corresponding similarities across modalities. To address these problems, a Graph Convolutional Network (GCN)-based multi-modal fusion algorithm is proposed for the AVSC task. First, a Deep Neural Network (DNN) is trained to extract essential features from each modality. Then, a Sample-to-Sample Cross Similarity Graph (SSCSG) is constructed from the features of each modality. Finally, a DynaMic GCN (DM-GCN) and an ATtention GCN (AT-GCN) are introduced to realize feature-level and similarity-level fusion, respectively. Experimental results on the TAU Audio-Visual Urban Scenes 2021 development dataset demonstrate that the proposed scheme, called AVSC-MGCN, achieves higher classification accuracy and lower computational complexity than state-of-the-art schemes.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-136  Pages: 2403-2407
Oktoechos Classification in Liturgical Music Using SBU-LSTM/GRU
R. Rajan, Ananya Ayasi
A distinguishing feature of the music repertoire of the Syrian tradition is its system of classifying melodies into eight tunes, called the 'oktoechos'. It has inspired many traditions, such as Greek and Indian liturgical music. In the oktoechos tradition, liturgical hymns are sung in eight modes or eight colours (known regionally as the eight 'niram'). In this paper, automatic oktoechos genre classification is addressed using musical texture features (MTF), i-vectors, and Mel-spectrograms through stacked bidirectional and unidirectional long short-term memory (SBU-LSTM) and GRU (SB-GRU) architectures. The performance of the proposed approaches is evaluated on a newly created corpus of liturgical music in Malayalam. The SBU-LSTM and SB-GRU frameworks report average classification accuracies of 88.19% and 87.50%, respectively, a significant margin over other frameworks. The experiments demonstrate the potential of stacked architectures for learning temporal information from MTF for the proposed task.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-10600  Pages: 4187-4191
WideResNet with Joint Representation Learning and Data Augmentation for Cover Song Identification
Shichao Hu, Bin Zhang, Jinhong Lu, Yiliang Jiang, Wucheng Wang, Lingchen Kong, Weifeng Zhao, Tao Jiang
Cover song identification (CSI) is a challenging task and an important topic in the music information retrieval (MIR) community. In recent years, CSI has been extensively studied with deep learning methods. In this paper, we propose a novel framework for CSI based on joint representation learning inspired by multi-task learning. Specifically, we propose a joint learning strategy that combines classification and metric learning to optimize a WideResNet-based cover song model, called LyraC-Net. The classification objective learns separable embeddings for different classes, while metric learning optimizes embedding similarity by decreasing the intra-class distance and increasing the inter-class separability. This joint optimization strategy is expected to learn a more robust cover song representation than methods with a single training objective. For metric learning, a prototypical network is introduced together with a triplet loss to stabilize and accelerate training. Furthermore, we introduce SpecAugment, a popular augmentation method in speech recognition, to further improve performance. Experimental results show that our proposed method achieves promising results and outperforms other recent CSI methods in the evaluations.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-468  Pages: 931-935
Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Feifei Xiong, Weiguang Chen, P. Wang, Xiaofei Li, Jinwei Feng
This paper presents an improved subband neural network for joint speech denoising and dereverberation in online single-channel scenarios. Preserving the advantages of the subband model (SubNet), which processes each frequency band independently and requires few resources for good generalization, the proposed framework, named STSubNet, exploits spectro-temporal receptive fields (STRFs) of the speech spectrum via a two-dimensional convolution network cooperating with a bidirectional long short-term memory network across frequency bands, further improving the network's discrimination between the desired speech component and undesired interference, including noise and reverberation. The importance of this STRF extractor is analyzed by evaluating the contribution of each module to STSubNet's simultaneous denoising and dereverberation performance. Experimental results show that STSubNet outperforms other subband variants and achieves competitive performance compared to state-of-the-art models on two public benchmark test sets.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-11189  Pages: 2293-2297
Linguistically Informed Post-processing for ASR Error Correction in Sanskrit
Rishabh Kumar, D. Adiga, R. Ranjan, A. Krishna, Ganesh Ramakrishnan, Pawan Goyal, P. Jyothi
We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies with search-space enrichment using linguistic information. More specifically, to address the challenges posed by the language's high rate of out-of-vocabulary entries, we first use a subword-based language model and acoustic model to generate a search space. The resulting search space is converted into a word-based search space and further enriched with morphological and lexical information from a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our approach reports the current state-of-the-art results in Sanskrit ASR, with a 7.18 point absolute reduction in WER compared to the previous state-of-the-art.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-697  Pages: 2048-2052
Bring Dialogue-Context into RNN-T for Streaming ASR
Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma
Recently, conversational end-to-end (E2E) automatic speech recognition (ASR) models, which directly integrate dialogue context such as historical utterances into E2E models, have shown superior performance to single-utterance E2E models. However, few works have investigated how to inject dialogue context into the recurrent neural network transducer (RNN-T) model. In this work, we bring dialogue context into a streaming RNN-T model and explore various structures of the contextual RNN-T model, as well as training strategies to better utilize the dialogue context. First, we propose a deep fusion architecture that efficiently integrates the dialogue context within the encoder and predictor of the RNN-T. Second, we propose joint training of contextual and non-contextual models as regularization, and propose context perturbation to alleviate the context mismatch between training and inference. Moreover, we adopt a context-aware language model (CLM) for contextual RNN-T decoding to take full advantage of the dialogue context in conversational ASR. We conduct experiments on the Switchboard-2000h task and observe performance gains from the proposed techniques. Compared with the non-contextual RNN-T, our contextual RNN-T model yields 4.8% / 6.0% relative improvement on the Switchboard and CallHome Hub5'00 test sets. By additionally integrating a CLM, the gain is further increased to 10.6% / 7.8%.
Pub Date: 2022-09-18  DOI: 10.21437/interspeech.2022-11152  Pages: 1631-1635
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks
Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Shogo Seki
Neural vocoders have recently become popular in text-to-speech synthesis and voice conversion, increasing the demand for efficient neural vocoders. One successful approach is HiFi-GAN, which achieves high-fidelity audio synthesis with a relatively small model. This characteristic is obtained using a generator that incorporates multi-receptive field fusion (MRF) with multiple branches of residual blocks, allowing the description capacity to be expanded with few-channel convolutions. However, MRF requires the model size to grow with the number of branches. As an alternative, we propose a network called MISRNet, which incorporates a novel module called the multi-input single shared residual block (MISR). MISR enlarges the description capacity by enriching the input variation using lightweight convolutions with a kernel size of 1, while reducing the residual blocks from multiple branches to a single shared one. Because the input convolutions are significantly smaller than the residual blocks, MISR reduces the model size compared with MRF. Furthermore, we introduce an implementation technique for MISR that accelerates processing through tensor reshaping. We experimentally applied our ideas to lightweight variants of HiFi-GAN and iSTFTNet, making the models more lightweight with comparable speech quality and without compromising speed.