Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10277 | Interspeech 2022, pp. 2578-2582
Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu
This paper tackles the challenge of “live” one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming fashion while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specifically, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities. Experimental results demonstrate that our proposed method achieves performance comparable to offline VC solutions in speech naturalness, intelligibility and speaker similarity, while being efficient enough for practical real-time applications. Audio samples are available online for demonstration.
{"title":"Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion","authors":"Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu","doi":"10.21437/interspeech.2022-10277","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10277","url":null,"abstract":"This paper takes efforts to tackle the challenge of “live” oneshot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming way while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specially, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities Experimental results demonstrate that our proposed method achieves comparable performance on speech naturalness, intelligibility and speaker similarity with offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2578-2582"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45650340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10879 | Interspeech 2022, pp. 3003-3007
T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei
Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion that relies on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker’s voice by means of speaker adaptation better than other known multimodal systems of this kind, and that the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Moreover, it takes as little as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.
{"title":"A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling","authors":"T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei","doi":"10.21437/interspeech.2022-10879","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10879","url":null,"abstract":"Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion relying on a single diffusion probabilistic model with two encoders each operating on its input domain and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker’s voice by means of speaker adaptation better than other known multimodal systems of such kind and the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as few as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio which makes it attractive for practical applications.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3003-3007"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45997376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-378 | Interspeech 2022, pp. 2408-2412
Yuhang He, A. Markham
A fundamental task for an agent seeking to understand an environment acoustically is to detect each sound source's location (e.g., direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase differences; lastly, although the microphone sensors are spatially sparse, the recorded waveform is temporally dense due to high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features such as GCC-PHAT and Mel-spectrograms, so as to benefit from the success of mature 2D image-based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA, that is capable of learning sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA-relevant features into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy is then proposed to further learn semantic labels and DoA both separately and jointly. Finally, a permutation-invariant multi-track head is added to regress DoA and classify semantic labels. Extensive experimental results on the DCASE 2020 sound event localization and detection (SELD) dataset demonstrate the superiority of SoundDoA compared with other existing methods.
{"title":"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-378","url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47392486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-657 | Interspeech 2022, pp. 3558-3562
Sunmook Choi, Il-Youp Kwak, Seungsang Oh
Numerous IT companies around the world are developing and deploying artificial voice assistants in their products, but these assistants are still vulnerable to spoofing attacks. Since 2015, the competition “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof)” has been held every two years to encourage people to design systems that can detect spoofing attacks. In this paper, we focus on developing spoofing countermeasure systems based mainly on Convolutional Neural Networks (CNNs). However, CNNs are translation invariant, which may cause loss of frequency information when a spectrogram is used as input. Hence, we propose models which split inputs along the frequency axis: 1) an Overlapped Frequency-Distributed (OFD) model and 2) a Non-overlapped Frequency-Distributed (Non-OFD) model. Using the ASVspoof 2019 dataset, we measured their performance with two different activations: ReLU and Max feature map (MFM). The best-performing model on the LA dataset is the Non-OFD model with ReLU, which achieved an equal error rate (EER) of 1.35%, and the best-performing model on the PA dataset is the OFD model with MFM, which achieved an EER of 0.35%.
{"title":"Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure","authors":"Sunmook Choi, Il-Youp Kwak, Seungsang Oh","doi":"10.21437/interspeech.2022-657","DOIUrl":"https://doi.org/10.21437/interspeech.2022-657","url":null,"abstract":"Numerous IT companies around the world are developing and deploying artificial voice assistants via their products, but they are still vulnerable to spoofing attacks. Since 2015, the competition “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof)” has been held every two years to encourage people to design systems that can detect spoofing attacks. In this paper, we focused on developing spoofing countermeasure systems mainly based on Convolutional Neural Networks (CNNs). However, CNNs have translation invariant property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we pro-pose models which split inputs along the frequency axis: 1) Overlapped Frequency-Distributed (OFD) model and 2) Non-overlapped Frequency-Distributed (Non-OFD) model. Using ASVspoof 2019 dataset, we measured their performances with two different activations; ReLU and Max feature map (MFM). The best performing model on LA dataset is the Non-OFD model with ReLU which achieved an equal error rate (EER) of 1.35%, and the best performing model on PA dataset is the OFD model with MFM which achieved an EER of 0.35%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3558-3562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47675680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How do our eyebrows respond to masks and whispering? The case of Persians
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10867 | Interspeech 2022, pp. 2023-2027
Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh
Whispering is one of the mechanisms of human communication for conveying linguistic information. Due to the lack of vocal fold vibration, whispering differs acoustically from voiced speech in the absence of fundamental frequency, which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues relative to acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, and when they are expected to produce a prosodic modulation, as is the case in polar questions with rising intonation. To this end, we ran an experiment with 10 Persian speakers. The results show a greater mean eyebrow speed when speakers whisper, indicating a compensation effect for the lack of F0 in whispering. We also found more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motion in questions, suggesting that the question is a more marked utterance type than the statement. No significant effect on eyebrow furrowing was found. However, eyebrow movements were positively correlated with eyebrow widening, suggesting a mutual link between these two movement types.
{"title":"How do our eyebrows respond to masks and whispering? The case of Persians","authors":"Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh","doi":"10.21437/interspeech.2022-10867","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10867","url":null,"abstract":"Whispering is one of the mechanisms of human communication to convey linguistic information. Due to the lack of vocal fold vibration, whispering acoustically differs from the voiced speech in the absence of fundamental frequency which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, also, when they are supposed to produce a prosodic modulation as it is the case in polar questions with rising intonation. To this end, we run an experiment with 10 Persian speakers. The results show the greater mean speed when speakers whisper indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motions in questions suggesting the question is a more marked utterance type than a statement. No significant effect of eyebrow furrowing was found. However, eyebrow movements were positively correlated with the eyebrow widening suggesting a mutual link between these two movement types.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2023-2027"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47894190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11357 | Interspeech 2022, pp. 3218-3222
Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando
This paper proposes end-to-end joint modeling of conversation history-dependent and history-independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR applications such as meeting transcription but not in applications such as voice search. So far, these two kinds of ASR systems have been constructed individually with different models, which is inefficient when both applications must be served. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and history-independent processing; however, their history-independent performance is inferior to that of dedicated history-independent ASR systems, because their model architecture and training criterion are specialized for the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch between using and not using conversation histories. In addition, we propose multi-history training, which simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing within a unified architecture, yielding a model that is robust to a variety of conversational contexts as well as to their absence. Experiments on Japanese ASR tasks show that the proposed E2E joint model provides superior performance in both history-dependent and history-independent processing compared with conventional E2E ASR systems.
{"title":"End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training","authors":"Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando","doi":"10.21437/interspeech.2022-11357","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11357","url":null,"abstract":"This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not available in those such as voice search applications. So far, these two ASR systems have been individually constructed using different models, but this is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and independent processing. However, their performance is inferior to history-independent ASR systems. This is because the model architecture and its training criterion in the conventional conversation history-dependent ASR systems are specialized in the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch to use the conversation histories or not. In addition, we propose multi-history training that simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing by introduc-ing unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method. multi-history training which can produce a robust ASR model against both a variety of conversational contexts and none. Experimental results showed that the proposed E2E joint model provides superior performance in both history-dependent and independent ASR processing compared with conventional E2E-ASR systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3218-3222"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47910133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-878 | Interspeech 2022, pp. 4332-4336
Zhan Zhang, Yuehai Wang, Jianyi Yang
Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. So far, most existing CAPT methods are discriminative and focus on detecting where the mispronunciation is. Although learners receive feedback about their current pronunciation, they may still not be able to learn the correct pronunciation, and there has been little discussion of speech-based teaching in CAPT. To fill this gap, we propose a novel bidirectional CAPT method that detects mispronunciations and generates the corrected pronunciations simultaneously. This correction-based feedback better preserves the learner's speaking style, making the learning process more personalized. In addition, we propose to adopt normalizing flows to share the latent space between these two mirrored discriminative-generative tasks, making the whole model more compact. Experiments show that our method is effective for mispronunciation detection and can naturally correct the speech under different CAPT granularity requirements.
{"title":"BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows","authors":"Zhan Zhang, Yuehai Wang, Jianyi Yang","doi":"10.21437/interspeech.2022-878","DOIUrl":"https://doi.org/10.21437/interspeech.2022-878","url":null,"abstract":"Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. So far, most existing CAPT methods are discriminative and focus on detecting where the mispronunciation is. Although learners receive feedback about their current pronunciation, they may still not be able to learn the correct pronunciation. Nevertheless, there has been little discussion about speech-based teaching in CAPT. To fill this gap, we propose a novel bidirectional CAPT method to detect mispronunciations and generate the corrected pronunciations simultaneously. This correction-based feedback can better preserve the speaking style to make the learning process more personalized. In addition, we propose to adopt normalizing flows to share the latent for these two mirrored discriminative-generative tasks, making the whole model more compact. Experiments show that our method is efficient for mispronunciation detection and can naturally correct the speech under different CAPT granularity requirements.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4332-4336"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48927215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Homophone Disambiguation Profits from Durational Information
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10109 | Interspeech 2022, pp. 3198-3202
Barbara Schuppler, Emil Berger, Xenia Kogler, F. Pernkopf
Given the high degree of segmental reduction in conversational speech, a large number of words become homophonous that are not homophonous in read speech. For instance, the tokens considered in this study, ah, ach, auch, eine and er, may all be reduced to [a] in conversational Austrian German. Homophones pose a serious problem for automatic speech recognition (ASR), where homophone disambiguation is typically solved using lexical context. In contrast, we propose two approaches to disambiguate homophones on the basis of prosodic and spectral features. First, we build a Random Forest classifier with a large set of acoustic features; it reaches good performance given the small data size and allows us to gain insight into how these homophones differ in phonetic detail. Since annotations are required to extract these features, this approach would not be practical to integrate into an ASR system. We therefore explored a second approach based on a convolutional neural network (CNN). Its performance is on par with that of the Random Forest, and the results indicate a high potential for this approach to facilitate homophone disambiguation when combined with a stochastic language model as part of an ASR system.
{"title":"Homophone Disambiguation Profits from Durational Information","authors":"Barbara Schuppler, Emil Berger, Xenia Kogler, F. Pernkopf","doi":"10.21437/interspeech.2022-10109","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10109","url":null,"abstract":"Given the high degree of segmental reduction in conversational speech, a large number of words become homophoneous that in read speech are not. For instance, the tokens considered in this study ah , ach , auch , eine and er may all be reduced to [a] in conversational Austrian German. Homophones pose a serious problem for automatic speech recognition (ASR), where homophone disambiguation is typically solved using lexical context. In contrast, we propose two approaches to disambiguate homophones on the basis of prosodic and spectral features. First, we build a Random Forest classifier with a large set of acoustic features, which reaches good performance given the small data size, and allows us to gain insight into how these homophones are distinct with respect to phonetic detail. Since for the extraction of the features annotations are required, this approach would not be practical for the integration into an ASR system. We thus explored a second, convolutional neural network (CNN) based approach. The performance of this approach is on par with the one based on Random Forest, and the results indicate a high potential of this approach to facilitate homophone disambiguation when combined with a stochastic language model as part of an ASR system. durational","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3198-3202"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48930719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frame-Level Stutter Detection
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-204 | Interspeech 2022, pp. 2843-2847
John Harvill, M. Hasegawa-Johnson, C. Yoo
Previous studies on the detection of stuttered speech have focused on classification at the utterance level (e.g., for speech therapy applications) and on the correct insertion of stutter events, in sequence, into an orthographic transcript. In this paper, we propose the task of frame-level stutter detection, which seeks to identify the time alignment of stutter events in a speech utterance, and we evaluate our approach on the stutter correction task. Limited previous work on stutter correction has relied on simple signal processing techniques and has only been evaluated on small datasets. Our approach is the first large-scale data-driven technique proposed to identify stuttering probabilistically at the frame level, and we make use of the largest available stuttering dataset to date during training. Predicted frame-level probabilities of different stuttering events can be used in downstream Automatic Speech Recognition (ASR) applications either as additional features or as part of a speech preprocessing pipeline that cleans speech before analysis by an ASR system.
{"title":"Frame-Level Stutter Detection","authors":"John Harvill, M. Hasegawa-Johnson, C. Yoo","doi":"10.21437/interspeech.2022-204","DOIUrl":"https://doi.org/10.21437/interspeech.2022-204","url":null,"abstract":"Previous studies on the detection of stuttered speech have focused on classification at the utterance level (e.g., for speech therapy applications), and on the correct insertion of stutter events in sequence into an orthographic transcript. In this paper, we propose the task of frame-level stutter detection which seeks to identify the time alignment of stutter events in a speech ut-terance, and we evaluate our approach on the stutter correction task. Limited previous work on stutter correction has relied on simple signal processing techniques and only been evaluated on small datasets. Our approach is the first large scale data-driven technique proposed to identify stuttering probabilistically at the frame level, and we make use of the largest available stuttering dataset to date during training. Predicted frame-level probabilities of different stuttering events can be used in downstream applications for Automatic Speech Recognition (ASR) as either additional features or part of a speech preprocessing pipeline to clean speech before analysis by an ASR system.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2843-2847"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48948053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft-label Learn for No-Intrusive Speech Quality Assessment
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10400 | Interspeech 2022, pp. 3303-3307
Junyong Hao, Shunzhou Ye, Cheng Lu, Fei Dong, Jingang Liu, Dong Pi
Mean opinion score (MOS) is a widely used subjective metric for assessing the quality of speech, and collecting it usually involves multiple human raters judging each speech file. To reduce the labor cost of MOS, non-intrusive speech quality assessment methods have been extensively studied. However, due to the highly subjective bias of speech quality labels, it is difficult to train models that accurately represent speech quality scores. In this paper, we propose a convolutional self-attention neural network (Conformer) for MOS prediction of conference speech that effectively alleviates the disadvantage of subjective bias in model training. In addition to this architecture, we further improve the generalization and accuracy of the predictor by utilizing attention label pooling and soft-label learning. We demonstrate that our proposed method achieves an RMSE of 0.458 and a PLCC of 0.792 on the evaluation test set of the ConferencingSpeech 2022 Challenge.
{"title":"Soft-label Learn for No-Intrusive Speech Quality Assessment","authors":"Junyong Hao, Shunzhou Ye, Cheng Lu, Fei Dong, Jingang Liu, Dong Pi","doi":"10.21437/interspeech.2022-10400","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10400","url":null,"abstract":"Mean opinion score (MOS) is a widely used subjective metric to assess the quality of speech, and usually involves multiple human to judge each speech file. To reduce the labor cost of MOS, no-intrusive speech quality assessment methods have been extensively studied. However, due to the highly subjective bias of speech quality label, the performance of models to accurately represent speech quality scores is difficult to be trained. In this paper, we propose a convolutional self-attention neural network (Conformer) for MOS score prediction of conference speech to effectively alleviate the disadvantage of subjective bias on model training. In addition to this novel architecture, we further improve the generalization and accuracy of the predictor by utilizing attention label pooling and soft-label learning. We demonstrate that our proposed method achieves RMSE cost of 0.458 and PLCC score of 0.792 on evaluation test datasets of Conferencing Speech 2022 Challenge.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3303-3307"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44548627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}