Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-443
Tuan Vu Ho, Q. Nguyen, M. Akagi, M. Unoki
Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter. In this method, a noise-robust vector-quantized variational autoencoder estimates the magnitude of the Wiener filter using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated with a convolutional recurrent network trained under a scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech-enhancement methods and achieved a Perceptual Evaluation of Speech Quality score of 2.85 and a Short-Time Objective Intelligibility score of 0.94, outperforming the state-of-the-art cIRM-based method from the 2020 Deep Noise Suppression Challenge.
{"title":"Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement","authors":"Tuan Vu Ho, Q. Nguyen, M. Akagi, M. Unoki","doi":"10.21437/interspeech.2022-443","DOIUrl":"https://doi.org/10.21437/interspeech.2022-443","url":null,"abstract":"Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method through estimating the magnitude and phase of a complex adaptive Wiener filter. With this method, a noise-robust vector-quantized variational autoencoder is used for estimating the magnitude of the Wiener filter by using the Itakura-Saito divergence on the time-frequency domain, while the phase of the Wiener filter is estimated using a convolutional recurrent network using the scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech-enhancement methods and achieved a Perceptual Evaluation of Speech Quality score of 2.85 and ShortTime Objective Intelligibility score of 0.94, which is better than the stateof-art method based on cIRM estimation during the 2020 Deep Noise Challenge.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"176-180"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42627367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10772
Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, T. Quatieri
Parkinson’s disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require high clinician involvement. Thus, there is a need to develop new tools leading to quick and accurate determination of cognitive impairment to allow for appropriate, timely interventions. In this paper, individuals with PD, designated as either having no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol, involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features, derived from correlations across acoustic features representative of speech production subsystems. The correlation-based features are utilized in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single- and dual-task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single and dual task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
{"title":"Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks","authors":"Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, T. Quatieri","doi":"10.21437/interspeech.2022-10772","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10772","url":null,"abstract":"Parkinson’s disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require high clinician involvement. Thus, there is a need to develop new tools leading to quick and accurate determination of cognitive impairment to allow for appropriate, timely interventions. In this paper, individuals with PD, designated as either having no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol, involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features, derived from correlations across acoustic features representative of speech production subsystems. The correlation-based features are utilized in gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single and dual task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single and dual task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability. 1","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2258-2262"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42743244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11347
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang
Slot filling is an essential component of Spoken Language Understanding. In contrast to conventional pipeline approaches, which extract slots from the ASR output, end-to-end approaches extract slots directly from speech within a classification or generation framework. However, classification relies on predefined categories, which is not scalable, and generative models decode in an open-domain space and suffer from the blurred boundaries of slots in speech. To address the shortcomings of these two formulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, leveraging a limited generation method with boundary detection. We also released a large-scale Chinese spoken slot filling dataset named Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with a 6.65% accuracy improvement. We make our code publicly available for researchers to replicate and build on our work.
{"title":"Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech","authors":"Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang","doi":"10.21437/interspeech.2022-11347","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11347","url":null,"abstract":"Slot filling is an essential component of Spoken Language Understanding. In contrast to conventional pipeline approaches, which extract slots from the ASR output, end-to-end approaches directly get slots from speech within a classification or generation framework. However, classification relies on predefined categories, which is not scal-able, and the generative model is decoding in an open-domain space, suffering from blurred boundaries of slots in speech. To address the shortcomings of these two for-mulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, leveraging a limited generation method with boundary detection. We also released a large-scale Chinese spoken slot filling dataset named Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with 6.65% accuracy improvement. We make our code 1 publicly available for researchers to replicate and build on our work.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2748-2752"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42908253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10617
Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. The decisions taken by the model depend on the structure of the source/target language and the information contained in the partial input sequence. Hence, the read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input for improving simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of the SimulMT task in improving SimulST performance. Overall, we achieve a 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.
{"title":"Cross-Modal Decision Regularization for Simultaneous Speech Translation","authors":"Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim","doi":"10.21437/interspeech.2022-10617","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10617","url":null,"abstract":"Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. The decisions taken by the model depend on the structure of source/target language and the information contained in the partial input sequence. Hence, read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input for improving simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of SimulMT task in improving SimulST performance. Overall, we achieve 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"116-120"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43012436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-327
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP through training on large corpora of written text, spoken language containing serious ASR errors that change its meaning remains difficult for them to understand. We propose a method for pre-training Japanese LMs that are robust against ASR errors without using ASR. The proposed method uses only written text: sentences containing pseudo-ASR errors are generated with a pseudo-error dictionary constructed from neural grapheme-to-phoneme and phoneme-to-grapheme models. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed the LM pre-trained with standard masked language modeling by 3.17 points on ROUGE-L when fine-tuning with dialogues including ASR errors.
{"title":"Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion","authors":"Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida","doi":"10.21437/interspeech.2022-327","DOIUrl":"https://doi.org/10.21437/interspeech.2022-327","url":null,"abstract":"Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP by training on large corpora of written texts; spoken language with serious ASR errors that change its meaning is difficult to understand. We propose a method for pre-training Japanese LMs robust against ASR errors without using ASR. With the proposed method using written texts, sentences containing pseudo-ASR errors are generated using a pseudo-error dictionary constructed using grapheme-to-phoneme and phoneme-to-grapheme models based on neural networks. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed the LM pre-trained with standard masked language modeling by 3.17 points on ROUGE-L when fine-tuning with dialogues including ASR errors.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2688-2692"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47792200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11288
Andreas Liesenfeld, Mark Dingemanse
Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and the recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, and (iii) most languages appear to make available a broad distinction between a minimal nasal format ‘mm’ and a fuller ‘yeah’-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.
{"title":"Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages","authors":"Andreas Liesenfeld, Mark Dingemanse","doi":"10.21437/interspeech.2022-11288","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11288","url":null,"abstract":"Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turns formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the at-tested variation, (ii) distinctions between formats can be gradient rather than discrete, (iii) most languages appear to make available a broad distinction between a minimal nasal format ‘mm’ and a fuller ‘yeah’-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1126-1130"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48906637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-577
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two training stages, in which a spectrum-generation process is indispensable. In this paper, we propose the FlowCPCVC system, which results in higher speech naturalness and timbre similarity. To our knowledge, FlowCPCVC is the first one-stage training system for the any-to-any task, taking advantage of a VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard timbre while keeping linguistic information. Our method directly incorporates the vocoder into the training, thus avoiding the loss of spectral information that occurs in two-stage training. Trained for the any-to-any task, the system also yields robust results when applied to any-to-many conversion. Experiments show that FlowCPCVC achieves a clear improvement over VQMIVC, the current state-of-the-art any-to-any voice conversion system. Our demo is available online.
{"title":"FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion","authors":"Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang","doi":"10.21437/interspeech.2022-577","DOIUrl":"https://doi.org/10.21437/interspeech.2022-577","url":null,"abstract":"Recently, the research of any-to-any voice conversion(VC) has been developed rapidly. However, they often suffer from unsat-isfactory quality and require two stages for training, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which results in higher speech naturalness and timbre similarity. FlowCPCVC is the first one-stage training system for any-to-any task in our knowledge by taking advantage of VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding(CPC) based content extractor to guide the flow module to discard the timbre and keeping the linguistic information. Our method directly incorporates the vocoder into the training, thus avoiding the loss of spectral information as in two-stage training. With a fancy method in training any-to-any task, we can also get robust results when using it in any-to-many conversion. Experiments show that FlowCPCVC achieves obvious improvement when compared to VQMIVC which is current state-of-the-art any-to-any voice conversion system. Our demo is available online 1 .","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2558-2562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48445659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-736
M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano
Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning, as not all samples have relevant information from which to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples by performing a fine-grained data selection. ATM performs masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. We conduct fine-tuning experiments on two well-benchmarked corpora: LibriSpeech (matching the pre-training data), and AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.
{"title":"Reducing Domain mismatch in Self-supervised speech pre-training","authors":"M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano","doi":"10.21437/interspeech.2022-736","DOIUrl":"https://doi.org/10.21437/interspeech.2022-736","url":null,"abstract":"Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames which are randomly masked within an utterance. While these methods improve performance of Automatic Speech Recognition (ASR) systems, they have one major limitation. They treat all unsupervised speech samples with equal weight, which hinders learning as not all samples have relevant information to learn meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach to focus on specific samples during MSM pre-training. ATM employs an external ASR model or scorer to weight unsupervised input samples by performing a fine-grained data selection. ATM performs masking over the highly confident input frames as chosen by the scorer. This allows the model to learn meaningful representations. We conduct fine-tuning experiments on two well-benchmarked cor-pora: LibriSpeech (matching the pre-training data) and, AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM on significantly improving the recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3028-3032"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48701115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-100
T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano
Some speakers have penetrating voices that pop out and can be heard clearly, even in loud noise or from a long distance. This study investigated the quality of such penetrating voices using factor analysis. Eleven participants scored how much the voices of 124 speakers popped out from babble noise. Taking this score as an index of penetration, ten high-scoring and ten low-scoring speakers were selected for a rating experiment with a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers using 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted on the data of 13 scales (i.e., excluding the penetrating scale from the 14 scales). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) correlated highly with the ratings of penetrating. These results suggest that penetrating voices have multi-dimensional voice quality and that the characteristics of penetrating voices are related to their powerful and metallic aspects.
{"title":"Perceptual Evaluation of Penetrating Voices through a Semantic Differential Method","authors":"T. Kitamura, Naoki Kunimoto, Hideki Kawahara, S. Amano","doi":"10.21437/interspeech.2022-100","DOIUrl":"https://doi.org/10.21437/interspeech.2022-100","url":null,"abstract":"Some speakers have penetrating voices that can be popped out and heard clearly, even in loud noise or from a long distance. This study investigated the voice quality of the penetrating voices using factor analysis. Eleven participants scored how the voices of 124 speakers popped out from the babble noise. By assuming the score as an index of penetration, ten each of high- and low-scored speakers were selected for a rating experiment with a semantic differential method. Forty undergraduate students rated a Japanese sentence produced by these speakers using 14 bipolar 7-point scales concerning voice quality. A factor analysis was conducted using the data of 13 scales (i.e., excluding one scale of penetrating from 14 scales). Three main factors were obtained: (1) powerful and metallic, (2) feminine, and (3) esthetic. The first factor (powerful and metallic) highly correlated with the ratings of penetrating. These results sug-gest that penetrating voices have multi-dimensional voice quality and that the characteristics of penetrating voice related to powerful and metallic aspects of voices.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3063-3067"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45037272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}