Pronunciation error detection model based on feature fusion
Pub Date: 2023-11-14 | DOI: 10.1016/j.specom.2023.103009 | Speech Communication, vol. 156, Article 103009
Cuicui Zhu, Aishan Wumaier, Dongping Wei, Zhixing Fan, Jianlei Yang, Heng Yu, Zaokere Kadeer, Liejun Wang
Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits further improvement of model performance. In this paper, we propose a joint training approach, Acoustic Error_Type Linguistic (AEL), that utilizes the error-type, acoustic, and linguistic information in the annotated data and fuses these features through multiple attention mechanisms. To address the uneven distribution of phonemes in MDD data, which can cause the model to make overconfident predictions when trained with the CTC loss, we propose a new loss function, Focal Attention Loss, which improves the model on metrics such as F1 score and accuracy. The proposed method was evaluated on the TIMIT and L2-Arctic public corpora. Under ideal conditions, it was compared with the baseline CNN-RNN-CTC model: the F1 score, diagnostic accuracy, and precision improved by 31.24%, 16.6%, and 17.35%, respectively. Compared to the baseline model, our model also reduced the phoneme error rate from 29.55% to 8.49% and showed significant improvements in other metrics. Furthermore, the experimental results demonstrate that when a model capable of accurately predicting pronunciation error types is available, our approach can achieve results close to those obtained under ideal conditions.
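The abstract attributes the overconfidence problem to the uneven phoneme distribution seen by the CTC loss. The paper's Focal Attention Loss is not specified here, so the sketch below only illustrates the related, standard focal-loss idea of down-weighting frames the model already classifies confidently; the phoneme inventory size and the toy tensors are assumptions for the demo.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Frame-level focal loss: down-weights frames the model already classifies
    confidently, so rare phonemes contribute more to the gradient.

    logits:  (num_frames, num_phonemes) raw scores
    targets: (num_frames,) integer phoneme labels
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # (T, C)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# Toy usage with a hypothetical 44-phoneme inventory and 120 frames.
logits = torch.randn(120, 44)
targets = torch.randint(0, 44, (120,))
print(focal_loss(logits, targets).item())
```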
{"title":"Pronunciation error detection model based on feature fusion","authors":"Cuicui Zhu , Aishan Wumaier , Dongping Wei , Zhixing Fan , Jianlei Yang , Heng Yu , Zaokere Kadeer , Liejun Wang","doi":"10.1016/j.specom.2023.103009","DOIUrl":"10.1016/j.specom.2023.103009","url":null,"abstract":"<div><p>Mispronunciation detection and diagnosis (MDD) is a specific speech recognition task that aims to recognize the phoneme sequence produced by a user, compare it with the standard phoneme sequence, and identify the type and location of any mispronunciations. However, the lack of large amounts of phoneme-level annotated data limits the performance improvement of the model. In this paper, we propose a joint training approach, Acoustic Error_Type Linguistic (AEL) that utilizes the error type information, acoustic information, and linguistic information from the annotated data, and achieves feature fusion through multiple attention mechanisms. To address the issue of uneven distribution of phonemes in the MDD data, which can cause the model to make overconfident predictions when using the CTC loss, we propose a new loss function, Focal Attention Loss, to improve the performance of the model, such as F1 score accuracy and other metrics. The proposed method in this paper was evaluated on the TIMIT and L2-Arctic public corpora. In ideal conditions, it was compared with the baseline model CNN-RNN-CTC. The F1 score, diagnostic accuracy, and precision were improved by 31.24%, 16.6%, and 17.35% respectively. Compared to the baseline model, our model reduced the phoneme error rate from 29.55% to 8.49% and showed significant improvements in other metrics. Furthermore, experimental results demonstrated that when we have a model capable of accurately obtaining pronunciation error types, our model can achieve results close to the ideal conditions.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103009"},"PeriodicalIF":3.2,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001437/pdfft?md5=a9f2df8a1ec5c7e52d687f13a603a861&pid=1-s2.0-S0167639323001437-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135764092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adapted Weighted Linear Prediction with Attenuated Main Excitation for formant frequency estimation in high-pitched singing
Pub Date: 2023-11-09 | DOI: 10.1016/j.specom.2023.103006 | Speech Communication, vol. 156, Article 103006
Eduardo Barrientos, Edson Cataldo
This paper aims to show how to improve the accuracy of formant frequency estimation in the singing voice of a lyric soprano. Conventional methods of formant frequency estimation may not accurately capture the formant frequencies of the singing voice, particularly in the highest pitch range of a lyric soprano, where the lowest formants are biased by the pitch harmonics. To address this issue, the study proposes adapting the Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME) method for formant frequency estimation. Specific methods for glottal closure instant estimation were required due to differences in glottal closure patterns between speech and singing. The study evaluates the accuracy of the proposed method by comparing its performance with the LPC method over different pitch series arranged in an ascending musical scale. The results indicate that the adapted WLP-AME method consistently outperformed the LPC method in estimating formant frequencies of vowels sung by a lyric soprano. In addition, by estimating the formant frequencies of a synthetic /i/ vowel sung by a soprano at the musical note E5, the study showed that the adapted WLP-AME method provided formant frequency values closer to the correct values than those estimated by the LPC method. In general, these results suggest parameter values of the AME function that optimize its performance, which can have applications in fields such as singing and medicine.
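For context on the baseline the study compares against, the sketch below shows conventional LPC formant estimation, where formant candidates are read off the angles of the LPC polynomial roots. It is not the adapted WLP-AME method; the LPC order, thresholds, and synthetic test frame are assumptions.

```python
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12, min_hz=90, max_bw_hz=400):
    """Standard LPC formant estimation: fit LPC coefficients, then take the
    angles of the complex roots as formant candidates, filtered by bandwidth."""
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                 # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)        # rad -> Hz
    bws = -(sr / np.pi) * np.log(np.abs(roots))       # pole radius -> bandwidth in Hz
    keep = (freqs > min_hz) & (bws < max_bw_hz)
    return np.sort(freqs[keep])

# Toy usage: a synthetic vowel-like frame with two resonances at 16 kHz.
sr = 16000
t = np.arange(1024) / sr
frame = (np.sin(2 * np.pi * 800 * t) + 0.6 * np.sin(2 * np.pi * 2700 * t)) * np.hamming(1024)
print(lpc_formants(frame, sr)[:4])                    # first few formant candidates (Hz)
```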
{"title":"Adapted Weighted Linear Prediction with Attenuated Main Excitation for formant frequency estimation in high-pitched singing","authors":"Eduardo Barrientos , Edson Cataldo","doi":"10.1016/j.specom.2023.103006","DOIUrl":"10.1016/j.specom.2023.103006","url":null,"abstract":"<div><p>This paper aims to show how to improve the accuracy of formant frequency estimation in the singing voice of a lyric soprano. Conventional methods of formant frequency estimation may not accurately capture the formant frequencies of the singing voice, particularly in the highest pitch range of a lyric soprano, where the lowest formants are biased by the pitch harmonics. To address this issue, the study proposes adapting the Weighted Linear Prediction with Attenuated Main Excitation (WLP-AME) method for formant frequency estimation. Specific methods for glottal closure instant estimation were required due to differences in glottal closure patterns between speech and singing. The study evaluates the accuracy of the proposed method by comparing its performance with the LPC method through different pitch series arranged in an ascending musical scale. The results indicated that the adapted WLP-AME method consistently outperformed the LPC method in estimating formant frequencies of vowels sung by a lyric soprano. In addition, by estimating the formant frequencies of a synthetic /i/ vowel sung by a soprano singer at the musical note E5, the study showed that the adapted WLP-AME method provided formant frequency values closer to the correct values than those estimated by the LPC method. In general, these results suggest parameter values of AME function that optimize its performance, which can have applications in fields such as singing and medicine.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103006"},"PeriodicalIF":3.2,"publicationDate":"2023-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001401/pdfft?md5=ae4d5be07478b88a7d7394a12ce3f36c&pid=1-s2.0-S0167639323001401-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135565162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions
Pub Date: 2023-11-04 | DOI: 10.1016/j.specom.2023.103004 | Speech Communication, vol. 156, Article 103004
Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari
We present the JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora either lack phrase diversity or focus on a small number of emotions, which makes it difficult to analyze the characteristics of Japanese NVs and to support downstream tasks such as emotion recognition. We first propose a corpus-design method with two phases: (1) collecting NV phrases through crowd-sourcing; (2) recording NVs by stimulating speakers with emotional scenarios. Using this method, we collect 420 audio clips from 4 speakers covering 6 emotions. Results of comprehensive objective and subjective experiments demonstrate that (1) the emotions of the collected NVs can be recognized with high accuracy by both human evaluators and statistical models, and (2) the collected NVs have high authenticity, comparable to previous corpora of English NVs. Additionally, we analyze the distributions of vowel types in the Japanese NVs and conduct a feature-importance analysis to identify acoustic features that discriminate between emotion categories. We release JNV publicly to advance further development in this field.
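The abstract reports that statistical models recognize the emotions of the collected clips with high accuracy but does not name the models, so the following is only a generic sketch of that kind of objective evaluation: utterance-level MFCC statistics fed to a linear SVM under cross-validation, with synthetic clips standing in for the corpus audio.

```python
import numpy as np
import librosa
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(y, sr, n_mfcc=20):
    """Utterance-level features for one NV clip: mean and std of its MFCCs."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Demo with synthetic clips standing in for the 420 corpus recordings:
# 6 emotion classes, 30 fake 0.8-second clips per class at 16 kHz.
sr, rng = 16000, np.random.default_rng(0)
X, y = [], []
for label in range(6):
    for _ in range(30):
        n = int(0.8 * sr)
        clip = rng.normal(scale=0.1, size=n) + np.sin(2 * np.pi * (120 + 40 * label) * np.arange(n) / sr)
        X.append(clip_features(clip, sr))
        y.append(label)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(clf, np.stack(X), np.array(y), cv=5).mean())  # recognition accuracy
```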
{"title":"JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions","authors":"Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari","doi":"10.1016/j.specom.2023.103004","DOIUrl":"10.1016/j.specom.2023.103004","url":null,"abstract":"<div><p>We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora either lack phrase diversity or focus on a small number of emotions, which makes it difficult to analyze the characteristics of Japanese NVs and support downstream tasks like emotion recognition. We first propose a corpus-design method that contains two phases: (1) collecting NVs phrases based on crowd-sourcing; (2) recording NVs by stimulating speakers with emotional scenarios. We then collect 420 audio clips from 4 speakers that cover 6 emotions based on the proposed method. Results of comprehensive objective and subjective experiments demonstrate that (1) the emotions of the collected NVs can be recognized with high accuracy by both human evaluators and statistical models; (2) the collected NVs have a high authenticity comparable to previous corpora of English NVs. Additionally, we analyze the distributions of vowel types in Japanese and conduct feature importance analysis to show discriminative acoustic features between emotion categories in Japanese NVs. We publicate JNV to advance further development in this field.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"156 ","pages":"Article 103004"},"PeriodicalIF":3.2,"publicationDate":"2023-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639323001383/pdfft?md5=a483e24acbbf292a674e285ddd58df8a&pid=1-s2.0-S0167639323001383-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135455635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Disordered speech recognition considering low resources and abnormal articulation
Pub Date: 2023-11-01 | DOI: 10.1016/j.specom.2023.103002 | Speech Communication, vol. 155, Article 103002
Yuqin Lin, Jianwu Dang, Longbiao Wang, Sheng Li, Chenchen Ding
The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with speech disorders. Those with disordered speech may truly need technological support, yet they currently gain little from it. The difficulties of disordered-speech ASR arise from the limited availability of data and the abnormal nature of the speech, e.g., unclear, unstable, and incorrect pronunciations. To realize ASR for disordered speech, this study addresses the problem in two respects: low resources and articulatory abnormality. To address low resources, this study proposes staged knowledge distillation (KD), which provides different references to the student models according to their mastery of knowledge, so as to avoid feature overfitting. To tackle the articulatory abnormalities in dysarthria, we propose an intended phonological perception method (IPPM) that applies the motor theory of speech perception to ASR, in which intended phonological features are estimated and provided to the recognizer. We then combine the staged KD and the IPPM to address both challenges of disordered ASR. The TORGO database and the UASPEECH corpus are two commonly used datasets of dysarthric speech, dysarthria being a main cause of speech disorders. Experiments on the two datasets validated the effectiveness of the proposed methods. Compared with the baseline, the proposed method achieves relative phoneme error rate reductions (PERRs) of 35.14%–38.12% for speakers with varying degrees of dysarthria on the TORGO database and relative PERRs of 8.17%–13.00% on the UASPEECH corpus. The experiments demonstrate that addressing disordered speech from both the low-resource and the speech-abnormality perspectives is effective, and the proposed methods significantly improve ASR performance for disordered speech.
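The staged KD is described only at a high level; as a point of reference, the sketch below shows the plain knowledge-distillation objective (softened teacher posteriors plus hard labels) that staged variants build on. The temperature, mixing weight, and phoneme inventory are assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Plain knowledge distillation: KL divergence to the teacher's softened
    posteriors, mixed with the usual cross-entropy on the hard labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard

# Toy usage: 8 frames, hypothetical 40-phoneme inventory.
student = torch.randn(8, 40, requires_grad=True)
teacher = torch.randn(8, 40)
labels = torch.randint(0, 40, (8,))
print(kd_loss(student, teacher, labels).item())
```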
{"title":"Disordered speech recognition considering low resources and abnormal articulation","authors":"Yuqin Lin , Jianwu Dang , Longbiao Wang , Sheng Li , Chenchen Ding","doi":"10.1016/j.specom.2023.103002","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103002","url":null,"abstract":"<div><p><span>The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with disorders. The speech disordered may truly need support from technology, while they actually gain little. The difficulties of disordered ASR arise from the limited availability of data and the abnormal nature of speech, </span><em>e.g</em><span><span>, unclear, unstable, and incorrect pronunciations. To realize the ASR of disordered speech, this study addresses the problems of disordered speech in two respects, low resources, and articulatory abnormality. In order to solve the problem of low resources, this study proposes staged knowledge distillation<span> (KD), which provides different references to the student models according to their mastery of knowledge, so as to avoid feature overfitting. To tackle the articulatory abnormalities in dysarthria, we propose an intended phonological perception method (IPPM) by applying the </span></span>motor theory of speech perception to ASR, in which pieces of intended phonological features are estimated and provided to ASR. And further, we solve the challenges of disordered ASR by combining the staged KD and the IPPM. TORGO database and UASEECH corpus are two commonly used datasets of dysarthria which is the main cause of speech disorders. Experiments on the two datasets validated the effectiveness of the proposed methods. Compared with the baseline, the proposed method achieves 35.14%</span><span><math><mo>∼</mo></math></span><span>38.12% relative phoneme error rate reductions (PERRs) for speakers with varying degrees of dysarthria on the TORGO database and relative 8.17%</span><span><math><mo>∼</mo></math></span>13.00% PERRs on the UASPEECH corpus. The experiments demonstrated that addressing disordered speech from both low resources and speech abnormality is an effective way to solve the problems, and the proposed methods significantly improved the performance of ASR for disordered speech.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103002"},"PeriodicalIF":3.2,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92005883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Arabic emotion recognition using deep learning
Pub Date: 2023-11-01 | DOI: 10.1016/j.specom.2023.103005 | Speech Communication, vol. 155, Article 103005
Noora Al Roken, Gerassimos Barlas
Emotion recognition has been an active research area for decades due to the complexity of the problem and its significance in human–computer interaction. Various methods have been employed to tackle it, leveraging different inputs such as speech, 2D and 3D images, audio signals, and text, all of which can convey emotional information. Recently, researchers have begun combining multiple modalities to enhance the accuracy of emotion classification, recognizing that different emotions may be better expressed through different input types. This paper introduces a novel Arabic audio-visual natural-emotion dataset, investigates two existing multimodal classifiers, and proposes a new classifier trained on our Arabic dataset. Our evaluation covers several aspects, including variations in visual dataset size, joint and disjoint training, single and multimodal networks, and consecutive versus overlapping segmentation. Through 5-fold cross-validation, the proposed classifier achieved an average F1-score of 0.912 and an accuracy of 0.913 for natural emotion recognition.
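The abstract does not detail the proposed network, so the sketch below only illustrates the generic late-fusion pattern for combining an audio branch and a visual branch into one emotion classifier; all layer sizes and feature dimensions are placeholders.

```python
import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    """Generic audio-visual late fusion: each modality is embedded separately,
    the embeddings are concatenated, and a shared head predicts the emotion."""
    def __init__(self, audio_dim=40, visual_dim=512, n_emotions=6):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
        self.visual_branch = nn.Sequential(nn.Linear(visual_dim, 128), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, n_emotions))

    def forward(self, audio_feats, visual_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.visual_branch(visual_feats)], dim=-1)
        return self.head(fused)

# Toy usage: a batch of 4 utterance-level feature vectors per modality.
model = LateFusionEmotionNet()
logits = model(torch.randn(4, 40), torch.randn(4, 512))
print(logits.shape)   # torch.Size([4, 6])
```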
{"title":"Multimodal Arabic emotion recognition using deep learning","authors":"Noora Al Roken, Gerassimos Barlas","doi":"10.1016/j.specom.2023.103005","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103005","url":null,"abstract":"<div><p>Emotion Recognition has been an active area for decades due to the complexity of the problem and its significance in human–computer interaction. Various methods have been employed to tackle this problem, leveraging different inputs such as speech, 2D and 3D images, audio signals, and text, all of which can convey emotional information. Recently, researchers have started combining multiple modalities to enhance the accuracy of emotion classification, recognizing that different emotions may be better expressed through different input types. This paper introduces a novel Arabic audio-visual natural-emotion dataset, investigates two existing multimodal classifiers, and proposes a new classifier trained on our Arabic dataset. Our evaluation encompasses different aspects, including variations in visual dataset sizes, joint and disjoint training, single and multimodal networks, as well as consecutive and overlapping segmentation. Through 5-fold cross-validation, our proposed classifier achieved exceptional results with an average F1-score of 0.912 and an accuracy of 0.913 for natural emotion recognition.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103005"},"PeriodicalIF":3.2,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92006241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-model self-regularization and fusion for domain adaptation of robust speaker verification
Pub Date: 2023-11-01 | DOI: 10.1016/j.specom.2023.103001 | Speech Communication, vol. 155, Article 103001
Yibo Duan, Yanhua Long, Jiaen Liang
Learning robust representations of speaker identity is a key challenge in speaker verification, as robust representations generalize well to many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, in which the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations. Then, we strengthen the dual-model by combining the self-supervised loss and the supervised loss in a time-dependent manner, thereby improving the model's overall generalization. Furthermore, to better utilize the complementary information in the dual-model's outputs, we explore various methods for similarity computation and score fusion. Our experiments, conducted on the publicly available VoxCeleb2 and VoxMovies datasets, demonstrate that the proposed dual-model regularization and fusion methods outperform the strong baseline with a relative 9.07%–11.6% EER reduction across various in-domain and cross-domain evaluation sets. Importantly, our approach is effective in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification.
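The similarity computation and score fusion are described only in general terms; the sketch below shows the standard back-end ingredients such a system would rely on: cosine scoring of embeddings, linear fusion of two sub-models' scores, and EER computed over a trial list. The embedding dimension, fusion weight, and trial data are assumptions.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12))

def fuse_scores(scores_m1, scores_m2, w=0.5):
    """Simple linear score-level fusion of two sub-models."""
    return w * np.asarray(scores_m1) + (1.0 - w) * np.asarray(scores_m2)

def equal_error_rate(scores, labels):
    """EER from trial scores and 0/1 target labels (brute-force threshold sweep)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    eer, best_gap = 0.5, np.inf
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)   # false acceptance rate
        frr = np.mean(scores[labels == 1] < t)    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy trial list: random 192-dim embeddings scored by two hypothetical sub-models.
rng = np.random.default_rng(0)
trials = [(rng.normal(size=192), rng.normal(size=192), int(rng.integers(0, 2))) for _ in range(200)]
s1 = [cosine_score(a, b) for a, b, _ in trials]
s2 = [cosine_score(a + 0.1 * rng.normal(size=192), b) for a, b, _ in trials]
labels = [t for _, _, t in trials]
print(equal_error_rate(fuse_scores(s1, s2), labels))
```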
{"title":"Dual-model self-regularization and fusion for domain adaptation of robust speaker verification","authors":"Yibo Duan , Yanhua Long , Jiaen Liang","doi":"10.1016/j.specom.2023.103001","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103001","url":null,"abstract":"<div><p>Learning robust representations of speaker identity is a key challenge in speaker verification, as it results in good generalization for many real-world speaker verification scenarios with domain or intra-speaker variations. In this study, we aim to improve the well-established ECAPA-TDNN framework to enhance its domain robustness for low-resource cross-domain speaker verification tasks. Specifically, a novel dual-model self-learning approach is first proposed to produce robust speaker identity embeddings, where the ECAPA-TDNN is extended into a dual-model structure and then trained and regularized using self-supervised learning between different intermediate acoustic representations; Then, we enhance the dual-model by combining self-supervised loss and supervised loss in a time-dependent manner, thus enhancing the model’s overall generalization capabilities. Furthermore, to better utilize the complementary information in the dual-model’s outputs, we explore various methods for similarity computation and score fusion. Our experiments, conducted on the publicly available <span>VoxCeleb2</span> and <span>VoxMovies</span><span><span> datasets, have demonstrated that our proposed dual-model regularization and fusion methods outperformed the strong baseline by a relative 9.07%–11.6% </span>EER reduction across various in-domain and cross-domain evaluation sets. Importantly, our approach exhibits effectiveness in both supervised and unsupervised scenarios for low-resource cross-domain speaker verification tasks.</span></p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103001"},"PeriodicalIF":3.2,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92066325","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coarse-to-fine speech separation method in the time-frequency domain
Pub Date: 2023-11-01 | DOI: 10.1016/j.specom.2023.103003 | Speech Communication, vol. 155, Article 103003
Xue Yang, Changchun Bao, Xianhong Chen
Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to time-domain methods, speech separation methods in the time-frequency (T-F) domain mainly concern structured T-F representations and have recently shown great potential. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain that involves two steps: (1) a rough separation conducted in the coarse phase and (2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps reduce the distortions introduced in the coarse phase. Moreover, the networks designed for the coarse and refining phases are trained jointly for better performance. Furthermore, by utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model achieves competitive performance on clean datasets with a small number of parameters. It also shows greater robustness and achieves state-of-the-art results on more realistic datasets.
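The coarse and refining phases are described conceptually; the sketch below illustrates the general coarse-to-fine pattern in the T-F domain: a coarse mask produces a rough estimate, and a second network predicts a residual T-F representation, guided by that estimate, which is added back before the inverse STFT. The tiny networks are placeholders, not the paper's RAPB-based model.

```python
import torch
import torch.nn as nn

class TinyMaskNet(nn.Module):
    """Placeholder estimator operating on magnitude spectrograms (batch, frames, bins)."""
    def __init__(self, in_bins, out_bins):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_bins, 256), nn.ReLU(), nn.Linear(256, out_bins))

    def forward(self, x):
        return self.net(x)

def coarse_to_fine_separate(mixture, coarse_net, refine_net, n_fft=512, hop=128):
    """Coarse phase: mask the mixture magnitude for a rough estimate. Refining
    phase: predict a residual T-F representation guided by that estimate and the
    mixture, add it back, then invert with the mixture phase."""
    win = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop, window=win, return_complex=True)   # (B, F, T)
    mag = spec.abs().transpose(1, 2)                                          # (B, T, F)

    coarse = torch.sigmoid(coarse_net(mag)) * mag                 # rough separation
    residual = refine_net(torch.cat([coarse, mag], dim=-1))       # guided refinement
    refined = torch.relu(coarse + residual)

    est_spec = refined.transpose(1, 2) * torch.exp(1j * torch.angle(spec))    # keep mixture phase
    return torch.istft(est_spec, n_fft, hop, window=win, length=mixture.shape[-1])

# Toy usage on a 1-second mixture at 16 kHz (one speaker's estimate).
n_bins = 512 // 2 + 1
coarse_net, refine_net = TinyMaskNet(n_bins, n_bins), TinyMaskNet(2 * n_bins, n_bins)
mixture = torch.randn(1, 16000)
print(coarse_to_fine_separate(mixture, coarse_net, refine_net).shape)  # torch.Size([1, 16000])
```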
{"title":"Coarse-to-fine speech separation method in the time-frequency domain","authors":"Xue Yang, Changchun Bao, Xianhong Chen","doi":"10.1016/j.specom.2023.103003","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103003","url":null,"abstract":"<div><p>Although time-domain speech separation methods have exhibited the outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in the reverberant scenarios. Compared to the time-domain methods, the speech separation methods in time-frequency (T-F) domain mainly concern the structured T-F representations and have shown a great potential recently. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps to reduce the distortions caused in the coarse phase. Besides, the specially designed networks used for the coarse and refining phases are jointly trained for superior performance. Furthermore, utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model demonstrates competitive performance on clean datasets with a small number of parameters. Additionally, the proposed method shows more robustness and achieves state-of-the-art results on more realistic datasets.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103003"},"PeriodicalIF":3.2,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92140662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts
Pub Date: 2023-10-12 | DOI: 10.1016/j.specom.2023.103000 | Speech Communication, vol. 155, Article 103000
Puyang Geng, Ningxue Fan, Rong Ling, Hong Guo, Qimeng Lu, Xingwen Chen
Evidence from previous neurological studies has revealed that drugs can cause severe damage to human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: reporting positive/negative responses when hearing white noise) and negative reinforcement. Because of these emotion-processing disorders, drug addicts may experience difficulties in emotion recognition, which is essential for interpersonal communication and a healthy life experience, as well as speech illusions. However, previous research has yielded divergent results on whether drug addicts are more attracted to negative or positive stimuli, and little attention has been paid to the speech illusions they experience. Therefore, the current study investigated the effect of drugs on patterns of emotion recognition through two basic channels, auditory (speech) and visual (facial expression), as well as the speech illusions of drug addicts. A perceptual experiment was conducted in which 52 stimuli of four emotions (happy, angry, sad, and neutral) in three modalities (auditory, visual, and auditory + visual [congruent and incongruent]) were presented to address Question 1, concerning the multi-modal emotional speech perception of drug addicts. Additionally, 26 stimuli of white noise and speech of three emotions in two noise conditions were presented to investigate Question 2, concerning the speech illusion of drug addicts. Thirty-five male drug addicts (25 heroin addicts and 10 ketamine addicts) and thirty-five male healthy controls were recruited for the perception experiment. The results, with heroin and ketamine addicts as examples, revealed that drug addicts exhibited lower accuracy in multi-modal emotional speech perception and relied more on visual cues for emotion recognition, especially when auditory and visual inputs were incongruent. Furthermore, both heroin and ketamine addicts showed a higher incidence of emotional responses when exposed only to white noise, suggesting the presence of psychotic-like symptoms (i.e., speech illusion) in drug addicts. Our results preliminarily indicate a disorder or deficit in multi-modal emotional speech processing among drug addicts, and the use of visual cues (e.g., facial expressions) may be recommended to improve their interpretation of emotional expressions. Moreover, the speech illusions experienced by drug addicts warrant greater attention and awareness.
This paper not only fills the research gap in understanding multi-modal emotion processing and speech illusion in drug addicts but also contributes to a deeper understanding of the effects of drugs on human behavior and provides insights for the theoretical foundations of detoxification and speech rehabilitation for drug addicts.
{"title":"The Role of Auditory and Visual Cues in the Perception of Mandarin Emotional Speech in Male Drug Addicts","authors":"Puyang Geng , Ningxue Fan , Rong Ling , Hong Guo , Qimeng Lu , Xingwen Chen","doi":"10.1016/j.specom.2023.103000","DOIUrl":"https://doi.org/10.1016/j.specom.2023.103000","url":null,"abstract":"<div><p>Evidence from previous neurological studies has revealed that drugs can cause severe damage to the human brain structure, leading to significant cognitive disorders in emotion processing, such as psychotic-like symptoms (e.g., speech illusion: reporting positive/negative responses when hearing white noise) and negative reinforcement. Due to these emotion processing disorders, drug addicts may experience difficulties in emotion recognition and speech illusion, which are essential for interpersonal communication and a healthy life experience. However, previous research has yielded divergent results regarding whether drug addicts are more attracted to negative stimuli or positive stimuli. Additionally, little attention has been paid to the speech illusion experienced by drug addicts. Therefore, the current study aimed to investigate the effect of drugs on patterns of emotion recognition through two basic channels: auditory (speech) and visual (facial expression), as well as the speech illusions of drug addicts. The current study conducted a perceptual experiment in which 52 stimuli of four emotions (happy, angry, sad, and neutral) in three modalities (auditory, visual, auditory + visual [congruent & incongruent]) were presented to address Question 1 regarding the multi-modal emotional speech perception of drug addicts. Additionally, 26 stimuli of white noise and speech of three emotions in two noise conditions were presented to investigate Question 2 concerning the speech illusion of drug addicts. A total of thirty-five male drug addicts (25 heroin addicts and 10 ketamine addicts) and thirty-five male healthy controls were recruited for the perception experiment. The results, with heroin and ketamine addicts as examples, revealed that drug addicts exhibited lower accuracies in multi-modal emotional speech perception and relied more on visual cues for emotion recognition, especially when auditory and visual inputs were incongruent. Furthermore, both heroin and ketamine addicts showed a higher incidence of emotional responses when only exposed to white noise, suggesting the presence of psychotic-like symptoms (i.e., speech illusion) in drug addicts. Our results preliminarily indicate a disorder or deficit in multi-modal emotional speech processing among drug addicts, and the use of visual cues (e.g., facial expressions) may be recommended to improve their interpretation of emotional expressions. Moreover, the speech illusions experienced by drug addicts warrant greater attention and awareness. 
This paper not only fills the research gap in understanding multi-modal emotion processing and speech illusion in drug addicts but also contributes to a deeper understanding of the effects of drugs on human behavior and provides insights for the theoretical foundations of detoxification and speech rehabilitation for drug addicts.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103000"},"PeriodicalIF":3.2,"publicationDate":"2023-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49701210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Classification of functional dysphonia using the tunable Q wavelet transform
Pub Date: 2023-10-06 | DOI: 10.1016/j.specom.2023.102989 | Speech Communication, vol. 155, Article 102989
Kiran Reddy Mittapalle, Madhu Keerthana Yagnavajjula, Paavo Alku
Functional dysphonia (FD) refers to an abnormality in voice quality in the absence of an identifiable lesion. In this paper, we propose an approach based on the tunable Q wavelet transform (TQWT) to automatically distinguish two types of FD (hyperfunctional dysphonia and hypofunctional dysphonia) from healthy voices using the acoustic voice signal. Using the TQWT, voice signals were decomposed into sub-bands, and the entropy values extracted from the sub-bands were used as features for the studied 3-class classification problem. In addition, Mel-frequency cepstral coefficient (MFCC) and glottal features were extracted from the acoustic voice signal and the estimated glottal source signal, respectively. A convolutional neural network (CNN) classifier was trained separately for the TQWT, MFCC, and glottal features. Experiments were conducted using voice signals of 57 healthy speakers and 113 FD patients (72 with hyperfunctional dysphonia and 41 with hypofunctional dysphonia) taken from the VOICED database. These experiments revealed that the TQWT features yielded absolute improvements of 5.5% and 4.5% over the baseline MFCC features and glottal features, respectively. Furthermore, the highest classification accuracy (67.91%) was obtained using the combination of the TQWT and glottal features, which indicates the complementary nature of these features.
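The TQWT is not bundled with the usual Python wavelet libraries, so as a stand-in the sketch below uses a PyWavelets wavelet-packet decomposition to illustrate the sub-band entropy features described in the abstract; the wavelet choice, decomposition depth, and test signal are assumptions.

```python
import numpy as np
import pywt

def subband_entropies(signal, wavelet="db4", level=4):
    """Decompose a voice signal into sub-bands and return the Shannon entropy of
    each sub-band's normalized energy distribution. A wavelet packet transform
    stands in here for the TQWT as the sub-band front end."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    entropies = []
    for node in wp.get_level(level, order="freq"):
        p = np.asarray(node.data, dtype=float) ** 2
        p = p / (p.sum() + 1e-12)                      # energy distribution within the band
        entropies.append(float(-(p * np.log2(p + 1e-12)).sum()))
    return np.array(entropies)                         # one feature per sub-band

# Toy usage: 1 s of a synthetic sustained-vowel-like signal at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
voice = np.sin(2 * np.pi * 150 * t) + 0.3 * np.random.randn(t.size)
print(subband_entropies(voice).shape)                  # (16,) sub-band entropy features
```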
{"title":"Classification of functional dysphonia using the tunable Q wavelet transform","authors":"Kiran Reddy Mittapalle , Madhu Keerthana Yagnavajjula , Paavo Alku","doi":"10.1016/j.specom.2023.102989","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102989","url":null,"abstract":"<div><p>Functional dysphonia (FD) refers to an abnormality in voice quality in the absence of an identifiable lesion. In this paper, we propose an approach based on the tunable Q wavelet transform (TQWT) to automatically classify two types of FD (hyperfunctional dysphonia and hypofunctional dysphonia) from a healthy voice using the acoustic voice signal. Using TQWT, voice signals were decomposed into sub-bands and the entropy values extracted from the sub-bands were utilized as features for the studied 3-class classification problem. In addition, the Mel-frequency cepstral coefficient (MFCC) and glottal features were extracted from the acoustic voice signal and the estimated glottal source signal, respectively. A convolutional neural network (CNN) classifier was trained separately for the TQWT, MFCC and glottal features. Experiments were conducted using voice signals of 57 healthy speakers and 113 FD patients (72 with hyperfunctional dysphonia and 41 with hypofunctional dysphonia) taken from the VOICED database. These experiments revealed that the TQWT features yielded an absolute improvement of 5.5% and 4.5% compared to the baseline MFCC features and glottal features, respectively. Furthermore, the highest classification accuracy (67.91%) was obtained using the combination of the TQWT and glottal features, which indicates the complementary nature of these features.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 102989"},"PeriodicalIF":3.2,"publicationDate":"2023-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49702808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph attention-based deep embedded clustering for speaker diarization
Pub Date: 2023-10-05 | DOI: 10.1016/j.specom.2023.102991 | Speech Communication, vol. 155, Article 102991
Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang
Deep speaker embedding extraction models have recently served as the cornerstone of modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships and, moreover, are not tailored specifically to the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using graph attention-based deep embedded clustering (GADEC) to address these issues. First, considering the temporal nature of speech, when the signal is segmented into small segments, the speech in the current segment and its neighboring segments is likely to belong to the same speaker. This suggests that embeddings extracted from neighboring segments can help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and leverage the local structural information among their embeddings, we construct a graph over the pre-extracted speaker embeddings in a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module to integrate information from neighboring nodes (i.e., neighboring segments) in the graph and learn latent speaker embeddings. Moreover, we jointly optimize the latent speaker embeddings and the clustering results within a unified framework, leading to more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that the proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems in terms of diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.
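The GAE and the joint clustering objective are not given in the abstract; the sketch below covers only the first step it mentions, building a graph over pre-extracted segment embeddings, here as a k-nearest-neighbour cosine-affinity adjacency matrix of the kind a graph attentional encoder could aggregate over. The number of neighbours and the embedding dimension are assumptions.

```python
import numpy as np

def knn_cosine_graph(embeddings, k=10):
    """Adjacency matrix over speech-segment embeddings: connect each segment to
    its k most cosine-similar neighbours, then symmetrize."""
    x = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    sim = x @ x.T                              # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)             # no self loops
    adj = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nbrs = np.argsort(sim[i])[-k:]         # indices of the k closest segments
        adj[i, nbrs] = 1.0
    return np.maximum(adj, adj.T)              # undirected graph

# Toy usage: 200 segments with hypothetical 256-dim speaker embeddings.
emb = np.random.randn(200, 256)
A = knn_cosine_graph(emb, k=10)
print(A.shape, int(A.sum()))                   # (200, 200) and twice the edge count
```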
{"title":"Graph attention-based deep embedded clustering for speaker diarization","authors":"Yi Wei, Haiyan Guo, Zirui Ge, Zhen Yang","doi":"10.1016/j.specom.2023.102991","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102991","url":null,"abstract":"<div><p>Deep speaker embedding extraction models have recently served as the cornerstone for modular speaker diarization systems. However, in current modular systems, the extracted speaker embeddings (namely, speaker features) do not effectively leverage their intrinsic relationships, and moreover, are not tailored specifically for the clustering task. In this paper, inspired by deep embedded clustering (DEC), we propose a speaker diarization method using the graph attention-based deep embedded clustering (GADEC) to address the aforementioned issues. First, considering the temporal nature of speech signals, when segmenting the speech signal into small segments, the speech in the current segment and its neighboring segments may likely belong to the same speaker. This suggests that embeddings extracted from neighboring segments could help generate a more informative speaker representation for the current segment. To better describe the complex relationships between segments and leverage the local structural information among their embeddings, we construct a graph for the pre-extracted speaker embeddings in a continuous audio signal. On this basis, we introduce a graph attentional encoder (GAE) module to integrate information from neighboring nodes (i.e., neighboring segments) in the graph and learn latent speaker embeddings. Moreover, we further jointly optimize both the latent speaker embeddings and the clustering results within a unified framework, leading to more discriminative speaker embeddings for the clustering task. Experimental results demonstrate that our proposed GADEC-based speaker diarization system significantly outperforms the baseline systems and several other recent speaker diarization systems concerning diarization error rate (DER) on the NIST SRE 2000 CALLHOME, AMI, and VoxConverse datasets.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 102991"},"PeriodicalIF":3.2,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49702803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}