
Speech Communication: Latest Publications

Human and automatic voice comparison with regionally variable speech samples
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-12 | DOI: 10.1016/j.specom.2025.103253
Vincent Hughes , Carmen Llamas , Thomas Kettig
In this paper, we compare and combine human and automatic voice comparison results based on short, regionally variable speech samples. Likelihood ratio-like scores were extracted for 120 pairs of same- (45) and different-speaker (75) samples from a total of 896 British English listeners. The samples contained the voices of speakers from Newcastle and Middlesbrough (in North-East England), as well as speakers of Standard Southern British English (modern RP). In addition to within-accent comparisons, the experiment included between-accent, different-speaker comparisons for Middlesbrough and Newcastle, which are perceptually and regionally proximate accents. Scores were also computed using an x-vector PLDA automatic speaker recognition (ASR) system. The ASR system (EER = 10.88%, Cllr = 0.48) outperformed the human listeners (EER = 23.55%, Cllr = 0.75) overall, and no improvement was found in the ASR output when fused with the listener scores. There was, unsurprisingly, considerable between-listener variability, with individual error rates varying from 0% to 100%. Performance was also variable according to the regional accent of the speakers. Notably, the ASR system performed worst with Newcastle samples, while humans performed best with the Newcastle samples. Human listeners were also more sensitive to high-salience between-accent comparisons, leading to almost categorical different-speaker conclusions, compared with the ASR system, whose performance with these samples was similar to within-accent comparisons.
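For readers less familiar with the two metrics quoted in the abstract, the sketch below shows how EER and Cllr are conventionally computed from same-speaker and different-speaker score sets. It is illustrative code under the standard definitions, not the authors' implementation; all function and variable names are ours.

```python
import numpy as np

def eer(ss_scores, ds_scores):
    """Equal error rate: operating point where miss rate == false-alarm rate."""
    thresholds = np.sort(np.concatenate([ss_scores, ds_scores]))
    miss = np.array([np.mean(ss_scores < t) for t in thresholds])
    fa = np.array([np.mean(ds_scores >= t) for t in thresholds])
    i = np.argmin(np.abs(miss - fa))          # closest crossing point
    return (miss[i] + fa[i]) / 2

def cllr(ss_llrs, ds_llrs):
    """Log-likelihood-ratio cost (Brummer & du Preez, 2006); natural-log LRs."""
    c_ss = np.mean(np.log2(1.0 + np.exp(-ss_llrs)))   # same-speaker term
    c_ds = np.mean(np.log2(1.0 + np.exp(ds_llrs)))    # different-speaker term
    return 0.5 * (c_ss + c_ds)
```

A system that always outputs a log-likelihood ratio of zero gives Cllr = 1; values below 1 indicate that the scores carry useful, well-calibrated evidence.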
Cited by: 0
Dual-path and interactive UNET for speech enhancement with multi-order fractional features
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-09 | DOI: 10.1016/j.specom.2025.103248
Liyun Xu, Tong Zhang
Preprocessing techniques for denoising and enhancement play a crucial role in significantly improving speech recognition performance. In neural-network-based speech enhancement methods, input features provide the network with essential information to learn from the data. In this study, we introduced multi-order fractional features into a speech enhancement network. These features can represent fine details and offer the advantages of multidomain joint analysis, thereby expanding the input information available to the network. Subsequently, a new dual-path UNET network was designed, in which pure speech and noise are estimated separately. By leveraging the complementarity of the two-branch target estimation, we introduced a fractional information interaction module between the two paths for parameter optimization. Finally, the association module combined the two output information streams to enhance the speech performance. The results from ablation experiments demonstrated the effectiveness of both the multi-order fractional features and the improved dual-path network. Comparison experiments revealed that the proposed algorithm significantly improved speech quality and intelligibility.
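As a rough illustration of the two-branch idea, the toy sketch below has a shared encoder feeding separate speech and noise decoders. It is a minimal sketch under our own assumptions, not the paper's architecture, which uses UNET encoder-decoders plus a fractional information interaction module between the paths.

```python
import torch
import torch.nn as nn

class DualPathSketch(nn.Module):
    """Toy dual-path estimator: one encoder, separate speech/noise decoders."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.dec_speech = nn.Conv2d(ch, 1, 3, padding=1)   # speech branch
        self.dec_noise = nn.Conv2d(ch, 1, 3, padding=1)    # noise branch

    def forward(self, spec):          # spec: (batch, 1, freq, time) magnitudes
        h = self.enc(spec)
        return self.dec_speech(h), self.dec_noise(h)
```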
Cited by: 0
Gradient or categorical? Towards a phonological typology of illusory vowels in Mandarin
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-09 | DOI: 10.1016/j.specom.2025.103252
Yizhou Wang , Rikke Bundgaard-Nielsen , Brett Baker , Olga Maxwell
This paper argues that illusory vowel perception, i.e., the perception of non-existent vowels between two consonants by nonnative listeners, is gradient rather than categorical in Mandarin Chinese, and that the strength of illusion is predictable from the mismatches between the nonnative speech input and the listeners’ native phonological grammar. We examined five phonological scenarios where illusory vowels with different qualities can be perceived, and different illusion levels can be predicted by factors including syllable phonotactic constraints, vowel minimality, and the place of articulation consistency between the illusory vowel and its preceding consonant. The predictions were examined in an AXB discrimination task (Experiment 1) and an identification task (Experiment 2), which confirmed the predictions overall, while some paradigmatic differences were also observed. By comparing the current results and previous reports, we argue that a gradient rather than categorical account of illusory vowel is more suitable for explaining and predicting nonnative cluster perception. Specifically, the place of articulation feature of the preceding consonant is important for predicting contextual illusory vowels, which reflects nonnative listeners’ interpretation of perceived gestural score across multiple segments, supporting a direct realist view of speech perception.
Cited by: 0
"I said simPle, not symBol!"Is clear speech tailored to the listener's feedback “我说的是简单,不是符号!”清晰的演讲是根据听众的反馈量身定制的吗
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-08 | DOI: 10.1016/j.specom.2025.103251
Maëva Garnier, Marion Dohen
This study investigates variation in the production of French stop consonants in two situations of speech clarity enhancement – when addressing an interlocutor experiencing listening difficulties in a disrupted communication environment (clear speech), and when correcting specific listener misunderstandings (corrected speech). Of interest is whether speech modifications are similar in both situations, or if adjustments during correction specifically address listeners' errors.
Twelve native French speakers interacted with the experimenter in a gaming task, first in conversational speech ('Conv') under normal conditions, then in clear speech prompted by apparent listening difficulties from the interlocutor ('Clear'). In the disrupted situation, some words were misunderstood by the listener (errors in either voicing or articulation place of stop consonants), resulting in additional corrections by the speaker ('Clear+Corr').
Significant changes in the timing and spectral cues of stop consonants (closure duration, Voice Onset Time, burst spectrum) were observed in both clear and corrected speech, improving distinctions between voiced and voiceless stops and articulation places. Additionally, clear speech prompted by listening difficulties showed global modifications (overall increased intensity, longer syllable duration, hyper-articulated vowels). Conversely, corrected speech focused solely on segmental modifications, with burst spectrum variations significantly influenced by listener feedback, emphasizing the distinction between the speaker's intended segment and the misunderstood one.
The results suggest that both situations of speech clarity enhancement involve different strategies, with speech correction relying on real-time perception of the listener's feedback to specifically address perceptual errors.
Cited by: 0
Speakers’ communicative intentions lead to acoustic adjustments in native and non-native directed speech
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-08 | DOI: 10.1016/j.specom.2025.103250
Giorgio Piazza , Marina Kalashnikova , Laura Fernández-Merino , Clara D. Martin
Speakers adapt acoustic features to factors such as listeners’ linguistic profiles. For instance, addressing a non-native listener elicits Non-Native Directed Speech (NNDS). However, whether these speech adaptations vary depending on the speakers’ didactic goals, in interaction with the listeners' profiles (i.e., native vs. non-native), remains unknown.
We recorded native Spanish speakers naming novel objects to aid their listeners’ performance in comprehension, pronunciation, and writing tasks. Each speaker interacted with a native (Native Directed Speech, NDS) and a non-native (NNDS) Spanish listener. We extracted measures of vowel hyperarticulation, duration, intensity, speech rate, and F0 to assess listener- and task-specific speech adjustments.
Our results showed that speakers hyperarticulated vowels to a greater extent in the writing condition compared to the comprehension condition, and during NNDS compared to NDS. Listener profile and task also impacted speakers’ F0 height, intensity, and vowel duration production. Therefore, speakers adjust acoustic features in their speech to achieve their didactic goals and accommodate their listener's profile. Also, speakers’ overall greater adaptation in NNDS than in NDS suggests that NNDS serves a didactic purpose.
Cited by: 0
Early identification of bulbar motor dysfunction in ALS: An approach using AFM signal decomposition
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-05-06 | DOI: 10.1016/j.specom.2025.103246
Shaik Mulla Shabber , Mohan Bansal
Amyotrophic lateral sclerosis (ALS) is an aggressive neurodegenerative disorder that impacts the nerve cells in the brain and spinal cord that control muscle movements. Early ALS symptoms include speech and swallowing difficulties, and sadly, the disease is incurable and fatal in some instances. This study aims to construct a predictive model for identifying speech dysarthria and bulbar motor dysfunction in ALS patients, using speech signals as a non-invasive biomarker. Utilizing an amplitude and frequency modulated (AFM) signal decomposition model, the study identifies distinctive characteristics crucial for monitoring and diagnosing ALS. The study focuses on classifying ALS patients and healthy controls (HC) through a machine-learning approach, employing the TORGO database for analysis. Recognizing speech signals as potential biomarkers for ALS detection, the study aims to achieve early identification without invasive measures. An ensemble learning classifier attains a remarkable 97% accuracy in distinguishing between ALS and HC based on features extracted using the AFM signal model.
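For context, a generic K-component AFM model writes the speech signal as a sum of amplitude- and frequency-modulated components; the paper's exact parameterization may differ, so take this as a textbook form rather than the authors' equation:

```latex
x(t) \approx \sum_{k=1}^{K} a_k(t)\,
\cos\!\left( 2\pi \int_{0}^{t} f_k(\tau)\,\mathrm{d}\tau + \phi_k \right)
```

The component envelopes a_k(t) and instantaneous frequencies f_k(t) are the kinds of quantities from which classification features can then be derived.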
Cited by: 0
An update rule for multiple source variances estimation using microphone arrays
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-04-30 | DOI: 10.1016/j.specom.2025.103245
Fan Zhang , Chao Pan , Jingdong Chen , Jacob Benesty
This paper addresses the problem of time-varying variance estimation in scenarios with multiple speech sources and background noise using a microphone array, which is an important issue in speech enhancement. Under the maximum likelihood (ML) principle, the variance estimates admit no explicit formula in the general case, and all the variances must be updated iteratively. Inspired by the fixed-point iteration (FPI) method, we derive an update rule for variance estimation by introducing a dummy term and exploiting the ML condition. Insights into the update rule are investigated, and its relationship with the variance estimates under the least-squares (LS) principle is presented. Finally, by simulations, we show that the resulting variance update rule is very efficient and effective, requiring only a few iterations to converge, with an estimation error very close to the Cramér–Rao Bound (CRB).
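The abstract does not reproduce the derived update rule itself, so the sketch below shows only the generic fixed-point-iteration scaffold that such a rule plugs into; update_fn is a hypothetical stand-in for the paper's ML-based update.

```python
import numpy as np

def fixed_point_iterate(update_fn, v0, tol=1e-8, max_iter=100):
    """Iterate v <- g(v) until the variance estimates stop changing."""
    v = np.asarray(v0, dtype=float)
    for _ in range(max_iter):
        v_next = update_fn(v)
        if np.max(np.abs(v_next - v)) < tol:   # convergence check
            return v_next
        v = v_next
    return v                                   # best estimate after max_iter
```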
Cited by: 0
Deep learning based stage-wise two-dimensional speaker localization with large ad-hoc microphone arrays
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-04-29 | DOI: 10.1016/j.specom.2025.103247
Shupei Liu , Linfeng Feng , Yijun Gong , Chengdong Liang , Chen Zhang , Xiao-Lei Zhang , Xuelong Li
While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method leveraging ad-hoc microphone arrays. Specifically, each ad-hoc array comprises randomly distributed microphone nodes, each of which is equipped with a traditional array. Our approach first employs convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to get 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that strategically filters the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods. The proposed node selection further refines performance. The real-world dataset used in the experiments, named Libri-adhoc-nodes10, is a newly recorded dataset described for the first time in this paper and is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
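The triangulation step can be illustrated with the standard least-squares intersection of 2-D bearing lines, sketched below under our own assumptions; the paper's pipeline additionally applies clustering and node selection on top of such estimates.

```python
import numpy as np

def triangulate_2d(node_xy, doa_rad):
    """Least-squares intersection of bearing lines from ad-hoc array nodes.

    node_xy: (N, 2) known node positions; doa_rad: (N,) estimated DOAs in
    radians. Assumes the bearings are not all parallel (A would be singular).
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, theta in zip(np.asarray(node_xy, dtype=float), doa_rad):
        u = np.array([np.cos(theta), np.sin(theta)])  # unit bearing vector
        P = np.eye(2) - np.outer(u, u)                # projector off the line
        A += P
        b += P @ p
    return np.linalg.solve(A, b)                      # 2-D speaker location
```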
Cited by: 0
Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-04-23 | DOI: 10.1016/j.specom.2025.103242
Xiaoyu Tang , Jiazheng Huang , Yixin Lin , Ting Dang , Jintao Cheng
Speech Emotion Recognition (SER) is crucial in human–machine interactions. Previous approaches have predominantly focused on local spatial or channel information and neglected the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time–frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that our approach significantly improves performance over state-of-the-art methods. Code is available at https://github.com/SCNU-RISLAB/CNN-Transforemr-and-Multidimensional-Attention-Mechanism.
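One named ingredient, a large-kernel depthwise separable convolution, can be sketched as below; the channel count and kernel size are our assumptions for illustration, not the paper's settings.

```python
import torch.nn as nn

class LargeKernelDSConv(nn.Module):
    """Depthwise conv with a large kernel, then a pointwise channel mix."""
    def __init__(self, channels=64, kernel_size=31):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):               # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))
```

The depthwise stage captures long temporal context per channel cheaply, while the 1x1 pointwise stage mixes information across channels.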
Cited by: 0
Vibravox: A dataset of French speech captured with body-conduction audio sensors
IF 2.4 | CAS Tier 3, Computer Science | Q2, ACOUSTICS | Pub Date: 2025-04-19 | DOI: 10.1016/j.specom.2025.103238
Julien Hauret , Malo Olivier , Thomas Joubaud , Christophe Langrenne , Sarah Poirée , Véronique Zimpfer , Éric Bavu
Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 h of speech samples and physiological sounds per sensor, recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.
Cited by: 0