
Latest Publications in Speech Communication

Multimodal speech emotion recognition via modality constraint with hierarchical bottleneck feature fusion
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-07-10 | DOI: 10.1016/j.specom.2025.103278
Ying Wang, Jianjun Lei, Xiangwei Zhu, Tao Zhang
Multimodal models can combine different channels of information simultaneously to improve modeling capability. Many recent studies focus on overcoming the challenges that inter-modal conflicts and incomplete intra-modal learning pose for multimodal architectures. In this paper, we propose a scalable multimodal speech emotion recognition (SER) framework incorporating a hierarchical bottleneck feature (HBF) fusion approach. We design an intra-modal and inter-modal contrastive learning mechanism that enables self-supervised calibration of both modality-specific and cross-modal feature distributions, achieving adaptive feature fusion and alignment while significantly reducing reliance on rigid feature alignment constraints. In addition, by restricting the learning path of the modality encoders, we design a modality representation constraint (MRC) method to mitigate conflicts between modalities. We also present a modality bargaining (MB) strategy that promotes learning within each modality through mutual bargaining and balancing, letting the modalities alternate roles during learning and thereby avoiding suboptimal modal representations. These aggressive yet disciplined training strategies enable our architecture to perform well on multimodal emotion datasets such as CREMA-D, IEMOCAP, and MELD. Finally, extensive experiments demonstrate the effectiveness of the proposed architecture across various modal encoders and modality combination methods.
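The abstract does not include an implementation, but one common way to realize bottleneck fusion (modalities exchanging information only through a small set of shared fusion tokens, layer by layer) can be sketched as follows. All dimensions, layer counts, and class names are illustrative assumptions, not the authors' code:

```python
# Minimal sketch of bottleneck-token fusion: audio and text streams exchange
# information only through shared "bottleneck" tokens. Sizes are illustrative.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, text, bottleneck):
        # Each modality attends over itself plus the shared bottleneck tokens.
        a_in = torch.cat([audio, bottleneck], dim=1)
        t_in = torch.cat([text, bottleneck], dim=1)
        a_out, _ = self.audio_attn(a_in, a_in, a_in)
        t_out, _ = self.text_attn(t_in, t_in, t_in)
        n_b = bottleneck.size(1)
        audio, b_a = a_out[:, :-n_b], a_out[:, -n_b:]
        text, b_t = t_out[:, :-n_b], t_out[:, -n_b:]
        # Average the two modality-specific updates so the bottleneck
        # carries information across modalities.
        return audio, text, (b_a + b_t) / 2

class HBFFusion(nn.Module):
    def __init__(self, dim=256, n_layers=4, n_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim))
        self.layers = nn.ModuleList(
            BottleneckFusionLayer(dim) for _ in range(n_layers))

    def forward(self, audio, text):   # (batch, frames, dim), (batch, tokens, dim)
        b = self.bottleneck.expand(audio.size(0), -1, -1)
        for layer in self.layers:
            audio, text, b = layer(audio, text, b)
        return b.mean(dim=1)          # fused vector for emotion classification
```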
Citations: 0
Non-native (Czech and Russian L1) auditor assessments of some English suprasegmental features: Prominence and pitch accents
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-07-10 | DOI: 10.1016/j.specom.2025.103281
Alexey Tymbay
This study reports on a comparative perceptual experiment investigating the ability of Russian and Czech advanced learners of English to identify prominence in spoken English. Two groups of non-native annotators completed prominence marking tasks on English monologues, both before and after undergoing a 12-week phonological training program. The study employed three annotation techniques: Rapid Prosody Transcription (RPT), traditional (British), and ToBI. While the RPT annotations produced by the focus groups did not reach statistical equivalence with those of native English speakers, the data indicate a significant improvement in the perception and categorization of prominence following phonological training. A recurrent difficulty observed in both groups was the accurate identification of prenuclear prominence. This is attributed to prosodic transfer effects from the participants’ first languages, Russian and Czech. The study highlights that systemic, phonetic, and distributional differences in the realization of prominence between L1 and L2 may hinder accurate perceptual judgments in English. It further posits that Russian and Czech speakers rely on different acoustic cues for prominence marking in their native languages, and that these cue-weighting strategies are transferred to English. Nevertheless, the results demonstrate that targeted phonological instruction can substantially enhance L2 learners’ perceptual sensitivity to English prosody.
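For readers unfamiliar with Rapid Prosody Transcription (RPT), it is conventionally scored by pooling binary prominence marks across annotators into a per-word prominence score; learner and native-speaker panels can then be compared on those scores. A minimal sketch of that computation, with made-up marks rather than the study's data:

```python
# Minimal sketch of RPT scoring: each annotator marks every word as prominent
# (1) or not (0); a word's p-score is the proportion of annotators marking it.
# The marks below are illustrative, not taken from the study.
import numpy as np

words = ["the", "TEACHER", "gave", "the", "STUDENTS", "homework"]
marks = np.array([        # rows = annotators, columns = words
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
])
p_scores = marks.mean(axis=0)   # per-word proportion of "prominent" marks
for w, p in zip(words, p_scores):
    print(f"{w:10s} p-score = {p:.2f}")

# Two annotator panels (e.g., learners vs. native listeners) can then be
# compared by correlating their p-score vectors.
group_a, group_b = marks[:2].mean(axis=0), marks[2:].mean(axis=0)
print("correlation:", np.corrcoef(group_a, group_b)[0, 1])
```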
Citations: 0
Comparisons of Mandarin on-focus expansion and post-focus compression between native speakers and L2 learners: Production and machine learning classification
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-07-09 | DOI: 10.1016/j.specom.2025.103280
Jing Wu, Jun Liu, Ting Wang, Sunghye Cho, Yong-cheol Lee
Korean and Mandarin are reported to have on-focus expansion and post-focus compression in marking prosodic focus. It is not clear whether Korean L2 learners of Mandarin benefit from this prosodic similarity in the production of focused tones or encounter difficulty due to the interaction between tone and intonation in a tonal language. This study examined the prosodic focus of Korean L2 learners of Mandarin through a production experiment, followed by the development of a machine learning classifier to automatically detect learners’ production of focused elements. Learners were divided into two groups according to proficiency level (advanced and intermediate) and were directly compared with Mandarin native speakers. Production results showed that intermediate-level speakers did not produce any systematic modulation for focus marking. Although the advanced-level speakers performed better than the intermediate group, their prosodic effects of focus were significantly different from those of native speakers in both focus and post-focus positions. The machine learning classification of focused elements reflected clear focus-cueing differences among the three groups. The accuracy rate was about 86 % for the native speakers, 49 % for the advanced learners, and about 34 % for the intermediate learners. The results of this study suggest that on-focus expansion and post-focus compression are not automatically transferred across languages, even when those languages share similar acoustic correlates of prosodic focus. This study also underscores that the difficulty in acquiring the prosodic structure of a tone language lies mainly in mastering tone acquisition, which impacts non-tonal-language learners and leads to ineffective realization of on-focus expansion and post-focus compression.
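The abstract does not name the classifier or feature set; a plausible minimal setup for the focus-detection step, using typical prosodic cues (mean f0, duration, intensity) with a random-forest classifier on synthetic placeholder data, might look like this:

```python
# Sketch of a focus-detection setup: classify whether a word carries focus
# from simple prosodic features. Feature names and the classifier choice are
# assumptions for illustration; the paper does not specify them here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
# Columns: mean f0 (Hz), word duration (ms), mean intensity (dB).
X = np.column_stack([
    rng.normal(220, 30, n),   # f0 tends to expand under focus
    rng.normal(250, 60, n),   # duration tends to lengthen under focus
    rng.normal(65, 5, n),
])
y = rng.integers(0, 2, n)     # 1 = focused word, 0 = unfocused (synthetic labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # per-group accuracy, as in the study
print(f"mean CV accuracy: {scores.mean():.2f}")
```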
Citations: 0
Lightweight online punctuation and capitalization restoration for streaming ASR systems
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-07-05 | DOI: 10.1016/j.specom.2025.103269
Martin Polacek, Petr Cerva, Jindrich Zdansky
This work proposes a lightweight online approach to automatic punctuation and capitalization restoration (APCR). Our method takes plain text as input and can be used in real-time speech transcription systems, e.g., for live captioning of TV or radio streams. We develop and evaluate it in a series of consecutive experiments, starting with the task of automatic punctuation restoration (APR). There, we also compare our results to another real-time APR method, which combines textual and acoustic features. The test data used for this purpose contain automatic transcripts of radio talks and TV debates. In the second part of the paper, we extend our method to the task of automatic capitalization restoration (ACR). The resulting approach uses two consecutive ELECTRA-small models complemented by simple classification heads; the first ELECTRA model restores punctuation, while the second performs capitalization. Our complete system restores question marks, commas, periods, and capitalization with a very short inference time and a low latency of just four words. We evaluate its performance for Czech and German, and also compare its results to those of an existing APCR system for English. We are also publishing the data used for our evaluation and testing.
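The four-word latency implies a streaming buffer that finalizes a word only once enough right context has arrived. A minimal sketch of such a two-stage cascade, with trivial stand-in functions in place of the two ELECTRA-small classifiers (the stubs and their rules are invented, not the authors' models):

```python
# Sketch of the streaming cascade: a sliding buffer feeds a punctuation stage,
# then a capitalization stage; a word is emitted only after four more words of
# right context have arrived (the four-word latency described above).
from collections import deque

LATENCY = 4  # words of right context required before a word is finalized

def punct_model(window):          # placeholder for the first ELECTRA classifier
    return "." if window[0].lower() in {"yes", "no"} else ""

def cap_model(word, prev_punct):  # placeholder for the second ELECTRA classifier
    return word.capitalize() if prev_punct in {".", "?", None} else word

def stream_restore(words):
    buf = deque()
    prev_punct = None             # None marks the very start of the stream
    for w in words:
        buf.append(w)
        if len(buf) > LATENCY:    # enough right context: finalize oldest word
            punct = punct_model(list(buf))
            yield cap_model(buf.popleft(), prev_punct) + punct
            prev_punct = punct
    while buf:                    # flush remaining words at end of stream
        punct = punct_model(list(buf))
        yield cap_model(buf.popleft(), prev_punct) + punct
        prev_punct = punct

print(" ".join(stream_restore("yes i agree we can start now".split())))
# -> "Yes. I agree we can start now"
```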
Citations: 0
Exploring the nuances of reduction in conversational speech: lexicalized and non-lexicalized reductions
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-06-25 | DOI: 10.1016/j.specom.2025.103268
Kübra Bodur, Corinne Fredouille, Stéphane Rauzy, Christine Meunier
In spoken language, a significant proportion of words are produced with missing or underspecified segments, a phenomenon known as reduction. In this study, we distinguish two types of reductions in spontaneous speech: lexicalized reductions, which are well-documented, regularly occurring forms driven primarily by lexical processes, and non-lexicalized reductions, which occur irregularly and lack consistent patterns or representations. The latter are inherently more difficult to detect, and existing methods struggle to capture their full range.
We introduce a novel bottom-up approach for detecting potential reductions in French conversational speech, complemented by a top-down method focused on detecting previously known reduced forms. Our bottom-up method targets sequences of at least six phonemes produced within a 230 ms window, identifying temporally condensed segments indicative of reduction.
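The stated criterion (at least six phonemes inside a 230 ms window) translates directly into a sliding scan over forced-alignment output. A minimal sketch with illustrative timings, not the corpus data:

```python
# Minimal sketch of the bottom-up criterion: flag any stretch of at least six
# consecutive phonemes whose span fits inside a 230 ms window. Real input
# would be forced-alignment triples (phoneme label, start, end) in seconds.
WINDOW, MIN_PHONES = 0.230, 6

def find_reductions(alignment):
    hits = []
    for i in range(len(alignment) - MIN_PHONES + 1):
        j = i + MIN_PHONES - 1
        span = alignment[j][2] - alignment[i][1]  # end of last - start of first
        if span <= WINDOW:
            hits.append((alignment[i][1], alignment[j][2],
                         [p for p, _, _ in alignment[i:j + 1]]))
    return hits  # overlapping hits could be merged in a fuller implementation

alignment = [("t", 0.00, 0.03), ("y", 0.03, 0.06), ("s", 0.06, 0.10),
             ("E", 0.10, 0.13), ("t", 0.13, 0.16), ("y", 0.16, 0.19),
             ("a", 0.19, 0.30)]
for start, end, phones in find_reductions(alignment):
    print(f"{start:.2f}-{end:.2f}s: {' '.join(phones)}")
# -> 0.00-0.19s: t y s E t y
```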
Our findings reveal significant variability in reduction patterns across the corpus. Lexicalized reductions displayed relatively stable and consistent ratios, whereas non-lexicalized reductions varied substantially and were strongly influenced by speaker characteristics. Notably, gender had a significant effect on non-lexicalized reductions, with male speakers showing higher reduction ratios, while no such effect was observed for lexicalized reductions. The two reduction types were influenced differently by speaking time and articulation rate. A positive correlation between lexicalized and non-lexicalized reduction ratios suggested speaker-specific tendencies.
Non-lexicalized reductions showed a higher prevalence of certain phonemes and word categories, whereas lexicalized reductions were more closely linked to morpho-syntactic roles. In a focused investigation of selected lexicalized items, we found that “tu sais” was more frequently reduced when functioning as a discourse marker than when used as a pronoun + verb construction. These results support the interpretation that lexicalized reductions are integrated into the mental lexicon, while non-lexicalized reductions are more context-dependent, further supporting the distinction between the two types of reductions.
Citations: 0
Prosodic modulation of discourse markers: A cross-linguistic analysis of conversational dynamics
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-06-21 | DOI: 10.1016/j.specom.2025.103271
Yi Shan
This paper delves into the fascinating world of prosody and pragmatics in discourse markers (DMs). We have come a long way since the early structural approaches, and now we are exploring dynamic models that reveal how prosody shapes DM interpretation in spoken discourse. Our journey takes us through various research methods, from acoustic analysis to naturalistic observations, each offering unique insights into how intonation, stress, and rhythm interact with DMs to guide conversations. Recent cross-linguistic studies, such as Ahn et al. (2024) on Korean “nay mali” and Wang et al. (2024) on Mandarin “haole,” demonstrate how prosodic detachment and contextual cues facilitate the evolution of DMs from lexical to pragmatic functions, underscoring the interplay between prosody and discourse management. Further cross-linguistic evidence comes from Vercher’s (2023) analysis of Spanish “entonces” and Siebold’s (2021) study on German “dann,” which highlight language-specific prosodic realizations of DMs in turn management and conversational closings. We are also looking at cross-linguistic patterns to uncover both universal trends and language-specific characteristics. It is amazing how cultural context plays such a crucial role in prosodic analysis. Besides, machine learning and AI are revolutionizing the field, allowing us to analyze prosodic features in massive datasets with unprecedented precision. We are now embracing multimodal analysis by combining prosody with non-verbal cues for a more holistic understanding of DMs in face-to-face communication. These findings have real-world applications, from improving speech recognition to enhancing language teaching methods. Looking ahead, we are advocating for an integrated approach that considers the dynamic interplay between prosody, pragmatics, and social context. There is still so much to explore across linguistic boundaries and diverse communicative settings. This review is not just a state-of-the-art overview. Rather, it is a roadmap for future research in this exciting field.
Citations: 0
Automatic speech recognition technology to evaluate an audiometric word recognition test: A preliminary investigation
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-06-20 | DOI: 10.1016/j.specom.2025.103270
Ayden M. Cauchi, Jaina Negandhi, Sharon L. Cushing, Karen A. Gordon
This study investigated the ability of machine learning systems to score a clinical speech perception test in which monosyllabic words are heard and repeated by a listener. The accuracy score is used in audiometric assessments, including cochlear implant candidacy and monitoring. Scoring is performed by clinicians who listen to and judge responses, which can introduce inter-rater variability and consumes clinical time. A machine learning approach could support this testing by providing increased reliability and time efficiency, particularly in children. This study focused on the Phonetically Balanced Kindergarten (PBK) word list. Spoken responses (n=1200) were recorded from 12 adults with normal hearing. These words were presented to 3 automatic speech recognizers (Whisper large, Whisper medium, Ursa) and 7 humans in 7 conditions: unaltered or, to simulate potential speech errors, altered by first or last consonant deletion or low-pass filtering at 1, 2, 4, and 6 kHz (n=6972 altered responses). Responses were scored as the same or different from the unaltered target. Data revealed that automatic speech recognizers (ASRs) correctly classified unaltered words similarly to human evaluators across conditions [mean ± 1 SE: Whisper large = 88.20 % ± 1.52 %; Whisper medium = 81.20 % ± 1.52 %; Ursa = 90.70 % ± 1.52 %; humans = 91.80 % ± 2.16 %], [F(3, 3866.2) = 23.63, p<0.001]. Classifications different from the unaltered target occurred most frequently in the first-consonant-deletion and 1 kHz filtering conditions. Fleiss kappa metrics showed that ASRs displayed higher agreement than human evaluators for both unaltered (ASRs = 0.69; humans = 0.17) and altered (ASRs = 0.56; humans = 0.51) PBK words. These results support the further development of automatic speech recognition systems to support speech perception testing.
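The core scoring step (judging an ASR transcript of the listener's repetition as the same as or different from the target word) reduces to normalized string comparison. A minimal sketch with invented responses, not the study's data:

```python
# Sketch of the scoring step: an ASR transcript of the repetition is
# normalized and compared with the target PBK word; a response counts as
# "same" only on an exact match after normalization.
import string

def normalize(text):
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def score(target, transcript):
    return normalize(target) == normalize(transcript)

responses = [("laugh", "laugh"), ("laugh", "laughs"),   # last-consonant error
             ("smile", "Smile."), ("bath", "path")]     # first-consonant error
correct = sum(score(t, r) for t, r in responses)
print(f"word recognition score: {100 * correct / len(responses):.0f} %")
# -> 50 %: "laugh"/"laugh" and "smile"/"Smile." match; the altered ones do not.
```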
Citations: 0
Speech stimulus continuum synthesis using deep learning methods
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-06-17 | DOI: 10.1016/j.specom.2025.103266
Zhu Li, Yuqing Zhang, Yanlu Xie
Creating a naturalistic speech stimulus continuum (i.e., a series of stimuli equally spaced along a specific acoustic dimension between two given categories) is an indispensable component of categorical perception studies. A common method is to manually modify the key acoustic parameter of speech sounds, yet the quality of the resulting synthetic speech is often unsatisfactory. This work explores how to use deep learning techniques for speech stimulus continuum synthesis, with the aim of improving the naturalness of the synthesized continuum. Drawing on recent advances in speech disentanglement learning, we implement a supervised disentanglement framework based on adversarial training (AT) to separate the specific acoustic feature (e.g., fundamental frequency, formant features) from the remaining content of the speech signal and achieve controllable speech stimulus generation by sampling from the latent space of the key acoustic feature. In addition, drawing on the idea of mutual information (MI) from information theory, we design an unsupervised MI-based disentanglement framework to disentangle the specific acoustic feature from the remaining content of the speech signal. Experiments on stimulus generation for several continua validate the effectiveness of our proposed method in both objective and subjective evaluations.
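Although the trained networks are not reproduced here, the controllable-generation step (sampling equally spaced points in the latent space of the disentangled feature while holding the other content fixed) can be sketched with stand-in encoder/decoder functions; `decode` and the 1-D "F0 code" below are placeholders, not the paper's models:

```python
# Sketch of continuum generation by latent-space sampling: linearly
# interpolate the disentangled feature code between two endpoints in k equal
# steps and decode each step with the content code held fixed.
import numpy as np

def make_continuum(feat_a, feat_b, content, decode, steps=7):
    stimuli = []
    for alpha in np.linspace(0.0, 1.0, steps):   # equally spaced along the dim
        z_feat = (1 - alpha) * feat_a + alpha * feat_b
        stimuli.append(decode(z_feat, content))
    return stimuli

# Toy stand-ins: a 1-D "F0 code" and a decoder that just reports its inputs.
decode = lambda z, c: f"stimulus(F0 code={z[0]:.2f}, content={c})"
continuum = make_continuum(np.array([120.0]), np.array([220.0]), "ba", decode)
print("\n".join(continuum))
```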
Citations: 0
The perception of intonational peaks and valleys: The effects of plateaux, declination and experimental task
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-06-10 | DOI: 10.1016/j.specom.2025.103267
Hae-Sung Jeon
An experiment assessed listeners’ judgement of either relative pitch height or prominence between two consecutive fundamental frequency (f0) peaks or valleys in speech. The f0 contour of the first peak or valley was kept constant, while the second was orthogonally manipulated in its height and plateau duration. Half of the stimuli had a flat baseline from which the peaks and valleys were scaled, while the other half had an overtly declining baseline. The results replicated the previous finding that f0 peaks with a long plateau are salient to listeners, while valleys are hard to process even with a plateau. Furthermore, the effect of declination was dependent on the experimental task. Listeners’ responses seemed to be directly affected by the f0 excursion size only when judging relative height between two peaks, while their prominence judgement was strongly affected by the overall impression of the pitch raising or lowering event near the perceptual target. The findings suggest that the global f0 contour, not a single representative f0 value of an intonational event, should be considered in perceptual models of intonation. The findings show an interplay between the signal, listeners’ top-down expectations, and speech perception.
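As an illustration of this stimulus design, an f0 contour with a manipulated second peak (height and plateau duration) over a declining baseline can be generated as follows; all Hz values, timings, and slopes are invented, not the paper's resynthesis settings:

```python
# Sketch of the two-peak stimulus: a fixed first peak and a second peak whose
# height and plateau duration vary, superposed on a declining baseline.
import numpy as np

def peak(t, center, height, plateau, rise=0.10):
    # Trapezoidal f0 excursion (Hz): flat top of `plateau` s, linear rise/fall.
    d = np.abs(t - center)
    return height * np.clip(1 - np.maximum(d - plateau / 2, 0) / rise, 0, 1)

t = np.linspace(0, 2.0, 400)              # a 2 s utterance
baseline = 200 - 15 * t                   # declining baseline: -15 Hz per s
f0 = (baseline
      + peak(t, 0.5, 40, 0.05)            # first peak: fixed reference
      + peak(t, 1.4, 50, 0.20))           # second peak: higher, long plateau
print(f"peak 1 max: {f0[t < 1.0].max():.1f} Hz, "
      f"peak 2 max: {f0[t >= 1.0].max():.1f} Hz")
```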
Citations: 0
A feature engineering approach for literary and colloquial Tamil speech classification using 1D-CNN
IF 2.4 | CAS Tier 3 (Computer Science) | Q2 ACOUSTICS | Pub Date: 2025-05-29 | DOI: 10.1016/j.specom.2025.103254
M. Nanmalar, S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
In ideal human-computer interaction (HCI), the colloquial form of a language would be preferred by most users, since it is the form used in their day-to-day conversations. However, there is also an undeniable necessity to preserve the formal literary form. By embracing the new and preserving the old, both service to the common man (practicality) and service to the language itself (conservation) can be rendered. Hence, it is ideal for computers to have the ability to accept, process, and converse in both forms of the language, as required. To address this, it is first necessary to identify the form of the input speech, which in the current work means distinguishing between literary and colloquial Tamil speech. Such a front-end system must consist of a simple, effective, and lightweight classifier trained on a few effective features that are capable of capturing the underlying patterns of the speech signal. To accomplish this, a one-dimensional convolutional neural network (1D-CNN) that learns the envelope of features across time is proposed. The network is trained initially on a select number of handcrafted features, and then on Mel frequency cepstral coefficients (MFCC) for comparison. The handcrafted features were selected to address various aspects of speech such as spectral and temporal characteristics, prosody, and voice quality. The features are first analyzed by considering ten parallel utterances and observing the trend of each feature with respect to time. The proposed 1D-CNN trained on the handcrafted features offers an F1 score of 0.9803, while the one trained on the MFCC offers an F1 score of 0.9895. In light of this, feature ablation and feature combination are explored. When the best-ranked handcrafted features from the feature ablation study are combined with the MFCC, they offer the best results, with an F1 score of 0.9946.
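A minimal sketch of such a 1D-CNN, convolving along the time axis of per-frame features and pooling over time, is given below; the channel counts, kernel sizes, and 13-dimensional feature assumption are illustrative, not the paper's exact configuration:

```python
# Sketch of a 1D-CNN over feature trajectories: convolutions run along the
# time axis, and adaptive pooling makes the model length-independent before
# a binary literary/colloquial classification head.
import torch
import torch.nn as nn

class TamilFormClassifier(nn.Module):
    def __init__(self, n_features=13, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),      # pool over time: any utterance length
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                 # x: (batch, n_features, n_frames)
        return self.head(self.net(x).squeeze(-1))

model = TamilFormClassifier()
logits = model(torch.randn(8, 13, 300))  # 8 utterances, 300 frames each
print(logits.shape)                      # torch.Size([8, 2])
```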
Citations: 0