Speech Communication最新文献_第10页

Comparing Levenshtein distance and dynamic time warping in predicting listeners’ judgments of accent distance 比较Levenshtein距离与动态时间扭曲对听者口音距离判断的预测

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-21 DOI: 10.1016/j.specom.2023.102987

Holly C. Lind-Combs , Tessa Bent , Rachael F. Holt , Cynthia G. Clopper , Emma Brown

Listeners attend to variation in segmental and prosodic cues when judging accent strength. The relative contributions of these cues to perceptions of accentedness in English remains open for investigation, although objective accent distance measures (such as Levenshtein distance) appear to be reliable tools for predicting perceptual distance. Levenshtein distance, however, only accounts for phonemic information in the signal. The purpose of the current study was to examine the relative contributions of phonemic (Levenshtein) and holistic acoustic (dynamic time warping) distances from the local accent to listeners’ accent rankings for nine non-local native and nonnative accents. Listeners (n = 52) ranked talkers on perceived distance from the local accent (Midland American English) using a ladder task for three sentence-length stimuli. Phonemic and holistic acoustic distances between Midland American English and the other accents were quantified using both weighted and unweighted Levenshtein distance measures, and dynamic time warping (DTW). Results reveal that all three metrics contribute to perceived accent distance, with the weighted Levenshtein slightly outperforming the other measures. Moreover, the relative contribution of phonemic and holistic acoustic cues was driven by the speaker's accent. Both nonnative and non-local native accents were included in this study, and the benefits of considering both of these accent groups in studying phonemic and acoustic cues used by listeners is discussed.

在判断重音强度时，听众会注意到分段和韵律线索的变化。尽管客观的重音距离测量（如Levenstein距离）似乎是预测感知距离的可靠工具，但这些线索对英语重音感知的相对贡献仍有待研究。然而，Levenstein距离只解释了信号中的音位信息。本研究的目的是检验当地口音的音位（Levenstein）和整体声学（动态时间扭曲）距离对九种非当地本地和非本地口音的听众口音排名的相对贡献。听众（n=52）使用三个句子长度刺激的阶梯任务，根据与当地口音（米德兰美式英语）的感知距离对说话者进行排名。使用加权和未加权的Levenstein距离测量以及动态时间扭曲（DTW）来量化米德兰美式英语和其他口音之间的音位和整体声学距离。结果表明，这三个指标都有助于感知重音距离，加权的Levenstein略优于其他指标。此外，音位和整体声学线索的相对贡献是由说话者的口音驱动的。本研究包括非本地和非本地口音，并讨论了在研究听众使用的音位和声学线索时同时考虑这两种口音组的好处。

{"title":"Comparing Levenshtein distance and dynamic time warping in predicting listeners’ judgments of accent distance","authors":"Holly C. Lind-Combs , Tessa Bent , Rachael F. Holt , Cynthia G. Clopper , Emma Brown","doi":"10.1016/j.specom.2023.102987","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102987","url":null,"abstract":"<div><p>Listeners attend to variation in segmental and prosodic cues when judging accent strength. The relative contributions of these cues to perceptions of accentedness in English remains open for investigation, although objective accent distance measures (such as Levenshtein distance) appear to be reliable tools for predicting perceptual distance. Levenshtein distance, however, only accounts for phonemic information in the signal. The purpose of the current study was to examine the relative contributions of phonemic (Levenshtein) and holistic acoustic (dynamic time warping) distances from the local accent to listeners’ accent rankings for nine non-local native and nonnative accents. Listeners (<em>n</em> = 52) ranked talkers on perceived distance from the local accent (Midland American English) using a ladder task for three sentence-length stimuli. Phonemic and holistic acoustic distances between Midland American English and the other accents were quantified using both weighted and unweighted Levenshtein distance measures, and dynamic time warping (DTW). Results reveal that all three metrics contribute to perceived accent distance, with the weighted Levenshtein slightly outperforming the other measures. Moreover, the relative contribution of phonemic and holistic acoustic cues was driven by the speaker's accent. Both nonnative and non-local native accents were included in this study, and the benefits of considering both of these accent groups in studying phonemic and acoustic cues used by listeners is discussed.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 102987"},"PeriodicalIF":3.2,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49702800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multimodal attention for lip synthesis using conditional generative adversarial networks 基于条件生成对抗网络的多模态注意力唇部合成

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102959

Andrea Vidal, Carlos Busso

The synthesis of lip movements is an important problem for a socially interactive agent (SIA). It is important to generate lip movements that are synchronized with speech and have realistic co-articulation. We hypothesize that combining lexical information (i.e., sequence of phonemes) and acoustic features can lead not only to models that generate the correct lip movements matching the articulatory movements, but also to trajectories that are well synchronized with the speech emphasis and emotional content. This work presents attention-based frameworks that use acoustic and lexical information to enhance the synthesis of lip movements. The lexical information is obtained from automatic speech recognition (ASR) transcriptions, broadening the range of applications of the proposed solution. We propose models based on conditional generative adversarial networks (CGAN) with self-modality attention and cross-modalities attention mechanisms. These models allow us to understand which frames are considered more in the generation of lip movements. We animate the synthesized lip movements using blendshapes. These animations are used to compare our proposed multimodal models with alternative methods, including unimodal models implemented with either text or acoustic features. We rely on subjective metrics using perceptual evaluations and an objective metric based on the LipSync model. The results show that our proposed models with attention mechanisms are preferred over the baselines on the perception of naturalness. The addition of cross-modality attentions and self-modality attentions has a significant positive impact on the performance of the generated sequences. We observe that lexical information provides valuable information even when the transcriptions are not perfect. The improved performance observed by the multimodal system confirms the complementary information provided by the speech and text modalities.

唇动作的合成是社会互动智能体(SIA)的一个重要问题。重要的是产生与言语同步的嘴唇运动，并具有现实的协同发音。我们假设，将词汇信息(即音素序列)和声学特征结合起来，不仅可以产生与发音运动相匹配的正确嘴唇运动模型，还可以产生与语音重点和情感内容同步的轨迹。这项工作提出了基于注意力的框架，使用声学和词汇信息来增强唇部运动的合成。词汇信息从自动语音识别(ASR)转录中获得，扩大了所提出解决方案的应用范围。我们提出了基于自模态注意和跨模态注意机制的条件生成对抗网络(CGAN)模型。这些模型使我们能够了解哪些框架在嘴唇运动的产生中被考虑得更多。我们使用混合形状动画合成的嘴唇运动。这些动画用于将我们提出的多模态模型与其他方法进行比较，包括使用文本或声学特征实现的单模态模型。我们依靠主观指标使用感知评估和基于LipSync模型的客观指标。结果表明，我们提出的具有注意机制的模型在自然性感知方面优于基线。交叉模态注意和自模态注意的加入对生成序列的性能有显著的正向影响。我们观察到，即使转录不完美，词汇信息也能提供有价值的信息。多模态系统所观察到的改进性能证实了语音和文本模态所提供的互补信息。

{"title":"Multimodal attention for lip synthesis using conditional generative adversarial networks","authors":"Andrea Vidal, Carlos Busso","doi":"10.1016/j.specom.2023.102959","DOIUrl":"10.1016/j.specom.2023.102959","url":null,"abstract":"<div><p>The synthesis of lip movements is an important problem for a <em>socially interactive agent</em> (SIA). It is important to generate lip movements that are synchronized with speech and have realistic co-articulation. We hypothesize that combining lexical information (i.e., sequence of phonemes) and acoustic features can lead not only to models that generate the correct lip movements matching the articulatory movements, but also to trajectories that are well synchronized with the speech emphasis and emotional content. This work presents attention-based frameworks that use acoustic and lexical information to enhance the synthesis of lip movements. The lexical information is obtained from <em>automatic speech recognition</em> (ASR) transcriptions, broadening the range of applications of the proposed solution. We propose models based on <em>conditional generative adversarial networks</em> (CGAN) with self-modality attention and cross-modalities attention mechanisms. These models allow us to understand which frames are considered more in the generation of lip movements. We animate the synthesized lip movements using blendshapes. These animations are used to compare our proposed multimodal models with alternative methods, including unimodal models implemented with either text or acoustic features. We rely on subjective metrics using perceptual evaluations and an objective metric based on the LipSync model. The results show that our proposed models with attention mechanisms are preferred over the baselines on the perception of naturalness. The addition of cross-modality attentions and self-modality attentions has a significant positive impact on the performance of the generated sequences. We observe that lexical information provides valuable information even when the transcriptions are not perfect. The improved performance observed by the multimodal system confirms the complementary information provided by the speech and text modalities.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102959"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43953120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Correction of whitespace and word segmentation in noisy Pashto text using CRF 用CRF校正普什图语带噪声文本中的空白和分词

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102970

Ijazul Haq, Weidong Qiu, Jie Guo, Peng Tang

Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes the Pashto word segmentation unique and challenging. Moreover, Pashto is a low-resource, non-standardized language with no established rules for the correct usage of whitespace that leads to two typical spelling errors, space-omission, and space-insertion. These errors significantly affect the performance of the word segmenter. This study aims to develop a state-of-the-art word segmenter for Pashto, with a proofing tool to identify and correct the position of space in a noisy text. The CRF algorithm is incorporated to train two machine learning models for these tasks. For models' training, we have developed a text corpus of nearly 3.5 million words, annotated for the correct positions of spaces and explicit word boundary information using a lexicon-based technique, and then manually checked for errors. The experimental results of the model are very satisfactory, where the F1-scores are 99.2% and 96.7% for the proofing model and word segmenter, respectively.

分词是将文本分割成单词的过程。在英语和大多数欧洲语言中，单词边界由空格标识，而在普什图语中，没有明确的单词分隔符。普什图语使用空格分隔单词，但并不一致，它不能被认为是一个可靠的词边界标识符。这种不一致使得普什图语分词独特而富有挑战性。此外，普什图语是一种资源匮乏、非标准化的语言，没有建立正确使用空白的规则，这导致了两种典型的拼写错误:空格省略和空格插入。这些错误严重影响了分词器的性能。本研究旨在为普什图语开发一个最先进的分词器，并提供一个校对工具来识别和纠正嘈杂文本中的空间位置。结合CRF算法来训练两个机器学习模型来完成这些任务。对于模型的训练，我们开发了一个近350万单词的文本语料库，使用基于词典的技术标注空格的正确位置和明确的词边界信息，然后手动检查错误。该模型的实验结果非常令人满意，其中校对模型和分词器的f1得分分别为99.2%和96.7%。

{"title":"Correction of whitespace and word segmentation in noisy Pashto text using CRF","authors":"Ijazul Haq, Weidong Qiu, Jie Guo, Peng Tang","doi":"10.1016/j.specom.2023.102970","DOIUrl":"10.1016/j.specom.2023.102970","url":null,"abstract":"<div><p>Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes the Pashto word segmentation unique and challenging. Moreover, Pashto is a low-resource, non-standardized language with no established rules for the correct usage of whitespace that leads to two typical spelling errors, space-omission, and space-insertion. These errors significantly affect the performance of the word segmenter. This study aims to develop a state-of-the-art word segmenter for Pashto, with a proofing tool to identify and correct the position of space in a noisy text. The CRF algorithm is incorporated to train two machine learning models for these tasks. For models' training, we have developed a text corpus of nearly 3.5 million words, annotated for the correct positions of spaces and explicit word boundary information using a lexicon-based technique, and then manually checked for errors. The experimental results of the model are very satisfactory, where the F1-scores are 99.2% and 96.7% for the proofing model and word segmenter, respectively.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102970"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46845071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Fractional feature-based speech enhancement with deep neural network 基于分数特征的深度神经网络语音增强

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102971

Liyun Xu, Tong Zhang

Speech enhancement (SE) has become a considerable promise application of deep learning. Commonly, the deep neural network (DNN) in the SE task is trained to learn a mapping from the noisy features to the clean. However, the features are usually extracted in the time or frequency domain. In this paper, the improved features in the fractional domain are presented based on the flexible character of fractional Fourier transform (FRFT). First, the distribution characters and differences of the speech signal and the noise in the fractional domain are investigated. Second, the L1-optimal FRFT spectrum and the feature matrix constructed from a set of FRFT spectrums are served as the training features in DNN and applied in the SE. A series of pre-experiments conducted in various different fractional transform orders illustrate that the L1-optimal FRFT-DNN-based SE method can achieve a great enhancement result compared with the methods based on another single fractional spectrum. Moreover, the matrix of FRFT-DNN-based SE performs better under the same conditions. Finally, compared with other two typically SE models, the experiment results indicate that the proposed method could reach significantly performance in different SNRs with unseen noise types. The conclusions confirm the advantages of using the proposed improved features in the fractional domain.

语音增强(SE)已经成为深度学习中一个非常有前途的应用。通常，深度神经网络(DNN)在SE任务中被训练学习从噪声特征到干净特征的映射。然而，这些特征通常是在时域或频域提取的。本文基于分数阶傅里叶变换(FRFT)的柔性特性，提出了分数阶域的改进特征。首先，研究了语音信号和噪声在分数阶域的分布特征和差异。其次，将l1最优FRFT谱和由一组FRFT谱构建的特征矩阵作为DNN中的训练特征，并应用于SE中;在不同分数阶变换阶下进行的一系列预实验表明，与基于其他单分数阶谱的方法相比，基于l1最优frft - dnn的SE方法可以获得较大的增强效果。在相同条件下，基于frft - dnn的SE矩阵表现更好。最后，通过与其他两种典型SE模型的比较，实验结果表明，该方法在不同信噪比下具有显著的性能。结论证实了在分数阶域使用改进特征的优点。

{"title":"Fractional feature-based speech enhancement with deep neural network","authors":"Liyun Xu, Tong Zhang","doi":"10.1016/j.specom.2023.102971","DOIUrl":"10.1016/j.specom.2023.102971","url":null,"abstract":"<div><p>Speech enhancement (SE) has become a considerable promise application of deep learning. Commonly, the deep neural network (DNN) in the SE task is trained to learn a mapping from the noisy features to the clean. However, the features are usually extracted in the time or frequency domain. In this paper, the improved features in the fractional domain are presented based on the flexible character of fractional Fourier transform (FRFT). First, the distribution characters and differences of the speech signal and the noise in the fractional domain are investigated. Second, the L1-optimal FRFT spectrum and the feature matrix constructed from a set of FRFT spectrums are served as the training features in DNN and applied in the SE. A series of pre-experiments conducted in various different fractional transform orders illustrate that the L1-optimal FRFT-DNN-based SE method can achieve a great enhancement result compared with the methods based on another single fractional spectrum. Moreover, the matrix of FRFT-DNN-based SE performs better under the same conditions. Finally, compared with other two typically SE models, the experiment results indicate that the proposed method could reach significantly performance in different SNRs with unseen noise types. The conclusions confirm the advantages of using the proposed improved features in the fractional domain.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102971"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43924981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Investigating prosodic entrainment from global conversations to local turns and tones in Mandarin conversations 普通话对话中从全球对话到局部转折和声调的韵律娱乐研究

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102961

Zhihua Xia , Julia Hirschberg , Rivka Levitan

Previous research on acoustic entrainment has paid less attention to tones than to other prosodic features. This study sets a hierarchical framework by three layers of conversations, turns and tone units, investigates prosodic entrainment in Mandarin spontaneous dialogues at each level, and compares the three. Our research has found that (1) global and local entrainment exist independently, and local entrainment is more evident than global; (2) variation exists in prosodic features’ contribution to entrainment at three levels: amplitude features exhibiting more prominent entrainment at both global and local levels, and speaking-rate and F0 features showing more prominence at the local levels; and (3) no convergence is found at the conversational level, at the turn level or over tone units.

以往关于声干扰的研究对音调的关注比对其他韵律特征的关注要少。本研究将普通话自发对话的韵律蕴涵分为对话、转折和声调三个层次，并对三个层次进行比较。研究发现:(1)全局夹带和局部夹带是独立存在的，局部夹带比全局夹带更明显;(2)韵律特征在三个层次上对夹带的贡献存在差异:幅度特征在整体和局部层次上都表现出更突出的夹带，语速和F0特征在局部层次上表现出更突出的夹带;(3)在会话水平、转折水平或音调单位上没有发现收敛。

引用次数: 0

The dependence of accommodation processes on conversational experience 调节过程对会话体验的依赖

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102963

L. Ann Burchfield, Mark Antoniou, Anne Cutler

Conversational partners accommodate to one another's speech, a process that greatly facilitates perception. This process occurs in both first (L1) and second languages (L2); however, recent research has revealed that adaptation can be language-specific, with listeners sometimes applying it in one language but not in another. Here, we investigate whether a supply of novel talkers impacts whether the adaptation is applied, testing Mandarin-English groups whose use of their two languages involves either an extensive or a restricted set of social situations. Perceptual learning in Mandarin and English is examined across two similarly-constituted groups in the same English-speaking environment: (a) heritage language users with Mandarin as family L1 and English as environmental language, and (b) international students with Mandarin as L1 and English as later-acquired L2. In English, exposure to an ambiguous sound in lexically disambiguating contexts prompted the expected retuning of phonemic boundaries in categorisation for the heritage users, but not for the students. In Mandarin, the opposite appeared: the heritage users showed no adaptation, but the students did adapt. In each case where learning did not appear, participants reported using the language in question with fewer interlocutors. The results support the view that successful retuning ability in any language requires regular conversational interaction with novel talkers.

对话伙伴相互适应对方的讲话，这一过程极大地促进了感知。这个过程发生在第一语言(L1)和第二语言(L2)中;然而，最近的研究表明，适应可以是特定于语言的，听众有时会在一种语言中使用它，而不是在另一种语言中。在这里，我们调查了新说话者的供应是否会影响适应性的应用，测试了使用两种语言的普通话-英语群体，他们的使用涉及广泛或有限的社会情境。在相同的英语环境中，对普通话和英语的感知学习在两个相似的群体中进行了研究:(a)以普通话为家庭第一语言、英语为环境语言的传统语言使用者，以及(b)以普通话为第一语言、英语为后来习得的第二语言的国际学生。在英语中，在词汇消歧的语境中，暴露于一个模糊的声音会促使传统使用者在分类中回归音位边界，但对学生却没有。在普通话中，情况正好相反:文物使用者没有表现出适应能力，但学生们确实适应了。在每一种没有学习的情况下，参与者报告说，他们使用的语言与较少的对话者。研究结果支持了这样一种观点，即任何语言的成功回归能力都需要与陌生人进行定期的对话互动。

{"title":"The dependence of accommodation processes on conversational experience","authors":"L. Ann Burchfield, Mark Antoniou, Anne Cutler","doi":"10.1016/j.specom.2023.102963","DOIUrl":"10.1016/j.specom.2023.102963","url":null,"abstract":"<div><p>Conversational partners accommodate to one another's speech, a process that greatly facilitates perception. This process occurs in both first (L1) and second languages (L2); however, recent research has revealed that adaptation can be language-specific, with listeners sometimes applying it in one language but not in another. Here, we investigate whether a supply of novel talkers impacts whether the adaptation is applied, testing Mandarin-English groups whose use of their two languages involves either an extensive or a restricted set of social situations. Perceptual learning in Mandarin and English is examined across two similarly-constituted groups in the same English-speaking environment: (a) heritage language users with Mandarin as family L1 and English as environmental language, and (b) international students with Mandarin as L1 and English as later-acquired L2. In English, exposure to an ambiguous sound in lexically disambiguating contexts prompted the expected retuning of phonemic boundaries in categorisation for the heritage users, but not for the students. In Mandarin, the opposite appeared: the heritage users showed no adaptation, but the students did adapt. In each case where learning did not appear, participants reported using the language in question with fewer interlocutors. The results support the view that successful retuning ability in any language requires regular conversational interaction with novel talkers.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102963"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47047534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech 空间和扬声器感知声学建模与有效的数据增强识别多阵列会话语音

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102958

Li Chai , Hang Chen , Jun Du , Qing-Feng Liu , Chin-Hui Lee

We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve system performances of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM which only uses spectral features from a target speaker as inputs, the inputs to SSA-AM consists of speech features from both the target and interfering speakers, which contain discriminative information from different speakers, including spatial information embedded in interaural phase differences (IPDs) between individual interfering speakers and the target speaker. In the proposed SSA-AM framework, we explore four acoustic model architectures consisting of different combinations of four neural networks, namely deep residual network, factorized time delay neural network, self-attention and residual bidirectional long short-term memory neural network. Various data augmentation techniques are adopted to expand the training data to include different options of beamformed speech obtained from multi-channel speech enhancement. Evaluated on the recent CHiME-6 Challenge Track 1, our proposed SSA-AM framework achieves consistent recognition performance improvements when compared with the official baseline acoustic models. Furthermore, SSA-AM outperforms acoustic models without explicitly using the space and speaker information. Finally, our data augmentation schemes are shown to be especially effective for compact model designs. Code is released at https://github.com/coalboss/SSA_AM.

我们提出了一种空间和说话人感知（SSA）的声学建模（AM）方法，称为SSA-AM，以提高远程多阵列会话场景中自动语音识别（ASR）的系统性能。与仅使用来自目标扬声器的频谱特征作为输入的传统AM相比，SSA-AM的输入由来自目标扬声器和干扰扬声器的语音特征组成，其包含来自不同扬声器的判别信息，包括嵌入在各个干扰扬声器和目标扬声器之间的耳间相位差（IPD）中的空间信息。在所提出的SSA-AM框架中，我们探索了四种由四种神经网络的不同组合组成的声学模型架构，即深度残差网络、因子化时延神经网络、自注意和残差双向长短期记忆神经网络。采用各种数据增强技术来扩展训练数据，以包括从多通道语音增强获得的波束成形语音的不同选项。在最近的CHiME-6挑战赛道1上进行了评估，与官方基线声学模型相比，我们提出的SSA-AM框架实现了一致的识别性能改进。此外，SSA-AM在没有明确使用空间和扬声器信息的情况下优于声学模型。最后，我们的数据扩充方案被证明对紧凑模型设计特别有效。代码发布于https://github.com/coalboss/SSA_AM.

{"title":"Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech","authors":"Li Chai , Hang Chen , Jun Du , Qing-Feng Liu , Chin-Hui Lee","doi":"10.1016/j.specom.2023.102958","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102958","url":null,"abstract":"<div><p>We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve system performances of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM which only uses spectral features from a target speaker as inputs, the inputs to SSA-AM consists of speech features from both the target and interfering speakers, which contain discriminative information from different speakers, including spatial information embedded in interaural phase differences (IPDs) between individual interfering speakers and the target speaker. In the proposed SSA-AM framework, we explore four acoustic model architectures consisting of different combinations of four neural networks, namely deep residual network, factorized time delay neural network, self-attention and residual bidirectional long short-term memory neural network. Various data augmentation techniques are adopted to expand the training data to include different options of beamformed speech obtained from multi-channel speech enhancement. Evaluated on the recent CHiME-6 Challenge Track 1, our proposed SSA-AM framework achieves consistent recognition performance improvements when compared with the official baseline acoustic models. Furthermore, SSA-AM outperforms acoustic models without explicitly using the space and speaker information. Finally, our data augmentation schemes are shown to be especially effective for compact model designs. Code is released at <span>https://github.com/coalboss/SSA_AM</span><svg><path></path></svg>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102958"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet 阿塞拜疆手语dactyl字母的混合单词识别系统和数据集的开发

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-09-01 DOI: 10.1016/j.specom.2023.102960

Jamaladdin Hasanov , Nigar Alishzade , Aykhan Nazimzade , Samir Dadashzade , Toghrul Tahirov

The paper introduces a real-time fingerspelling-to-text translation system for the Azerbaijani Sign Language (AzSL), targeted to the clarification of the words with no available or ambiguous signs. The system consists of both statistical and probabilistic models, used in the sign recognition and sequence generation phases. Linguistic, technical, and human–computer interaction-related challenges, which are usually not considered in publicly available sign-based recognition application programming interfaces and tools, are addressed in this study. The specifics of the AzSL are reviewed, feature selection strategies are evaluated, and a robust model for the translation of hand signs is suggested. The two-stage recognition model exhibits high accuracy during real-time inference. Considering the lack of a publicly available dataset with the benchmark, a new, comprehensive AzSL dataset consisting of 13,444 samples collected by 221 volunteers is described and made publicly available for the sign language recognition community. To extend the dataset and make the model robust to changes, augmentation methods and their effect on the performance are analyzed. A lexicon-based validation method used for the probabilistic analysis and candidate word selection enhances the probability of the recognized phrases. Experiments delivered 94% accuracy on the test dataset, which was close to the real-time user experience. The dataset and implemented software are shared in a public repository for review and further research (CeDAR, 2021; Alishzade et al., 2022). The work has been presented at TeknoFest 2022 and ranked as the first in the category of social-oriented technologies.

本文介绍了阿塞拜疆手语(AzSL)的实时手指拼写到文本翻译系统，旨在澄清没有可用或模棱两可的符号的单词。该系统由统计模型和概率模型组成，用于符号识别和序列生成阶段。语言、技术和人机交互相关的挑战，通常在公开可用的基于符号的识别应用程序编程接口和工具中没有被考虑，在本研究中得到解决。回顾了手语翻译的具体情况，评估了特征选择策略，并提出了一个稳健的手语翻译模型。两阶段识别模型在实时推理中具有较高的准确率。考虑到缺乏具有基准的公开可用数据集，本文描述了由221名志愿者收集的13,444个样本组成的新的综合AzSL数据集，并将其公开提供给手语识别社区。为了扩展数据集并使模型对变化具有鲁棒性，分析了增强方法及其对性能的影响。基于词典的验证方法用于概率分析和候选词选择，提高了识别短语的概率。实验在测试数据集上提供了94%的准确率，接近实时用户体验。数据集和实现的软件在公共存储库中共享，以供审查和进一步研究(CeDAR, 2021;Alishzade et al.， 2022)。这项工作已在TeknoFest 2022上展示，并在面向社会的技术类别中排名第一。

{"title":"Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet","authors":"Jamaladdin Hasanov , Nigar Alishzade , Aykhan Nazimzade , Samir Dadashzade , Toghrul Tahirov","doi":"10.1016/j.specom.2023.102960","DOIUrl":"10.1016/j.specom.2023.102960","url":null,"abstract":"<div><p>The paper introduces a real-time fingerspelling-to-text translation system for the Azerbaijani Sign Language (AzSL), targeted to the clarification of the words with no available or ambiguous signs. The system consists of both statistical and probabilistic models, used in the sign recognition and sequence generation phases. Linguistic, technical, and <em>human–computer interaction</em>-related challenges, which are usually not considered in publicly available sign-based recognition application programming interfaces and tools, are addressed in this study. The specifics of the AzSL are reviewed, feature selection strategies are evaluated, and a robust model for the translation of hand signs is suggested. The two-stage recognition model exhibits high accuracy during real-time inference. Considering the lack of a publicly available dataset with the benchmark, a new, comprehensive AzSL dataset consisting of 13,444 samples collected by 221 volunteers is described and made publicly available for the sign language recognition community. To extend the dataset and make the model robust to changes, augmentation methods and their effect on the performance are analyzed. A lexicon-based validation method used for the probabilistic analysis and candidate word selection enhances the probability of the recognized phrases. Experiments delivered 94% accuracy on the test dataset, which was close to the real-time user experience. The dataset and implemented software are shared in a public repository for review and further research (CeDAR, 2021; Alishzade et al., 2022). The work has been presented at TeknoFest 2022 and ranked as the first in the category of <em>social-oriented technologies</em>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102960"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46498442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

A new time–frequency representation based on the tight framelet packet for telephone-band speech coding 基于紧小帧包的电话频段语音编码时频表示方法

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-07-01 DOI: 10.1016/j.specom.2023.102954

Souhir Bousselmi, Kaïs Ouni

To improve the quality and intelligibility of telephone-band speech coding, a new time–frequency representation based on a tight framelet packet transform is proposed in this paper. In the context of speech coding, the effectiveness of this representation stems from its resilience to quantization noise, and reconstruction stability. Moreover, it offers a sub-band decomposition and good time–frequency localization according to the critical bands of the human ear. The coded signal is obtained using dynamic bit allocation and optimal quantization of normalized framelet coefficients. The performances of the corresponding method are compared to the critically sampled wavelet packet transform. Extensive simulation revealed that the proposed speech coding scheme, which incorporates the tight framelet packet transform performs better than that based on the critically sampled wavelet packet transform. Furthermore, it ensures a high bit-rate reduction with negligible degradation in speech quality. The proposed coder is found to outperform the standard telephone-band speech coders in term of objective measures and subjective evaluations including a formal listening test. The subjective quality of our codec at 4 kbps is almost identical to the reference G.711 codec operating at 64 kbps.

为了提高电话频段语音编码的质量和可理解性，提出了一种基于紧小帧包变换的时频表示方法。在语音编码的背景下，这种表示的有效性源于其对量化噪声的弹性和重建的稳定性。根据人耳的关键波段进行子波段分解，具有较好的时频定位能力。采用动态位分配和归一化小帧系数的最优量化方法获得编码信号。将该方法的性能与临界采样小波包变换进行了比较。大量的仿真结果表明，采用严格小波包变换的语音编码方案比基于严格采样小波包变换的语音编码方案性能更好。此外，它确保了高比特率的降低，而语音质量的退化可以忽略不计。该编码器在客观测量和主观评价(包括正式的听力测试)方面优于标准电话频段语音编码器。我们编解码器在4kbps下的主观质量几乎与参考G.711编解码器在64kbps下的工作质量相同。

{"title":"A new time–frequency representation based on the tight framelet packet for telephone-band speech coding","authors":"Souhir Bousselmi, Kaïs Ouni","doi":"10.1016/j.specom.2023.102954","DOIUrl":"10.1016/j.specom.2023.102954","url":null,"abstract":"<div><p>To improve the quality and intelligibility of telephone-band speech coding, a new time–frequency representation based on a tight framelet packet transform is proposed in this paper. In the context of speech coding, the effectiveness of this representation stems from its resilience to quantization noise, and reconstruction stability. Moreover, it offers a sub-band decomposition and good time–frequency localization according to the critical bands of the human ear. The coded signal is obtained using dynamic bit allocation and optimal quantization of normalized framelet coefficients. The performances of the corresponding method are compared to the critically sampled wavelet packet transform. Extensive simulation revealed that the proposed speech coding scheme, which incorporates the tight framelet packet transform performs better than that based on the critically sampled wavelet packet transform. Furthermore, it ensures a high bit-rate reduction with negligible degradation in speech quality. The proposed coder is found to outperform the standard telephone-band speech coders in term of objective measures and subjective evaluations including a formal listening test. The subjective quality of our codec at 4 kbps is almost identical to the reference G.711 codec operating at 64 kbps.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"152 ","pages":"Article 102954"},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46455896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Real-time intelligibility affects the realization of French word-final schwa 实时可理解性影响法语词尾弱读音的实现

IF 3.2 3区计算机科学 Q2 ACOUSTICS

Speech Communication

Pub Date : 2023-07-01 DOI: 10.1016/j.specom.2023.102962

Georgia Zellou , Ioana Chitoran , Ziqi Zhou

Speech variation has been hypothesized to reflect both speaker-internal influences of lexical access on production and adaptive modifications to make words more intelligible to the listener. The current study considers categorical and gradient variation in the production of word-final schwa in French as explained by lexical access processes, phonological, and/or listener-oriented influences on speech production, while controlling for other factors. To that end, native French speakers completed two laboratory production tasks. In Experiment 1, speakers produced 32 monosyllabic words varying in lexical frequency in a word list production task with no listener feedback. In Experiment 2, speakers produced the same words to an interlocutor while completing a map task varying listener comprehension success across trials: in half the trials, the words are correctly perceived by the interlocutor; in half, there is misunderstanding. Results reveal that speakers are more likely to produce word-final schwa when there is explicit pressure to be intelligible to the interlocutor. Also, when schwa is produced, it is longer preceding a consonant-initial word. Taken together, findings suggest that there are both phonological and clarity-oriented influences on word-final schwa realization in French.

言语变异被认为既反映了说话者内部词汇获取对产出的影响，也反映了听者为使话语更容易理解而进行的适应性修改。在控制其他因素的同时，本研究考虑了法语词尾弱读音产生的分类和梯度变化，解释了词汇获取过程、语音和/或听者导向对语音产生的影响。为此，以法语为母语的人完成了两项实验室生产任务。在实验1中，说话者在一个没有听众反馈的单词列表生成任务中产生32个单音节单词，这些单词的词汇频率各不相同。在实验2中，说话者向对话者说出了相同的单词，同时完成了一项地图任务，在不同的试验中，听者的理解成功率不同:在一半的试验中，对话者正确地理解了这些单词;其中一半是误解。结果表明，当说话者有明确的压力要让对话者听懂时，说话者更有可能产生弱读音。而且，当弱读元音出现时，它要在辅音开头的单词之前更长一些。综上所述，研究结果表明，语音和清晰导向对法语词尾弱读音的实现有双重影响。

{"title":"Real-time intelligibility affects the realization of French word-final schwa","authors":"Georgia Zellou , Ioana Chitoran , Ziqi Zhou","doi":"10.1016/j.specom.2023.102962","DOIUrl":"10.1016/j.specom.2023.102962","url":null,"abstract":"<div><p>Speech variation has been hypothesized to reflect both speaker-internal influences of lexical access on production and adaptive modifications to make words more intelligible to the listener. The current study considers categorical and gradient variation in the production of word-final schwa in French as explained by lexical access processes, phonological, and/or listener-oriented influences on speech production, while controlling for other factors. To that end, native French speakers completed two laboratory production tasks. In Experiment 1, speakers produced 32 monosyllabic words varying in lexical frequency in a word list production task with no listener feedback. In Experiment 2, speakers produced the same words to an interlocutor while completing a map task varying listener comprehension success across trials: in half the trials, the words are correctly perceived by the interlocutor; in half, there is misunderstanding. Results reveal that speakers are more likely to produce word-final schwa when there is explicit pressure to be intelligible to the interlocutor. Also, when schwa is produced, it is longer preceding a consonant-initial word. Taken together, findings suggest that there are both phonological and clarity-oriented influences on word-final schwa realization in French.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"152 ","pages":"Article 102962"},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45132784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0