
Latest publications: 5th International Conference on Spoken Language Processing (ICSLP 1998)

Speech recognition based on the distance calculation between intermediate phonetic code sequences in symbolic domain
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-297
Kazuyo Tanaka, Hiroaki Kojima
This paper proposes a speech recognition method as an alternative to conventional sample-based statistical methods, which are characterized by the need for large amounts of training speech data. To avoid this heavy processing, the proposed method employs an intermediate phonetic code system and calculates the distance between phonetic code sequences in the symbolic domain. It achieves high efficiency compared with direct processing of acoustic correlates, although some deterioration in recognition scores is to be expected. We first describe the distance calculation method and present specific procedures for obtaining the intermediate code sequence from input utterances and for spotting words using the distance calculation in the symbolic domain. Preliminary experiments were conducted on isolated word recognition and phrase spotting in continuous speech. Word recognition results indicate that the recognition scores obtained by the proposed method are comparable to those of ordinary phone-HMM-based speech recognition.
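The central operation here, computing a distance between phonetic code sequences in the symbolic domain, can be illustrated with a plain dynamic-programming edit distance. A minimal Python sketch; the code inventory, unit costs, and lexicon are hypothetical stand-ins, not the paper's actual intermediate code system:

```python
# Minimal sketch: distance between phonetic code sequences via dynamic
# programming (Levenshtein-style). The code inventory and unit costs are
# hypothetical illustrations, not the paper's intermediate code set.

def seq_distance(ref, hyp, sub_cost=1.0, ins_cost=1.0, del_cost=1.0):
    """Edit distance between two phonetic code sequences."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0.0 if ref[i - 1] == hyp[j - 1] else sub_cost)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[n][m]

# Isolated word recognition: pick the lexicon entry whose code sequence is
# closest to the decoded intermediate code sequence (hypothetical entries).
lexicon = {"tokyo": ["t", "o", "k", "y", "o"], "kyoto": ["k", "y", "o", "t", "o"]}
decoded = ["t", "o", "k", "o", "o"]  # stand-in decoder output
best = min(lexicon, key=lambda w: seq_distance(lexicon[w], decoded))
print(best)  # "tokyo"
```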
Citations: 1
Enhanced ASR by acoustic feature filtering
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-194
C. Wellekens
Several recent results demonstrate improved recognition scores when FIR filtering is applied to the trajectories of feature vectors. This paper presents a new approach in which the filter characteristics are trained together with the HMM parameters, resulting in recognition improvements in first tests. Reestimation formulas are derived for the cut-off frequencies of ideal LP filters, as well as for the impulse response coefficients of a general FIR LP filter.
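As a rough illustration of filtering feature-vector trajectories, the sketch below applies a fixed FIR low-pass filter along the time axis of a feature matrix. The 5-tap moving average is a placeholder; in the paper the filter characteristics are trained jointly with the HMM parameters:

```python
import numpy as np

# Sketch: apply an FIR low-pass filter along the time axis of a feature
# matrix (frames x coefficients). The impulse response below is a fixed
# moving average; the paper instead reestimates the coefficients jointly
# with the HMM parameters.

def filter_trajectories(features, h):
    """Convolve each feature dimension with impulse response h (same length out)."""
    return np.stack(
        [np.convolve(features[:, k], h, mode="same") for k in range(features.shape[1])],
        axis=1,
    )

T, D = 200, 13                      # frames, cepstral coefficients
feats = np.random.randn(T, D)       # stand-in for an MFCC trajectory
h = np.ones(5) / 5.0                # hypothetical 5-tap low-pass FIR
smoothed = filter_trajectories(feats, h)
print(smoothed.shape)               # (200, 13)
```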
Citations: 2
Restoration of hyperbaric speech by correction of the formants and the pitch
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-519
Laure Charonnat, M. Guitton, J. Crestel, G. Allée
This paper describes a hyperbaric speech processing algorithm combining a restoration of the formant positions with a correction of the pitch. The pitch is corrected using a time-scale modification algorithm combined with an oversampling module. This operation not only shifts the fundamental frequency but also shifts the other frequencies of the signal. This shift, as well as the formant shift due to the hyperbaric environment, is corrected by the formant restoration module, based on the linear speech production model.
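The pitch-correction idea can be pictured as resampling, which scales every frequency of the signal by a factor alpha and changes the duration by 1/alpha, paired with time-scale modification to restore the original duration. A toy sketch of the resampling half only, with a hypothetical correction factor:

```python
import numpy as np

# Toy sketch: frequency scaling by resampling. Reading the input at a
# different rate multiplies every frequency (pitch and formants alike) by
# alpha and changes the duration by 1/alpha; the paper pairs this with
# time-scale modification so duration is preserved while frequencies shift.

def resample_by_factor(x, alpha):
    """Resample signal x so all of its frequencies are scaled by alpha."""
    n_out = int(len(x) / alpha)
    t_out = np.arange(n_out) * alpha          # read positions in the input
    return np.interp(t_out, np.arange(len(x)), x)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)               # 200 Hz tone (stand-in for voiced speech)
alpha = 0.8                                   # hypothetical downward correction
y = resample_by_factor(x, alpha)              # now a 160 Hz tone, 1.25x longer
print(len(x), len(y))                         # 16000 20000
```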
Citations: 0
Speech recognition via phonetically featured syllables
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-531
Simon King, T. A. Stephenson, S. Isard, P. Taylor, Alex Strachan
Speech can be naturally described by phonetic features, such as a set of acoustic phonetic features or a set of articulatory features. This thesis establishes the effectiveness of using phonetic features in phoneme recognition by comparing a recogniser based on them to a recogniser using an established parametrisation as a baseline. The usefulness of phonetic features serves as the foundation for the subsequent modelling of syllables. Syllables are subject to fewer of the context-sensitivity effects that hamper phone-based speech recognition. I investigate the different questions involved in creating syllable models. After training a feature-based syllable recogniser, I compare the feature-based syllables against a baseline. To conclude, the feature-based syllable models are compared against the baseline phoneme models in word recognition. With the resultant feature-syllable models performing well in word recognition, the feature-syllables show their future potential for large-vocabulary automatic speech recognition.
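The idea of describing speech by phonetic features can be made concrete as a mapping from phones to feature bundles. A minimal sketch with a hypothetical, drastically reduced feature inventory (a real system would estimate a full acoustic-phonetic or articulatory feature set from the signal):

```python
# Sketch: phones described as bundles of phonetic features. The inventory
# below is a hypothetical toy subset, not the thesis's actual feature set.

FEATURES = ("voiced", "nasal", "labial", "high")

PHONE_FEATURES = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "i": (1, 0, 0, 1),
    "a": (1, 0, 0, 0),
}

def feature_distance(p1, p2):
    """Hamming distance between the feature bundles of two phones."""
    f1, f2 = PHONE_FEATURES[p1], PHONE_FEATURES[p2]
    return sum(a != b for a, b in zip(f1, f2))

# /p/ and /b/ differ only in voicing, so they are closer than /p/ and /i/.
print(feature_distance("p", "b"), feature_distance("p", "i"))  # 1 3
```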
Citations: 66
On the use of F0 features in automatic segmentation for speech synthesis
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-56
Takashi Saito
This paper focuses on a method for automatically dividing speech utterances into phonemic segments, which are used for constructing synthesis unit inventories for speech synthesis. Here, we propose a new segmentation parameter called "dynamics of fundamental frequency" (DF0). In the fine structure of F0 contours, there exist phonemic events observed as local dips at phonemic transition regions, especially around voiced consonants. We apply this observation about F0 contours to a speech segmentation method. The DF0 segmentation parameter is used in the final stage of the segmentation procedure to refine the phonemic boundaries obtained roughly by DP alignment. We conduct experiments on the proposed automatic segmentation with a speech database prepared for unit inventory construction, and compare the obtained boundaries with those of manual segmentation to show the effectiveness of the proposed method. We also discuss the effects of the boundary refinement on the synthesized speech. Finally, we summarize the results obtained.
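One plausible reading of a "dynamics of F0" parameter is the local regression delta of the F0 contour, whose large excursions flag transition regions. A sketch under that assumption; the window length and threshold are hypothetical, and the paper's exact DF0 definition may differ:

```python
import numpy as np

# Sketch: a "dynamics of F0" style parameter as the local regression delta
# of the F0 contour, with rapid F0 movement marking candidate phonemic
# boundaries. Window length and threshold are hypothetical choices.

def f0_dynamics(f0, win=2):
    """Regression-style delta of the F0 contour over +/- win frames."""
    idx = np.arange(-win, win + 1)
    denom = np.sum(idx ** 2)
    padded = np.pad(f0, win, mode="edge")
    return np.array(
        [np.dot(idx, padded[i : i + 2 * win + 1]) / denom for i in range(len(f0))]
    )

# Synthetic contour with a local dip, as described around voiced consonants.
f0 = np.concatenate([np.full(20, 120.0), np.full(5, 100.0), np.full(20, 118.0)])
df0 = f0_dynamics(f0)
boundaries = np.where(np.abs(df0) > 2.0)[0]   # frames with rapid F0 movement
print(boundaries)
```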
Citations: 13
Combination of confidence measures in isolated word recognition
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-815
Hans J. G. A. Dolfing, A. Wendemuth
In the context of command-and-control applications, we exploit confidence measures in order to classify single-word utterances into two categories: utterances within the vocabulary which are recognized correctly, and other utterances, namely out-of-vocabulary (OOV) or misrecognized utterances. To this end, we investigate the classification error rate (CER) of several classes of confidence measures and transformations. In particular, we employed data-independent and data-dependent measures. The transformations we investigated include mapping to single confidence measures, LDA-transformed measures, and other linear combinations of these measures. These combinations are computed by means of neural networks trained with Bayes-optimal and with Gardner-Derrida-optimal criteria. Compared to a recognition system without confidence measures, the selection of (various combinations of) confidence measures and of suitable neural network architectures and training methods consistently improves the CER. Additionally, we found that a linear perceptron generalizes better than a non-linear backpropagation network.
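The combination step, a linear perceptron mapping several confidence measures to an accept/reject decision, can be sketched with a classic perceptron learning loop. The features and data below are synthetic; the paper additionally trains with Bayes-optimal and Gardner-Derrida-optimal criteria:

```python
import numpy as np

# Sketch: linear perceptron combining several confidence measures into a
# single accept/reject decision. Data and feature meanings are made up.

rng = np.random.default_rng(0)
n = 400
# Columns: e.g. acoustic score ratio, n-best gap, duration score (hypothetical).
X = rng.normal(size=(n, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))   # +1 accept, -1 reject

w = np.zeros(3)
b = 0.0
for _ in range(20):                      # perceptron epochs
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:       # misclassified -> update
            w += yi * xi
            b += yi

acc = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy: {acc:.3f}")
```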
Citations: 23
A study of tones and tempo in continuous Mandarin digit strings and their application in telephone quality speech recognition
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-140
Chao Wang, S. Seneff
Prosodic cues (namely, fundamental frequency, energy and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (F0) also plays a critical role in characterizing tone, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F0 extraction algorithm is developed, and parameters based on an orthonormal decomposition of the F0 contour are extracted for tone recognition. Context dependency is expressed by "tri-tone" models clustered into broader classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% with tone models, and 39.2% with duration and tone models combined.
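A sketch of tone features as coefficients of an orthonormal decomposition of an F0 segment. Using orthonormalized discrete polynomials up to order 3 is an assumption here; the paper's exact basis may differ:

```python
import numpy as np

# Sketch: tone features as coefficients of an orthonormal decomposition of
# an F0 segment. The basis here is orthonormalized discrete polynomials up
# to order 3 (an assumed choice, not necessarily the paper's basis).

def contour_coeffs(f0_segment, order=3):
    """Project an F0 segment onto an orthonormal polynomial basis."""
    n = len(f0_segment)
    x = np.linspace(-1.0, 1.0, n)
    V = np.vander(x, order + 1, increasing=True)   # columns 1, x, x^2, x^3
    Q, _ = np.linalg.qr(V)                          # orthonormal columns
    return Q.T @ f0_segment                         # basis coefficients

rising = np.linspace(100.0, 140.0, 30)              # stand-in for a rising tone
falling = np.linspace(140.0, 100.0, 30)             # stand-in for a falling tone
# The linear-trend coefficients have opposite signs for the two contours.
print(contour_coeffs(rising)[1], contour_coeffs(falling)[1])
```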
Citations: 33
Voice onset time patterns in 7-, 9- and 11-year old children
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-771
S. Whiteside, Jeni Marshall
Voice onset time (VOT) is a key temporal feature in spoken language. There is some evidence to suggest that there are sex differences in VOT patterns. These sex differences could be attributed to sexual dimorphism of the vocal apparatus. There is also some evidence to suggest that phonetic sex differences could be attributed to learned stylistic and linguistic factors. This study reports on an investigation into the VOT patterns for /p b t d/ in a group of thirty children aged 7 (n=10), 9 (n=10) and 11 (n=10) years, with equal numbers of girls (n=5) and boys (n=5) in each age group. Age and sex differences were examined in the VOT data. Age, sex and age-by-sex interactions were found. The results are presented and discussed.
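Measured VOT is simply the interval from the stop burst release to the onset of voicing. A minimal sketch given hand-labelled event times; the sample indices below are hypothetical:

```python
# Sketch: VOT as the interval between the stop burst release and the onset
# of voicing. Event times here are hand-labelled sample indices; automatic
# detection (e.g. energy/periodicity based) is a separate problem.

def vot_ms(burst_sample, voicing_sample, fs=16000):
    """Voice onset time in milliseconds from event sample indices."""
    return 1000.0 * (voicing_sample - burst_sample) / fs

# Hypothetical tokens: a voiceless stop like /p t/ with a long lag, and a
# voiced stop like /b d/ with a short lag.
print(vot_ms(3200, 4480))   # 80.0 ms
print(vot_ms(3200, 3360))   # 10.0 ms
```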
Citations: 4
Cultural similarities and differences in the recognition of audio-visual speech stimuli
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-268
S. Shigeno
Cultural similarities and differences in the recognition of emotion were compared between Japanese and North American subjects. Seven native Japanese and five native North American (four American and one Canadian) subjects participated in the experiments. The materials were five meaningful words or short sentences in Japanese and American English. Japanese and American actors produced vocal and facial expressions in order to transmit six basic emotions: happiness, surprise, anger, disgust, fear, and sadness. Three presentation conditions were used: auditory, visual, and audio-visual. The audio-visual stimuli were made by dubbing the auditory stimuli onto the visual stimuli. The results show that (1) subjects can more easily recognize the vocal expression of a speaker who belongs to their own culture, (2) Japanese subjects are not good at recognizing "fear" in both the auditory-alone and visual-alone conditions, and (3) both Japanese and American subjects identify the audio-visually incongruent stimuli more often by the visual label than by the auditory label. These results suggest that it is difficult to identify the emotion of a speaker from a different culture and that people predominantly use visual information to identify emotion.
Citations: 18
Source controlled variable bit-rate speech coder based on waveform interpolation
Pub Date: 1998-11-30 DOI: 10.21437/ICSLP.1998-395
F. Plante, B. Cheetham, D. Marston, P. A. Barrett
This paper describes a source-controlled variable bit-rate (SC-VBR) speech coder based on the concept of prototype waveform interpolation. The coder uses a four-mode classification: silence, voiced, unvoiced and transition. These modes are detected after the speech has been decomposed into slowly evolving (SEW) and rapidly evolving (REW) waveforms. A voice activity detection (VAD), the relative level of SEW and REW, and the cross-correlation coefficient between characteristic waveform segments are used to make the classification. The encoding of the SEW components is improved using a gender adaptation. In tests using conversational speech, the SC-VBR allows a compression factor of around 3. The VBR coder was evaluated against a fixed-rate 4.6 kbit/s PWI coder for clean speech and noisy speech, and was found to perform better for male speech and for noisy speech.
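The SEW/REW split can be pictured as low-pass versus high-pass filtering of the characteristic-waveform surface along the evolution axis. A toy sketch with a moving-average low-pass; the tap count and dimensions are hypothetical stand-ins for the coder's actual filter:

```python
import numpy as np

# Toy sketch: decompose a sequence of characteristic waveforms (one pitch
# cycle per row) into slowly evolving (SEW) and rapidly evolving (REW)
# parts by low-pass filtering along the evolution (row) axis. The 21-tap
# moving average is a hypothetical stand-in for the coder's actual filter.

def sew_rew(cw, taps=21):
    h = np.ones(taps) / taps
    sew = np.apply_along_axis(lambda c: np.convolve(c, h, mode="same"), 0, cw)
    rew = cw - sew                      # REW is the residual
    return sew, rew

frames, cycle_len = 100, 80
cw = np.random.randn(frames, cycle_len)   # stand-in characteristic waveforms
sew, rew = sew_rew(cw)
print(sew.shape, rew.shape)               # (100, 80) (100, 80)
```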
Citations: 4