5th International Conference on Spoken Language Processing (ICSLP 1998): Latest Publications
A novel method of formant analysis and glottal inverse filtering
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-543
Steve Pearson
This paper presents a class of methods for automatically extracting formant parameters from speech. The methods rely on an iterative optimization algorithm. Formant parameter data derived with these methods were found to be less prone to discontinuity errors than those from conventional methods. Experiments also demonstrated that these methods achieve better accuracy in formant estimation than LPC, especially for the first formant. In some cases an analytic (non-iterative) solution has been derived, making real-time applications feasible. The main target we have been pursuing is text-to-speech (TTS) conversion. These methods are being used to analyze a concatenation database automatically, without the need for a tuning phase to fix errors. In addition, they are instrumental in realizing high-quality pitch tracking and pitch-epoch marking.
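The paper does not spell out its iterative optimizer, but the LPC baseline it is compared against can be sketched in a few lines: estimate autoregressive coefficients by the autocorrelation method and read candidate formants off the angles of the complex poles. The one-resonance toy signal and all names below are illustrative, not from the paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))          # prediction polynomial A(z)

def formants(a, fs):
    """Candidate formants from the angles of the upper-half-plane poles."""
    poles = [p for p in np.roots(a) if p.imag > 0]
    return sorted(np.angle(p) * fs / (2 * np.pi) for p in poles)

fs = 8000.0
theta = 2 * np.pi * 500.0 / fs                  # one resonance at 500 Hz
a_true = [1.0, -2 * 0.98 * np.cos(theta), 0.98 ** 2]
x = lfilter([1.0], a_true, np.r_[1.0, np.zeros(511)])   # impulse response
est = formants(lpc(x, 2), fs)
```

On this synthetic signal the single estimated formant lands close to the 500 Hz resonance; real speech would need a higher LPC order and pole-bandwidth pruning.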
Cited by: 1
High-speed speaker adaptation using phoneme dependent tree-structured speaker clustering
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-745
Motoyuki Suzuki, T. Abe, H. Mori, S. Makino, H. Aso
Tree-structured speaker clustering was proposed as a high-speed speaker adaptation method: it selects the model most similar to a target speaker. However, it does not consider speaker differences that depend on phoneme class. In this paper, we propose a speaker adaptation method based on speaker clustering that takes phoneme-class-dependent speaker differences into account. Experimental results showed that the new method gives better performance than the original one. Furthermore, we propose an improved method that uses the tree structure of a similar phoneme as a substitute for phonemes that do not appear in the adaptation data. Experimental results show that this improved method performs better than the method proposed earlier.
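A minimal sketch of the tree-descent idea (the node layout, cluster means, and distance criterion here are invented for illustration, not taken from the paper): starting at the root, compare the adaptation data against each child's model and recurse into the closer one until a leaf cluster is reached.

```python
import numpy as np

def select_cluster(node, data):
    """Descend the speaker tree: at each node pick the child whose mean is
    closer (mean squared distance) to the adaptation data."""
    mean, left, right = node
    if left is None:                 # leaf: this cluster's model is chosen
        return mean
    d_left = np.mean((data - left[0]) ** 2)
    d_right = np.mean((data - right[0]) ** 2)
    return select_cluster(left if d_left < d_right else right, data)

# Toy two-level tree: nodes are (mean_vector, left_child, right_child).
leaf_a = (np.array([0.0, 0.0]), None, None)
leaf_b = (np.array([4.0, 4.0]), None, None)
leaf_c = (np.array([10.0, 10.0]), None, None)
leaf_d = (np.array([14.0, 14.0]), None, None)
root = (np.array([7.0, 7.0]),
        (np.array([2.0, 2.0]), leaf_a, leaf_b),
        (np.array([12.0, 12.0]), leaf_c, leaf_d))

adapt = np.array([[3.6, 4.2], [4.1, 3.9]])   # data resembling leaf_b's speakers
chosen = select_cluster(root, adapt)
```

Because only one path of the tree is evaluated, selection cost grows with tree depth rather than with the number of leaf clusters, which is what makes the scheme "high-speed".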
Cited by: 2
Toward on-line learning of Chinese continuous speech recognition system
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-748
Rong Zheng, Zuoying Wang
In this paper, we present an integrated on-line learning scheme that combines state-of-the-art speaker normalization and adaptation techniques to improve the performance of our large-vocabulary Chinese continuous speech recognition (CSR) system. We use VTLN to remove inter-speaker variation in both the training and testing stages. To facilitate dynamic determination of the transformation scale, we devised a tree-based transformation method as the key component of our incremental adaptation. Experiments show that the combined on-line learning scheme (incremental and unsupervised), which gives approximately a 22~26% error reduction, proved better than either method used separately.
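VTLN itself is standard; a common piecewise-linear warping function (one plausible variant, not necessarily the exact one used in this system) scales frequencies by a speaker-specific factor up to a breakpoint, then maps the remaining band linearly so the Nyquist edge stays fixed. The breakpoint and band edge below are assumed values.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq=4000.0, f_break=3400.0):
    """Piecewise-linear VTLN warp: scale by alpha up to f_break, then map
    the remaining band linearly so that f_nyq maps to itself."""
    freqs = np.asarray(freqs, dtype=float)
    lo = freqs <= f_break
    out = np.empty_like(freqs)
    out[lo] = alpha * freqs[lo]
    out[~lo] = alpha * f_break + (f_nyq - alpha * f_break) * (
        (freqs[~lo] - f_break) / (f_nyq - f_break))
    return out

warped = vtln_warp([500.0, 1000.0, 3700.0, 4000.0], alpha=1.1)
```

With `alpha = 1.0` the warp is the identity; values above or below 1 stretch or compress the spectrum to normalize vocal-tract length across speakers.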
Cited by: 0
Unsupervised training of a speech recognizer using TV broadcasts
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-632
T. Kemp, A. Waibel
Current speech recognition systems require large amounts of transcribed data for parameter estimation. Transcription, however, is tedious and expensive. In this work we describe experiments aimed at training a speech recognizer without transcriptions. The experiments were carried out with TV newscasts, which were recorded using a satellite receiver and simple MPEG coding hardware. The newscasts were automatically segmented into segments of similar acoustic background condition. This material is inexpensive and can be made available in large quantities, but no transcriptions are available for it. We develop a training scheme in which a recognizer is bootstrapped using very little transcribed data and improved using new, untranscribed speech. We show that it is necessary to use a confidence measure to judge the recognizer's initial transcriptions before using them. Higher improvements can be achieved if the number of parameters in the system is increased as more data becomes available. We show that the beneficial effect of unsupervised training is not compensated for by MLLR adaptation on the hypothesis. In a final experiment, the effect of untranscribed data is compared with the effect of transcribed speech. Using the described methods, we found that the untranscribed data gives roughly one third of the improvement of the transcribed material.
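The confidence-gating step can be sketched as a simple filter over recognizer hypotheses: only utterances whose confidence clears a threshold are fed back as training labels. The data, field layout, and threshold below are made up for illustration.

```python
def filter_hypotheses(hyps, threshold=0.8):
    """Keep only recognizer hypotheses confident enough to train on.
    hyps: iterable of (utterance_id, hypothesis_text, confidence)."""
    return [(uid, text) for uid, text, conf in hyps if conf >= threshold]

hyps = [
    ("utt1", "good evening", 0.95),
    ("utt2", "the whether today", 0.35),   # low confidence: dropped
    ("utt3", "news at nine", 0.88),
]
train_labels = filter_hypotheses(hyps)
```

The threshold trades off training-set size against label noise; the paper's finding is that some such gate is necessary before the automatic transcriptions are usable.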
Cited by: 37
SIVHA, visual speech synthesis system
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-777
Y. Blanco, Maria Cuellar, A. Villanueva, Fernando Lacunza, R. Cabeza, B. Marcotegui
This paper presents SIVHA, a high-quality Spanish speech synthesis system for severely disabled persons, controlled by their eye movements. The system follows the patient's eye gaze across the screen and constructs a text from the selected words. When the user decides that the message is complete, its synthesis can be ordered. The system is divided into three modules: the first determines the point on the screen the user is looking at, the second is an interface for constructing sentences, and the third is the synthesis itself.
Cited by: 3
Convergence of fundamental frequencies in conversation: if it happens, does it matter?
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-111
Belinda Collins
This paper explores the existence and nature of accommodation processes within conversation, particularly convergence of the fundamental frequency (Fo) of conversational participants over time. The study raises a number of issues related to methodologies for analysing interactional (typically conversational) data. Most important is the applicability of statistical sampling methods that are independent of the interactional events occurring within the talk. It concludes with suggestions for a methodology that examines long-term acoustic phenomena (e.g. long-term fundamental frequency) and relates events at the micro-acoustic level to interactional events within a conversation.
Cited by: 9
Multi-Span statistical language modeling for large vocabulary speech recognition
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-640
J. Bellegarda
The goal of multi-span language modeling is to integrate the various constraints, both local and global, that are present in the language. In this paper, local constraints are captured via the usual n-gram approach, while global constraints are taken into account through latent semantic analysis. An integrative formulation is derived for the combination of these two paradigms, resulting in an entirely data-driven, multi-span framework for large-vocabulary speech recognition. Because of the inherent complementarity of the two types of constraints, the performance of the integrated language model compares favorably with the corresponding n-gram performance. Both perplexity and average word error rate figures are reported and discussed.
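As a toy illustration of the combination (a sketch, not Bellegarda's exact integrative formulation), the code below builds a rank-2 LSA space from a tiny term-document matrix and linearly interpolates an externally supplied n-gram probability with a similarity-based LSA probability. The corpus, interpolation weight, and softmax normalization are all assumptions.

```python
import numpy as np

docs = ["stocks fell sharply", "stocks rose sharply", "the cat sat"]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term-document counts, reduced to a rank-2 LSA space.
C = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        C[idx[w], j] += 1.0
U, S, _ = np.linalg.svd(C, full_matrices=False)
W = U[:, :2] * S[:2]          # word vectors

def p_lsa(word, history):
    """Global (semantic) probability: softmax over cosine similarity
    between each word vector and the mean history vector."""
    h = np.mean([W[idx[w]] for w in history], axis=0)
    sims = W @ h / (np.linalg.norm(W, axis=1) * np.linalg.norm(h) + 1e-9)
    e = np.exp(sims)
    return e[idx[word]] / e.sum()

def p_multispan(word, history, p_ngram, lam=0.5):
    """Interpolate the local n-gram estimate with the global LSA estimate."""
    return lam * p_ngram + (1.0 - lam) * p_lsa(word, history)
```

The point of the second component: given the history "stocks", a semantically related word like "sharply" scores higher in the LSA space than an unrelated word like "cat", even across an arbitrarily long span.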
Cited by: 13
A computational algorithm for F0 contour generation in Korean developed with prosodically labeled databases using k-toBI system
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-34
Yong-Ju Lee, Sook-Hyang Lee, Jong-Jin Kim, Hyun-Ju Ko, Young-Il Kim, Sanghun Kim, Jung-Cheol Lee
This study describes an algorithm for F0 contour generation for Korean sentences and its evaluation results. 400 K-ToBI-labeled utterances were used, read by one male and one female announcer. The F0 contour generation system uses two classification trees to predict K-ToBI labels for input text and 11 regression trees to predict F0 values for those labels. Evaluation showed 77.2% prediction accuracy for IP boundaries and 72.0% for AP boundaries. Voicing and segment-duration information was left unchanged for F0 contour generation and its evaluation. In an F0 generation experiment using labeling information from the original speech data, the results showed an RMS error of 23.5 Hz and a correlation coefficient of 0.55.
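The paper's regression trees are CART-style; the single best-split search at their core can be sketched as an exhaustive search over (feature, threshold) pairs minimizing squared error. The per-syllable features, values, and F0 targets below are invented for illustration.

```python
import numpy as np

def best_split(X, y):
    """Exhaustive search for the single (feature, threshold) split that
    minimizes total squared error -- the step CART repeats recursively."""
    best = (np.inf, None, None)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f])[:-1]:
            left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
            err = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if err < best[0]:
                best = (err, f, thr)
    return best[1], best[2]

# Hypothetical per-syllable features: column 0 = accented (0/1),
# column 1 = position in phrase (0-1). Target: F0 in Hz.
X = np.array([[1, 0.0], [0, 0.2], [1, 0.4], [0, 0.6], [0, 0.8], [1, 0.5]])
y = np.array([220.0, 180.0, 210.0, 170.0, 160.0, 205.0])
feat, thr = best_split(X, y)
```

On this toy data the search picks the accent flag as the most predictive split, mirroring how the paper's trees route syllables to F0 predictions by prosodic-label features.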
Cited by: 4
The effect of fundamental frequency on Mandarin speech recognition
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-761
Sharlene A. Liu, S. Doyle, Allen Morris, Farzad Ehsani
We study the effects of modeling tone in Mandarin speech recognition. Including the neutral tone, Mandarin has 5 tones, and these tones are syllable-level phenomena. A direct acoustic manifestation of tone is the fundamental frequency (f0). We report on the effect of f0 on the acoustic recognition accuracy of a Mandarin recognizer. In particular, we put f0, its first derivative (f0'), and its second derivative (f0'') in separate streams of the feature vector. Stream weights are adjusted to investigate the individual effects of f0, f0', and f0'' on recognition accuracy. Our results show that incorporating the f0 feature itself negatively impacted accuracy, whereas f0' increased accuracy and f0'' seemed to have no effect.
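The derivative streams can be sketched with a central-difference delta; the paper does not give its exact regression window, so the window width here is an assumption, and the rising contour is synthetic.

```python
import numpy as np

def deltas(x, width=1):
    """Central-difference delta over a window of +/- width frames,
    with edge padding so the output length matches the input."""
    padded = np.pad(x, width, mode="edge")
    return (padded[2 * width:] - padded[:-2 * width]) / (2.0 * width)

f0 = np.array([100.0, 110.0, 120.0, 130.0, 140.0])  # rising pitch contour
d1 = deltas(f0)                    # f0'  : slope of the contour
d2 = deltas(d1)                    # f0'' : curvature
streams = np.stack([f0, d1, d2])   # three feature streams, one row each
```

Keeping the three quantities in separate streams, as the abstract describes, is what allows per-stream weights to switch each one's contribution on or off during the experiments.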
Cited by: 10
ToBI accent type recognition
Pub Date : 1998-11-30 DOI: 10.21437/ICSLP.1998-126
Arman Maghbouleh
This paper describes work in progress for recognizing a subset of ToBI intonation labels (H*, L+H*, L*, !H*, L+!H*, no accent). Initially, duration characteristics are used to classify syllables as accented or not. The accented syllables are then subclassified based on fundamental frequency, F0, values. Potential F0 intonation gestures are schematized by connected line segments within a window around a given syllable. The schematizations are found using spline-basis linear regression. The regression weights on F0 points are varied in order to discount segmental effects and F0 detection errors. Parameters based on the line segments are then used to perform the subclassification. This paper presents new results in recognizing L*, L+H*, and L+!H* accents. In addition, the models presented here perform comparably (80% overall, and 74% accent type recognition) to models which do not distinguish bitonal accents.
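The connected-line-segment schematization can be sketched as ordinary least squares over a truncated-linear (hinge) basis, which guarantees continuity at the knots. The per-point weighting the paper uses to discount segmental effects and F0 errors is omitted here, and the rise-fall contour is synthetic.

```python
import numpy as np

def fit_connected_segments(t, f0, knots):
    """Unweighted least-squares fit of a continuous piecewise-linear
    function: intercept + slope + one hinge term per interior knot."""
    B = np.column_stack([np.ones_like(t), t] +
                        [np.maximum(t - k, 0.0) for k in knots])
    coef, *_ = np.linalg.lstsq(B, f0, rcond=None)
    return B @ coef

t = np.linspace(0.0, 1.0, 21)
# Synthetic rise-fall F0 gesture peaking at t = 0.5.
f0 = np.where(t <= 0.5, 100.0 + 80.0 * t, 140.0 - 60.0 * (t - 0.5))
fit = fit_connected_segments(t, f0, knots=[0.5])
```

Parameters of the fitted segments (slopes, peak height, peak position) are the kind of features the paper then feeds into accent-type subclassification.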
Cited by: 9