2010 7th International Symposium on Chinese Spoken Language Processing: Latest Publications

Improving the informativeness of verbose queries using summarization techniques for spoken document retrieval
Pub Date: 2010-12-01 DOI: 10.1109/ISCSLP.2010.5684847
Shih-Hsiang Lin, Berlin Chen, E. Jan
Query-by-example information retrieval aims to help users find relevant documents accurately when they provide specific query exemplars describing what they are interested in. The query exemplars are usually long, taking the form of a partial or even a full document. However, they may contain extraneous terms (or off-topic information) that have a negative impact on retrieval performance. In this paper, we propose to integrate extractive summarization techniques into the retrieval process so as to improve the informativeness of a verbose query exemplar. The original query exemplar is first divided into several sub-queries, or sentences. To construct a new, concise query exemplar, summarization techniques are then employed to select a salient subset of the sub-queries. Experiments on the TDT Chinese collection show that the proposed approach is effective and promising.
Citations: 2
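The abstract does not name the summarization technique used to pick the salient sub-queries; a minimal sketch of the general idea, using centroid-based ranking as an illustrative stand-in (the function name `select_salient_subqueries` and the top-k cutoff are assumptions, not the paper's method), might look like this:

```python
import math
from collections import Counter

def select_salient_subqueries(query_text, top_k=3):
    """Split a verbose query exemplar into sentences, rank them by
    similarity to the exemplar's centroid, and keep the top-k."""
    sentences = [s.strip() for s in query_text.split('.') if s.strip()]
    bags = [Counter(s.lower().split()) for s in sentences]

    centroid = Counter()              # term-frequency centroid of the query
    for bag in bags:
        centroid.update(bag)

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Off-topic sentences sit far from the centroid and are dropped.
    ranked = sorted(zip(sentences, bags), key=lambda sb: -cosine(sb[1], centroid))
    return ' '.join(s for s, _ in ranked[:top_k])
```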
Frame selection of interview channel for NIST speaker recognition evaluation
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684886
Hanwu Sun, B. Ma, Haizhou Li
In this paper, we study a front-end frame selection approach for the interview-channel speaker recognition system. This new approach keeps high-quality speech frames and removes noisy and irrelevant ones for speaker modeling. For robust voice activity detection (VAD) across the different types of microphones located in the interview room, we adopt a spectral subtraction algorithm for noise reduction. An energy-based frame selection algorithm is first applied to indicate speech activity at the frame level. To overcome the summed-channel effects in the interview condition, we study how to effectively extract the relevant speaker's speech frames based on the VAD tags and ASR transcript tags provided by NIST. An eigenchannel-based GMM-SVM speaker recognition system is used to evaluate the proposed method. The experiments are conducted on the NIST 2008 and NIST 2010 Speaker Recognition Evaluation interview-interview conditions. The results demonstrate that the approach provides an efficient way to select high-quality speech frames and the relevant speaker's voice in the interview environment for speaker recognition.
Citations: 7
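A minimal sketch of the front end described above, combining spectral subtraction with energy-based frame selection; the frame sizes, noise-estimation window, and threshold heuristic are assumptions, not the paper's settings:

```python
import numpy as np

def select_frames(signal, sr, frame_len=0.025, hop=0.010, noise_sec=0.3, floor=0.002):
    """Spectral subtraction followed by energy-based frame selection."""
    n, h = int(frame_len * sr), int(hop * sr)
    frames = np.stack([signal[i:i + n] * np.hanning(n)
                       for i in range(0, len(signal) - n, h)])
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # Noise spectrum estimated from the (assumed speech-free) leading frames.
    noise_mag = mag[:max(1, int(noise_sec * sr / h))].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor

    # Keep frames whose denoised energy clears a global threshold.
    energy = (clean_mag ** 2).sum(axis=1)
    keep = energy > 0.5 * energy.mean()                    # heuristic threshold
    return frames[keep], keep
```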
Sentence Decomplexification using holistic aspect-based clause detection for long sentence understanding
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684897
Chao-Hong Liu, Chung-Hsien Wu
Long sentences pose significant challenges for many natural language processing (NLP) tasks such as machine translation and language understanding, because it is still very difficult for state-of-the-art parsers to analyze them. In this paper, we identify the Sentence Decomplexification (SD) problem and propose models for SD to help understand long sentences. Given a complex sentence, SD seeks to return two sentences, one the main clause and the other the subordinate clause. Together, these two clauses include all the information of the original sentence. Since identifying subordinate clauses is a more difficult task than traditional chunking, we also propose a holistic aspect-based detection (HAD) method for clause detection to reduce the overhead of SD sentence-similarity computation. We provide the formalisms of SD and show that HAD makes this task more efficient. The SD system was used to improve the performance of a long-sentence understanding system. Experimental results show that SD achieves 78.7% accuracy using the Chinese Gigaword Corpus as the sentence-comparison corpus. For long-sentence understanding, the proposed method improves accuracy from 70.7% to 75.5% compared with not using SD.
Citations: 3
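The paper's clause detector is learned from Chinese data; the toy below only illustrates SD's input/output contract, using a fixed list of English subordinator markers as a hypothetical stand-in for the learned detector:

```python
# Hypothetical marker list; the paper's detector is learned from Chinese
# data, not rule-based.
SUBORDINATORS = ('because', 'although', 'while', 'when', 'since', 'if')

def decomplexify(sentence):
    """Return (main_clause, subordinate_clause), or (sentence, None)
    when no subordinate clause is detected."""
    lowered = sentence.lower()
    for marker in SUBORDINATORS:
        if lowered.startswith(marker + ' '):
            # Fronted subordinate clause, e.g. "Although X, Y."
            comma = sentence.find(',')
            if comma > 0:
                return sentence[comma + 1:].strip(), sentence[:comma].strip()
        idx = lowered.find(' ' + marker + ' ')
        if idx >= 0:
            return sentence[:idx].strip().rstrip(','), sentence[idx + 1:].strip()
    return sentence, None

print(decomplexify("The parser failed, because the sentence was too long."))
# -> ('The parser failed', 'because the sentence was too long.')
```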
Perception and analysis of linearly approximated F0 contours in Cantonese speech
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684486
Yujia Li, Tan Lee
Our previous study revealed that F0 variations in Cantonese speech can be sufficiently represented by linear approximations of the observed F0 contours. This was observed with test materials that have relatively limited lexical and segmental variation. In the present work, the generalizability of linear approximation is examined with a large corpus of polysyllabic Cantonese words. Perceptual results clearly validate the effectiveness of linearly approximated F0 contours. Subsequently, an analysis of the generated linear approximations is carried out, and the properties of linear F0 movements in continuous Cantonese speech are learned, particularly in association with different tones. Lastly, two objective evaluations of the modified F0 contours, RMS error and contour correlation, are compared with the true perceptual performance. Neither of these objective measurements is found to give a reliable prediction of perceived speech naturalness.
Citations: 0
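A minimal sketch of linearly approximating one F0 segment and computing the two objective measures named above, RMS error and contour correlation (the sample contour values are invented for illustration):

```python
import numpy as np

def linearize_f0(f0, times):
    """Fit a least-squares line to one F0 segment and compute the two
    objective measures compared in the paper: RMS error and contour
    correlation."""
    slope, intercept = np.polyfit(times, f0, 1)
    approx = slope * times + intercept
    rms = np.sqrt(np.mean((f0 - approx) ** 2))
    corr = np.corrcoef(f0, approx)[0, 1]
    return approx, rms, corr

# Invented example: a noisy rising contour over a 200 ms syllable.
t = np.linspace(0.0, 0.2, 20)
f0 = 180 + 120 * t + np.random.normal(0, 1.5, t.size)   # Hz
_, rms, corr = linearize_f0(f0, t)
```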
A speedup method for the separation of speech signals in frequency domain
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684892
Shih-Hsun Chen, Hsiao-Chuan Wang
Independent component analysis (ICA) is a commonly used method for finding the demixing matrix in blind source separation (BSS). For speech signals, the BSS problem must be solved under a convolutive mixing model, i.e., the ICA technique is extended to the frequency domain, where cross-spectral density matrices are computed for each frequency bin in place of time-domain covariance matrices. The joint approximate diagonalization (JADIAG) algorithm proposed by D. T. Pham has proved effective in dealing with the convolutive mixing problem. This paper presents a method to speed up the JADIAG computation in two phases. First, the critical-band property of the human auditory system is exploited so that a selected demixing matrix is shared within each critical band, reducing the number of demixing matrices. Second, an efficient estimation of the transformation matrix is proposed so that fewer iterations are needed to find the demixing matrices in the JADIAG algorithm. Experiments show that about 71% of the computation time can be saved.
Citations: 0
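A sketch of the first speedup phase, sharing one demixing matrix per critical band; the Bark mapping uses Traunmüller's approximation, and the per-band estimator is a crude eigenvector stand-in for Pham's JADIAG, kept only so the band-sharing bookkeeping is runnable:

```python
import numpy as np

def bark_band(freq_hz):
    """Critical-band index via Traunmüller's Bark approximation."""
    return int(26.81 * freq_hz / (1960.0 + freq_hz) - 0.53)

def band_shared_demixing(stft_mix, sr):
    """stft_mix: (mics, bins, frames). One demixing matrix is estimated
    per critical band and shared by all bins in that band, instead of
    estimating one matrix per bin."""
    mics, n_bins, n_frames = stft_mix.shape
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), d=1.0 / sr)
    bands = np.array([bark_band(f) for f in freqs])

    demix = {}
    for band in np.unique(bands):
        members = np.where(bands == band)[0]
        # Cross-spectral density averaged over the band's bins.
        csd = np.zeros((mics, mics), dtype=complex)
        for k in members:
            X = stft_mix[:, k, :]
            csd += X @ X.conj().T / n_frames
        csd /= len(members)
        # Stand-in for JADIAG: eigenvectors of the averaged CSD.
        _, vecs = np.linalg.eigh(csd)
        demix[band] = vecs.conj().T

    return np.stack([demix[bands[k]] @ stft_mix[:, k, :] for k in range(n_bins)],
                    axis=1)
```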
An initial investigation of L1 and L2 discourse speech planning in English
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684851
Chiu-yu Tseng, Zhao-yu Su, Chi-Feng Huang, T. Visceglia
A perceptually based hierarchy of prosodic phrase group (HPG) framework was used in this study to investigate similarities and differences in the size and strategy of discourse-level speech planning across L1 and L2 English speaker groups. While both groups appear to produce similar configurations of acoustic contrasts to signal discourse boundaries, L1 speakers were found to produce these cues more robustly in English. Differences were also found between L1 English and L1 Taiwan Mandarin speaker groups with respect to the distribution of prosodic break levels and break locations. These differences in the L1 and L2 organization of discourse speech prosody in English can be largely attributed to between-group differences in speech planning and chunking strategies, whereby L2 speakers use more intermediate chunking units and fewer larger-scale planning units in their prosodic discourse organization. With a better understanding of prosody transfer, we believe that technology developed for L1 Mandarin spoken language processing may be applied, with little modification, to L2 English produced by the same speaker population.
Citations: 7
Non-negative matrix factorization based discriminative features for speaker verification
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684891
Yanhua Long, Lirong Dai, Eryu Wang, B. Ma, Wu Guo
Discovering a discriminative feature representation together with a suitable distance measure is the key to a successful speaker recognition system. In this paper, we propose a new approach to automatic speaker verification. The main contributions of the paper are the extraction of discriminative speaker features using non-negative matrix factorization (NMF) in the GMM mean space, and the use of a cosine-distance measure for speaker classification. With this decomposition, the speaker space is represented by the pattern components, while a speaker can be characterized by a coefficient vector representing a specific localization in that space. We validate the proposed approach on the 10-second training and 10-second testing condition constructed from the 863 Putonghua (Mandarin) corpus. Relative improvements of 10.57% and 26.11% over the conventional GMM-UBM system are achieved for female and male trials, respectively.
Citations: 3
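A minimal sketch of the NMF-plus-cosine pipeline on GMM mean supervectors, using scikit-learn's NMF; the shift to enforce non-negativity and the component count are assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_speaker_features(supervectors, n_components=64):
    """Factorize GMM mean supervectors (one row per utterance) into
    pattern components W and per-utterance coefficients H; the
    coefficients serve as the speaker features."""
    # NMF requires non-negative input; shifting by the global minimum is
    # one simple way to enforce that (an assumption, not necessarily the
    # paper's normalization). With 'nndsvd' init, n_components must not
    # exceed the number of utterances.
    shifted = supervectors - supervectors.min()
    model = NMF(n_components=n_components, init='nndsvd', max_iter=500)
    coeffs = model.fit_transform(shifted)   # localization in speaker space
    return model, coeffs

def cosine_score(enroll, test):
    """Cosine-distance scoring of a verification trial."""
    return float(enroll @ test /
                 (np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-12))
```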
Effects of F0 dimensions in perception of Mandarin tones
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684878
Bin Li, Caicai Zhang
This study focuses on the perception of two synthesized Mandarin tones: the high level tone (Tone 1) and the high falling tone (Tone 4), which have been reported to be difficult for Cantonese learners of Mandarin [15]. As the two tones differ in F0 direction and also vary in F0 onset, it is worth investigating why Cantonese listeners find them perceptually indistinguishable. We aim to find out which F0 cues Cantonese listeners rely on in perceiving these two Mandarin tones by modifying the F0 curves along two dimensions: F0 onset and F0 slope. Results show that Mandarin listeners are able to identify the two pitches based on F0 slope irrespective of F0 onset, whereas Cantonese listeners seem more sensitive to variation in F0 onset.
Citations: 7
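A small sketch of how a two-dimensional stimulus continuum over F0 onset and slope could be generated for such an experiment; the onset and slope values below are invented, not the paper's stimuli:

```python
import numpy as np

def f0_stimulus_grid(onsets_hz, slopes_hz_per_s, duration=0.4, n_points=40):
    """Build the 2-D continuum of synthetic F0 contours: every
    combination of F0 onset and F0 slope. Returns a dict mapping
    (onset, slope) to a contour in Hz."""
    t = np.linspace(0.0, duration, n_points)
    return {(f0, k): f0 + k * t
            for f0 in onsets_hz for k in slopes_hz_per_s}

# Example: five onsets crossed with slopes from level (0) to steeply falling.
stimuli = f0_stimulus_grid(onsets_hz=[120, 130, 140, 150, 160],
                           slopes_hz_per_s=[0, -40, -80, -120])
```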
Semantics-based language modeling for Cantonese-English code-mixing speech recognition
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684900
Houwei Cao, P. Ching, Tan Lee, Y. Yeung
This paper addresses the problem of language modeling for LVCSR of Cantonese-English code-mixing utterances spoken in daily communication. In the absence of a sufficient amount of code-mixing text data, translation-based and semantics-based mappings are applied to n-grams to better estimate the probability of low-frequency and unseen mixed-language n-gram events. In the translation-based mapping scheme, a Cantonese-to-English translation dictionary is adopted to transcribe monolingual Cantonese n-grams into mixed-language n-grams. In the semantics-based mapping scheme, n-gram mapping is based on the meaning and syntactic function of the English words in the lexicon. Different semantics-based language models are trained with different mapping schemes. They are evaluated in terms of perplexity and on the LVCSR task. Experimental results confirm that the more mixed-language n-grams observed after mapping, the better the language-model perplexity and the recognition performance. The proposed language models show significant improvement in recognition performance on embedded English words compared with the baseline 3-gram LM. The best recognition accuracies attained are 63.9% for English words and 74.7% for Cantonese characters in code-mixing utterances.
Citations: 18
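A toy sketch of the translation-based mapping idea, expanding monolingual bigram counts into mixed-language bigrams through a translation dictionary; the flat discount `weight` is an assumption standing in for the paper's probability estimation:

```python
from collections import Counter

def expand_bigrams(bigram_counts, translation_dict, weight=0.1):
    """Map monolingual Cantonese bigrams to mixed-language bigrams by
    substituting either word with its English translations, giving
    unseen code-mixing events non-zero probability mass."""
    mixed = Counter(bigram_counts)
    for (w1, w2), count in bigram_counts.items():
        for eng in translation_dict.get(w1, []):
            mixed[(eng, w2)] += weight * count   # discounted mapped count
        for eng in translation_dict.get(w2, []):
            mixed[(w1, eng)] += weight * count
    return mixed

# Toy example with a hypothetical dictionary entry.
counts = Counter({('睇', '戲'): 25})
mapped = expand_bigrams(counts, {'戲': ['movie']})
# mapped now also contains ('睇', 'movie') with a fraction of the count.
```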
Problems of modeling phone deletion in conversational speech for speech recognition
Pub Date: 2010-11-01 DOI: 10.1109/ISCSLP.2010.5684839
B. Mak, Tom Ko
Recently we proposed a novel method to explicitly model the phone-deletion phenomenon in speech and introduced the context-dependent fragmented word model (CD-FWM). An evaluation on the WSJ1 Hub2 5K task shows that even in read speech, CD-FWM can reduce word error rate (WER) by a relative 10.3%. Since the phone-deletion phenomenon is generally expected to be more pronounced in conversational and spontaneous speech than in read speech, in this paper we extend our investigation to modeling phone deletion in conversation using CD-FWM on the SVitchboard 500-word task. To our surprise, a much smaller recognition gain is obtained. Through a series of analyses, we present some plausible explanations for why phone-deletion modeling is more successful in read speech than in conversational speech, and suggest future directions for improving CD-FWM for recognizing conversational speech.
Citations: 0
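CD-FWM itself is more structured than this, but a toy sketch of the underlying idea, enumerating pronunciation variants with optional phone deletions, shows what a phone-deletion model adds alongside the canonical lexicon entry:

```python
from itertools import combinations

def deletion_variants(phones, max_deletions=1, min_len=2):
    """Enumerate pronunciation variants with up to `max_deletions`
    phones removed, the kind of alternatives a phone-deletion model
    would add next to the canonical pronunciation."""
    variants = {tuple(phones)}
    for k in range(1, max_deletions + 1):
        for drop in combinations(range(len(phones)), k):
            variant = tuple(p for i, p in enumerate(phones) if i not in drop)
            if len(variant) >= min_len:
                variants.add(variant)
    return sorted(variants)

# "probably" in conversational speech often surfaces with deleted phones.
print(deletion_variants(['p', 'r', 'aa', 'b', 'ax', 'b', 'l', 'iy']))
```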