Speech synthesis using approximate matching of syllables
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777834
E. V. Raghavendra, B. Yegnanarayana, K. Prahallad
In this paper we propose a technique for a syllable-based speech synthesis system. While syllable-based synthesizers produce better-sounding speech than diphone- or phone-based ones, covering all syllables of a language is a non-trivial issue. We address syllable coverage by approximating a required syllable with a similar one whenever it is not found in the unit inventory. To verify our hypothesis, we conducted perceptual studies on manually modified sentences and found that our assumption is valid. Similar approaches have been used in speech synthesis, showing that such approximation produces intelligible speech of better quality than diphone units.
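A minimal sketch of the back-off idea described above, assuming a unit inventory keyed by syllable strings; the similarity measure (difflib's sequence matcher over the syllable strings) is an illustrative stand-in, not the authors' actual matching rule:

```python
# Hedged sketch of syllable back-off: when a required syllable is missing
# from the unit inventory, fall back to the most similar available one.
# The difflib-based similarity below is an assumption for illustration.
import difflib

def approximate_syllable(target: str, inventory: set) -> str:
    """Return the target syllable if present, else the closest available one."""
    if target in inventory:
        return target
    return max(inventory,
               key=lambda s: difflib.SequenceMatcher(None, target, s).ratio())

units = {"ka", "ki", "ku", "ta", "ti", "na"}
print(approximate_syllable("ke", units))  # backs off to a near match such as "ka"
```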
{"title":"Speech synthesis using approximate matching of syllables","authors":"E. V. Raghavendra, B. Yegnanarayana, K. Prahallad","doi":"10.1109/SLT.2008.4777834","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777834","url":null,"abstract":"In this paper we propose a technique for a syllable based speech synthesis system. While syllable based synthesizers produce better sounding speech than diphone and phone, the coverage of all syllables is a non-trivial issue. We address the issue of coverage of syllables through approximating the syllable when the required syllable is not found. To verify our hypothesis, we conducted perceptual studies on manually modified sentences and found that our assumption is valid. Similar approaches have been used in speech synthesis and it shows that such approximation produces intelligible and better quality speech than diphone units.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115443001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A similar content retrieval method for podcast episodes
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777899
Junta Mizuno, J. Ogata, Masataka Goto
Given podcasts (audio blogs) that are sets of speech files called episodes, this paper describes a method for retrieving episodes with similar content. Although most previous retrieval methods were based on bibliographic information, tags, or users' playback behaviors without considering spoken content, our method computes content-based similarity from the speech recognition results of podcast episodes, even when those results include errors. To overcome such errors, it converts intermediate speech recognition results into a confusion network containing competitive candidates, and then computes the similarity using keywords extracted from the network. Experimental results with episodes of differing word accuracy and content showed that keywords obtained from competitive candidates were useful in retrieving similar episodes. To show relevant episodes, our method will be incorporated into PodCastle, a public web service that provides full-text searching of podcasts on the basis of speech recognition.
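One way to picture the similarity computation described above — a hedged illustration, not the paper's implementation — is to read each confusion network as a sequence of slots holding competing words with posterior probabilities, accumulate a posterior-weighted bag of words per episode, and compare episodes by cosine similarity:

```python
# Sketch: confusion network as a list of slots, each mapping competing
# words to posteriors (data structures assumed for illustration).
import math
from collections import defaultdict

def keyword_vector(confusion_network):
    vec = defaultdict(float)
    for slot in confusion_network:          # slot: {word: posterior}
        for word, posterior in slot.items():
            if word != "<eps>":             # skip null arcs
                vec[word] += posterior
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

ep1 = [{"speech": 0.7, "speed": 0.3}, {"recognition": 0.9, "<eps>": 0.1}]
ep2 = [{"speech": 0.8, "beach": 0.2}, {"synthesis": 1.0}]
print(cosine(keyword_vector(ep1), keyword_vector(ep2)))
```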
{"title":"A similar content retrieval method for podcast episodes","authors":"Junta Mizuno, J. Ogata, Masataka Goto","doi":"10.1109/SLT.2008.4777899","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777899","url":null,"abstract":"Given podcasts (audio blogs) that are sets of speech files called episodes, this paper describes a method for retrieving episodes that have similar content. Although most previous retrieval methods were based on bibliographic information, tags, or users' playback behaviors without considering spoken content, our method can compute content-based similarity based on speech recognition results of podcast episodes even if the recognition results include some errors. To overcome those errors, it converts intermediate speech-recognition results to a confusion network containing competitive candidates, and then computes the similarity by using keywords extracted from the network. Experimental results with episodes that have different word accuracy and content showed that keywords obtained from competitive candidates were useful in retrieving similar episodes. To show relevant episodes, our method will be incorporated into PodCastle, a public web service that provides full-text searching of podcasts on the basis of speech recognition.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125146285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the effectiveness of features and sampling in extractive meeting summarization
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777864
Shasha Xie, Yang Liu, Hui-Ching Lin
Feature-based approaches are widely used for extractive meeting summarization. In this paper, we analyze and evaluate the effectiveness of different types of features using forward feature selection with an SVM classifier. In addition to features used in prior studies, we introduce topic-related features and demonstrate that they are helpful for meeting summarization. We also propose a new way to resample the sentences based on their salience scores for model training and testing. Experimental results on both human transcripts and recognition output, evaluated with the ROUGE summarization metrics, show that feature selection and data resampling improve system performance.
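The forward-selection loop is straightforward to picture. Below is a sketch under the assumption of a generic feature matrix and binary in-summary labels (placeholder data and a scikit-learn SVM, not the authors' exact setup):

```python
# Greedy forward feature selection around an SVM classifier: repeatedly
# add the feature column that most improves cross-validated accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_keep):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        scores = {f: cross_val_score(SVC(kernel="linear"),
                                     X[:, selected + [f]], y, cv=3).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X = np.random.rand(60, 5)            # 60 sentences, 5 candidate features
y = np.random.randint(0, 2, 60)      # 1 = in summary, 0 = not
print(forward_select(X, y, n_keep=3))
```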
{"title":"Evaluating the effectiveness of features and sampling in extractive meeting summarization","authors":"Shasha Xie, Yang Liu, Hui-Ching Lin","doi":"10.1109/SLT.2008.4777864","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777864","url":null,"abstract":"Feature-based approaches are widely used in the task of extractive meeting summarization. In this paper, we analyze and evaluate the effectiveness of different types of features using forward feature selection in an SVM classifier. In addition to features used in prior studies, we introduce topic related features and demonstrate that these features are helpful for meeting summarization. We also propose a new way to resample the sentences based on their salience scores for model training and testing. The experimental results on both the human transcripts and recognition output, evaluated by the ROUGE summarization metrics, show that feature selection and data resampling help improve the system performance.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125567818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-word modeling of out of vocabulary words in spoken term detection
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777893
Igor Szöke, L. Burget, J. Černocký, M. Fapšo
This paper compares sub-word based methods for the spoken term detection (STD) task and for phone recognition. Sub-word units are needed to search for out-of-vocabulary words. We compared words, phones, and multigrams. The maximal length and pruning of multigrams were investigated first; then two constrained methods of multigram training were proposed. We evaluated on the NIST STD06 dev-set CTS data. We conclude that the proposed method improves phone accuracy by more than 9% relative and STD accuracy by more than 7% relative.
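To make the multigram idea concrete, here is a toy sketch that cuts a phone string into variable-length units up to a maximum length by greedy longest match against a pruned inventory. Real multigram training selects units by likelihood, so this shows only the segmentation structure, not the training:

```python
# Toy segmentation into multigrams: prefer the longest inventory unit
# starting at each position; single phones are always allowed as back-off.
def segment(phones, inventory, max_len=3):
    units, i = [], 0
    while i < len(phones):
        for l in range(min(max_len, len(phones) - i), 0, -1):
            cand = tuple(phones[i:i + l])
            if l == 1 or cand in inventory:
                units.append(cand)
                i += l
                break
    return units

inv = {("s", "t"), ("a", "n", "d")}
print(segment(["s", "t", "a", "n", "d"], inv))  # [('s','t'), ('a','n','d')]
```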
{"title":"Sub-word modeling of out of vocabulary words in spoken term detection","authors":"Igor Szöke, L. Burget, J. Černocký, M. Fapšo","doi":"10.1109/SLT.2008.4777893","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777893","url":null,"abstract":"This paper deals with comparison of sub-word based methods for spoken term detection (STD) task and phone recognition. The sub-word units are needed for search for out-of-vocabulary words. We compared words, phones and multigrams. The maximal length and pruning of multigrams were investigated first. Then two constrained methods of multigram training were proposed. We evaluated on the NIST STD06 dev-set CTS data. The conclusion is that the proposed method improves the phone accuracy more than 9% relative and STD accuracy more than 7% relative.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122946456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unexplored directions in spoken language technology for development
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777825
F. Weber, Kalika Bali, R. Rosenfeld, K. Toyama
The full range of possibilities for spoken-language technologies (SLTs) to impact poor communities has been investigated only partially, despite what appears to be strong potential. Voice interfaces raise fewer barriers for the illiterate, require less training to use, and are a natural choice for applications on cell phones, which have far greater penetration in the developing world than PCs. At the same time, critical lessons of existing technology projects in development still apply and require careful attention. We suggest how to expand the view of SLT for development, and discuss how its potential can realistically be explored.
{"title":"Unexplored directions in spoken language technology for development","authors":"F. Weber, Kalika Bali, R. Rosenfeld, K. Toyama","doi":"10.1109/SLT.2008.4777825","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777825","url":null,"abstract":"The full range of possibilities for spoken-language technologies (SLTs) to impact poor communities has been investigated on partially, despite what appears to be strong potential. Voice interfaces raise fewer barriers for the illiterate, require less training to use, and are a natural choice for applications on cell phones, which have far greater penetration, in the developing world than PCs. At the same time, critical lessons of existing technology projects in development still apply and require careful attention. We suggest how to expand the view of SLT for development, and discuss how its potential can realistically be explored.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128249959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An analysis of grammatical errors in non-native speech in English
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777847
J. Lee, S. Seneff
While a wide variety of grammatical mistakes may be observed in the speech of non-native speakers, the types and frequencies of these mistakes are not random. Certain parts of speech, for example, have been shown to be especially problematic for Japanese learners of English [1]. Modeling these errors can potentially enhance the performance of computer-assisted language learning systems. This paper presents an automatic method to estimate an error model from a non-native English corpus, focusing on articles and prepositions. A fine-grained analysis is achieved by conditioning the errors on appropriate words in the context.
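A sketch of what estimating such a context-conditioned error model might look like: count how often learners substitute one article or preposition for another given a context word, then normalize to probabilities. The corpus format and context choice are assumptions for illustration, not the paper's procedure:

```python
# Estimate P(observed token | context word, correct token) from
# (context, correct, observed) triples extracted from a learner corpus.
from collections import defaultdict, Counter

def estimate_error_model(triples):
    counts = defaultdict(Counter)
    for context, correct, observed in triples:
        counts[(context, correct)][observed] += 1
    return {key: {tok: n / sum(obs.values()) for tok, n in obs.items()}
            for key, obs in counts.items()}

corpus = [("school", "to", "to"), ("school", "to", "in"), ("school", "to", "to")]
model = estimate_error_model(corpus)
print(model[("school", "to")])   # ≈ {'to': 0.67, 'in': 0.33}
```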
{"title":"An analysis of grammatical errors in non-native speech in english","authors":"J. Lee, S. Seneff","doi":"10.1109/SLT.2008.4777847","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777847","url":null,"abstract":"While a wide variety of grammatical mistakes may be observed in the speech of non-native speakers, the types and frequencies of these mistakes are not random. Certain parts of speech, for example, have been shown to be especially problematic for Japanese learners of English [1]. Modeling these errors can potentially enhance the performance of computer-assisted language learning systems. This paper presents an automatic method to estimate an error model from a non-native English corpus, focusing on articles and prepositions. A fine-grained analysis is achieved by conditioning the errors on appropriate words in the context.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128634944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Phonetic name matching for cross-lingual Spoken Sentence Retrieval
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777895
Heng Ji, R. Grishman, Wen Wang
Cross-lingual spoken sentence retrieval (CLSSR) remains a challenge, especially for queries that include OOV words such as person names. This paper proposes a simple method of fuzzy matching between query names and the phones of candidate audio segments. This approach has the advantage of avoiding some word decoding errors in automatic speech recognition (ASR). Experiments on Mandarin-English CLSSR show that phone-based searching and conventional translation-based searching are complementary. Adding phone matching achieved a 26.29% improvement in F-measure over searching on state-of-the-art machine translation (MT) output, and 8.83% over entity translation (ET) output.
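A hedged sketch of fuzzy phone matching for OOV names: slide the query's phone string over a candidate segment's phone string and accept when the normalized edit distance falls below a threshold. The phone representation (characters stand in for phones) and the threshold are illustrative assumptions:

```python
# Levenshtein distance plus a sliding-window match over phone sequences.
def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def name_matches(query_phones, segment_phones, threshold=0.34):
    n = len(query_phones)
    best = min(edit_distance(query_phones, segment_phones[i:i + n])
               for i in range(max(1, len(segment_phones) - n + 1)))
    return best / n <= threshold

print(name_matches(list("obama"), list("xxobamaxx")))  # True
```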
{"title":"Phonetic name matching for cross-lingual Spoken Sentence Retrieval","authors":"Heng Ji, R. Grishman, Wen Wang","doi":"10.1109/SLT.2008.4777895","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777895","url":null,"abstract":"Cross-lingual spoken sentence retrieval (CLSSR) remains a challenge, especially for queries including OOV words such as person names. This paper proposes a simple method of fuzzy matching between query names and phones of candidate audio segments. This approach has the advantage of avoiding some word decoding errors in automatic speech recognition (ASR). Experiments on Mandarin-English CLSSR show that phone-based searching and conventional translation-based searching are complementary. Adding phone matching achieved 26.29% improvement on F-measure over searching on state-of-the-art machine translation (MT) output and 8.83% over entity translation (ET) output.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127751950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bob: A lexicon and pronunciation dictionary generator
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777879
V. Wan, J. Dines, A. Hannani, Thomas Hain
This paper presents Bob, a tool for managing lexicons and generating pronunciation dictionaries for automatic speech recognition systems. It aims to maintain a high level of consistency between lexicons and language-modelling corpora by managing the text normalisation and lexicon generation processes in a single dedicated package. It also aims to maintain consistent pronunciation dictionaries by generating pronunciation hypotheses automatically and aiding their verification. The tool's design and functionality are described, and two case studies highlighting the importance of consistency and illustrating the use of the tool are reported.
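A minimal sketch of the kind of consistency check such a tool performs (not Bob's actual pipeline): normalize a language-modelling corpus, then list vocabulary items the lexicon does not yet cover and that therefore need pronunciation hypotheses. The normalization rule here is deliberately simplistic:

```python
# Find corpus vocabulary missing from the lexicon after normalization.
import re

def normalize(text):
    return re.findall(r"[a-z']+", text.lower())

def missing_from_lexicon(corpus_text, lexicon):
    return {w for w in normalize(corpus_text) if w not in lexicon}

lexicon = {"hello": ["hh ah l ow"], "world": ["w er l d"]}
print(missing_from_lexicon("Hello, brave new world!", lexicon))  # {'brave', 'new'}
```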
{"title":"Bob: A lexicon and pronunciation dictionary generator","authors":"V. Wan, J. Dines, A. Hannani, Thomas Hain","doi":"10.1109/SLT.2008.4777879","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777879","url":null,"abstract":"This paper presents Bob, a tool for managing lexicons and generating pronunciation dictionaries for automatic speech recognition systems. It aims to maintain a high level of consistency between lexicons and language modelling corpora by managing the text normalisation and lexicon generation processes in a single dedicated package. It also aims to maintain consistent pronunciation dictionaries by generating pronunciation hypotheses automatically and aiding their verification. The tool's design and functionality are described. Also two case studies highlighting the importance of consistency and illustrating the use of the tool are reported.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121680972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic identification of gender & accent in spoken Hindi utterances with regional Indian accents
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777902
Kamini Malhotra, A. Khosla
In the past, significant effort has been focused on automatic extraction of information from speech signals, mostly aimed at automatic speech recognition or speaker identification; automatic accent identification (AID) has received far less attention. This paper gives an approach to identifying the gender and accent of a speaker using Gaussian mixture modelling. The proposed approach is text-independent and identifies the accent among four regional Indian accents in spoken Hindi, as well as the gender. The accents considered are Kashmiri, Manipuri, Bengali, and neutral Hindi. The Gaussian mixture model (GMM) approach precludes the need for speech segmentation during training and makes the implementation of the system very simple. When gender-dependent GMMs are used, the accent identification score is enhanced and the gender is also correctly recognized. The results show that GMMs lend themselves very well to the accent and gender identification task. Spectral features are incorporated in the form of mel-frequency cepstral coefficients (MFCCs). The approach can be extended to incorporate other regional accents in a very simple way.
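The classification scheme is the classic GMM recipe: fit one mixture per class on that class's MFCC frames, then label a test utterance by the model with the highest average log-likelihood. A sketch with scikit-learn, where random arrays stand in for real MFCC features (extraction is out of scope here):

```python
# One GMM per (accent, gender) class; classify by log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(frames_by_class, n_components=8):
    models = {}
    for label, frames in frames_by_class.items():   # frames: (n_frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[label] = gmm.fit(frames)
    return models

def identify(models, utterance_frames):
    return max(models, key=lambda c: models[c].score(utterance_frames))

rng = np.random.default_rng(0)
train = {"kashmiri_f": rng.normal(0, 1, (500, 13)),
         "bengali_m": rng.normal(1, 1, (500, 13))}
models = train_models(train)
print(identify(models, rng.normal(1, 1, (200, 13))))  # -> "bengali_m"
```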
{"title":"Automatic identification of gender & accent in spoken Hindi utterances with regional Indian accents","authors":"Kamini Malhotra, A. Khosla","doi":"10.1109/SLT.2008.4777902","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777902","url":null,"abstract":"In the past significant effort has been focused on automatic extraction of information from speech signals. Most techniques have aimed at automatic speech recognition or speaker identification. Automatic accent identification (AID) has received far less attention. This paper gives an approach to identify gender and accent of a speaker using Gaussian mixture modeling technique. The proposed approach is text independent and identifies accent among four regional Indian accents in spoken Hindi and also identifies the gender. The accents worked upon are Kashmiri, Manipuri, Bengali and neutral Hindi. The Gaussian mixture model (GMM) approach precludes the need of speech segmentation for training and makes the implementation of the system very simple. When gender dependent GMMs are used, the accent identification score is enhanced and gender is also correctly recognized. The results show that the GMMs lend themselves to accent and gender identification task very well. In this approach spectral features have been incorporated in the form of mel frequency cepstral coefficients (MFCC). The approach has a wide scope of expansion to incorporate other regional accents in a very simple way.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129375953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential system combination for machine translation of speech
Pub Date: 2008-12-01 | DOI: 10.1109/SLT.2008.4777889
D. Karakos, S. Khudanpur
System combination is a technique that has been shown to yield significant gains in speech recognition and machine translation. Most combination schemes perform an alignment between different system outputs in order to produce lattices (or confusion networks), from which a composite hypothesis is chosen, possibly with the help of a large language model. The benefit of this approach is two-fold: (i) whenever many systems agree with each other on a set of words, the combination output contains these words with high confidence; and (ii) whenever the systems disagree, the language model resolves the ambiguity based on the (probably correct) agreed-upon context. Machine translation system combination is more challenging because of the different word orders of the translations: the alignment has to incorporate computationally expensive movements of word blocks. In this paper, we show how one can combine translation outputs efficiently, extending the incremental alignment procedure of (A-V.I. Rosti et al., 2008). A comparison between different system combination design choices is performed on an Arabic speech translation task.
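A toy sketch of the confusion-network voting idea described above: align each extra hypothesis to a skeleton hypothesis, collect competing words per skeleton position, and take a majority vote. Real MT combination handles word reordering with incremental alignment; the difflib alignment here is a simplifying assumption that only illustrates the voting step:

```python
# Confusion-network-style voting over aligned translation hypotheses.
from collections import Counter
import difflib

def combine(skeleton, others):
    slots = [Counter([w]) for w in skeleton]          # one slot per skeleton word
    for hyp in others:
        sm = difflib.SequenceMatcher(None, skeleton, hyp)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op in ("equal", "replace"):            # add competitors to slots
                for i, j in zip(range(i1, i2), range(j1, j2)):
                    slots[i][hyp[j]] += 1
    return [slot.most_common(1)[0][0] for slot in slots]

hyps = [["the", "talks", "resumed", "today"],
        ["the", "talk", "resumed", "today"],
        ["a", "talks", "resumed", "today"]]
print(" ".join(combine(hyps[0], hyps[1:])))  # "the talks resumed today"
```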
{"title":"Sequential system combination for machine translation of speech","authors":"D. Karakos, S. Khudanpur","doi":"10.1109/SLT.2008.4777889","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777889","url":null,"abstract":"System combination is a technique which has been shown to yield significant gains in speech recognition and machine translation. Most combination schemes perform an alignment between different system outputs in order to produce lattices (or confusion networks), from which a composite hypothesis is chosen, possibly with the help of a large language model. The benefit of this approach is two-fold: (i) whenever many systems agree with each other on a set of words, the combination output contains these words with high confidence; and (ii) whenever the systems disagree, the language model resolves the ambiguity based on the (probably correct) agreed upon context. The case of machine translation system combination is more challenging because of the different word orders of the translations: the alignment has to incorporate computationally expensive movements of word blocks. In this paper, we show how one can combine translation outputs efficiently, extending the incremental alignment procedure of (A-V.I. Rosti et al., 2008). A comparison between different system combination design choices is performed on an Arabic speech translation task.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132771317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}