Soundbite identification using reference and automatic transcripts of broadcast news speech
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430189
F. Liu, Yang Liu
Soundbite identification in broadcast news is important for locating information useful for question answering, mining the opinions of a particular person, and enriching speech recognition output with quotation marks. This paper presents a systematic study of the problem under a classification framework, covering problem formulation, feature extraction, and the effect of using automatic speech recognition (ASR) output and automatic sentence boundary detection. Our experiments on a Mandarin broadcast news speech corpus show that the three-way classification framework outperforms binary classification, and that the entropy-based feature weighting method generally performs better than the alternatives. Using ASR output degrades system performance, with more of the degradation attributable to automatic sentence segmentation than to speech recognition errors for this task, especially in the recall rate.
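The abstract does not spell out the weighting scheme; a minimal sketch of one common entropy-based term weighting (an illustration under that assumption, not necessarily the authors' exact formula) down-weights terms whose occurrences are spread evenly across documents:

import math
from collections import Counter

def entropy_weights(docs):
    """Entropy-based term weighting over a list of token lists.
    A term spread evenly across documents has high entropy and gets a
    weight near 0; a term concentrated in few documents gets a weight
    near 1. Assumes at least two documents."""
    n_docs = len(docs)
    counts = [Counter(d) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    weights = {}
    h_max = math.log(n_docs)
    for term, total in totals.items():
        h = 0.0
        for c in counts:
            p = c[term] / total          # share of the term's mass in this doc
            if p > 0:
                h -= p * math.log(p)
        weights[term] = 1.0 - h / h_max  # normalized entropy, flipped
    return weights

Such weights would then scale the lexical features fed to the three-way classifier.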
{"title":"Soundbite identification using reference and automatic transcripts of broadcast news speech","authors":"F. Liu, Yang Liu","doi":"10.1109/ASRU.2007.4430189","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430189","url":null,"abstract":"Soundbite identification in broadcast news is important for locating information useful for question answering, mining opinions of a particular person, and enriching speech recognition output with quotation marks. This paper presents a systematic study of this problem under a classification framework, including problem formulation for classification, feature extraction, and the effect of using automatic speech recognition (ASR) output and automatic sentence boundary detection. Our experiments on a Mandarin broadcast news speech corpus show that the three-way classification framework outperforms the binary classification. The entropy-based feature weighting method generally performs better than others. Using ASR output degrades system performance, with more degradation observed from using automatic sentence segmentation than speech recognition errors for this task, especially on the recall rate.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132701088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Robust speech recognition with on-line unsupervised acoustic feature compensation
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430092
L. Buera, A. Miguel, Eduardo Lleida Solano, Oscar Saz-Torralba, A. Ortega
An on-line unsupervised hybrid compensation technique is proposed to reduce the mismatch between training and testing conditions. It combines multi-environment model-based linear normalization with a GMM-based cross-probability model (MEMLIN CPM) and a novel acoustic model adaptation method based on rotation transformations. A set of rotation transformations is estimated by linear regression, in an unsupervised process, from clean and MEMLIN CPM-normalized training data. In testing, each MEMLIN CPM-normalized frame is decoded using a modified Viterbi algorithm and expanded acoustic models, which are obtained from the reference models and the set of rotation transformations. The proposed solution was evaluated on the Spanish SpeechDat Car database: over standard ETSI front-end parameters, MEMLIN CPM achieves an average WER improvement of 83.89%, and the hybrid technique raises this to 92.07%. On the Aurora 2 database, the hybrid technique obtains an average improvement of 68.88% with clean training.
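As a rough illustration of the feature-normalization step (a sketch only: MEMLIN's cross-probability model is more elaborate, and the GMM parameters, environment set, and per-component bias vectors r below are placeholders), each frame is shifted by a posterior-weighted combination of environment- and component-dependent biases:

import numpy as np

def _gauss(x, comp):
    """Weighted likelihood of frame x under one diagonal-covariance Gaussian."""
    d = x - comp["mu"]
    return (comp["w"] * np.exp(-0.5 * np.sum(d * d / comp["var"]))
            / np.sqrt(np.prod(2.0 * np.pi * comp["var"])))

def compensate_frame(x, envs):
    """Shift float feature vector x by a posterior-weighted sum of
    per-component bias vectors r (clean-minus-noisy offsets, learned
    beforehand). Each env is {"components": [{"w","mu","var","r"}, ...]}."""
    per_env = [np.array([_gauss(x, c) for c in e["components"]]) for e in envs]
    env_mass = np.array([lk.sum() for lk in per_env])
    p_env = env_mass / env_mass.sum()            # p(e | x)
    shift = np.zeros_like(x)
    for pe, lk, e in zip(p_env, per_env, envs):
        p_comp = lk / lk.sum()                   # p(s | x, e)
        for ps, comp in zip(p_comp, e["components"]):
            shift += pe * ps * comp["r"]         # posterior-weighted bias
    return x - shift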
{"title":"Robust speech recognition with on-line unsupervised acoustic feature compensation","authors":"L. Buera, A. Miguel, EDUARDO LLEIDA SOLANO, Oscar Saz-Torralba, A. Ortega","doi":"10.1109/ASRU.2007.4430092","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430092","url":null,"abstract":"An on-line unsupervised hybrid compensation technique is proposed to reduce the mismatch between training and testing conditions. It combines multi-environment model based linear normalization with cross-probability model based on GMMs (MEMLIN CPM) with a novel acoustic model adaptation method based on rotation transformations. Hence, a set of rotation transformations is estimated with clean and MEMLIN CPM-normalized training data by linear regression in an unsupervised process. Thus, in testing, each MEMLIN CPM normalized frame is decoded using a modified Viterbi algorithm and expanded acoustic models, which are obtained from the reference ones and the set of rotation transformations. To test the proposed solution, some experiments with Spanish SpeechDat Car database were carried out. MEMLIN CPM over standard ETSI front-end parameters reaches 83.89% of average improvement in WER, while the introduced hybrid solution goes up to 92.07%. Also, the proposed hybrid technique was tested with Aurora 2 database, obtaining an average improvement of 68.88% with clean training.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133609156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A multi-layer architecture for semi-synchronous event-driven dialogue management
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430165
Antoine Raux, M. Eskénazi
We present a new architecture for spoken dialogue systems that explicitly separates the discrete, abstract representation used in the high-level dialogue manager from the continuous, real-time nature of real-world events. We propose the concept of the conversational floor as a means of synchronizing the internal state of the dialogue manager with the real world, and introduce a new component, the Interaction Manager, to act as the interface between the two layers. The proposed architecture was implemented as a new version of the Olympus framework, which can be used across different domains and modalities. We confirmed the practicality of the approach by porting Let's Go, an existing deployed dialogue system, to the new architecture.
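One way to picture the layering (a generic sketch of the pattern, not the Olympus implementation): continuous events update a conversational-floor state in real time, and the discrete dialogue manager is consulted only at floor boundaries. All event names and the action interface below are illustrative assumptions:

import enum
import queue

class Floor(enum.Enum):
    USER = "user"
    SYSTEM = "system"
    FREE = "free"

class InteractionManager:
    """Mediates between continuous real-world events and a discrete
    dialogue manager (dm). dm needs one method here:
    next_action(utterance) -> prompt string or None."""

    def __init__(self, dm):
        self.dm = dm
        self.events = queue.Queue()
        self.floor = Floor.FREE

    def on_event(self, kind, payload=None):
        """Called from audio/pen/ASR threads as things happen."""
        self.events.put((kind, payload))

    def run_once(self):
        kind, payload = self.events.get()          # blocks until an event arrives
        if kind == "user_speech_start":
            self.floor = Floor.USER                # user takes the floor; hold the DM
        elif kind == "user_utterance_end":
            self.floor = Floor.FREE
            action = self.dm.next_action(payload)  # discrete DM step
            if action is not None:
                self.floor = Floor.SYSTEM
                return action                      # hand the prompt to synthesis
        elif kind == "system_prompt_done":
            self.floor = Floor.FREE
        return None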
{"title":"A multi-layer architecture for semi-synchronous event-driven dialogue management","authors":"Antoine Raux, M. Eskénazi","doi":"10.1109/ASRU.2007.4430165","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430165","url":null,"abstract":"We present a new architecture for spoken dialogue systems that explicitly separates the discrete, abstract representation used in the high-level dialogue manager and the continuous, real-time nature of real world events. We propose to use the concept of conversational floor as a means to synchronize the internal state of the dialogue manager with the real world. To act as the interface between these two layers, we introduce a new component, called the Interaction Manager. The proposed architecture was implemented as a new version of the Olympus framework, which can be used across different domains and modalities. We confirmed the practicality of the approach by porting Let's Go, an existing deployed dialogue system to the new architecture.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132594350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Implicit user-adaptive system engagement in speech, pen and multimodal interfaces
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430162
S. Oviatt
The present research contributes new empirical findings, theory, and prototyping toward developing implicit user-adaptive techniques for system engagement based exclusively on speech amplitude and pen pressure. The results reveal that people spontaneously adapt their communicative energy level reliably, substantially, and in different modalities to designate and repair an intended interlocutor in a computer-mediated group setting. Furthermore, this behavior alone can be harnessed to achieve system engagement accuracies in the 75-86% range. In short, a high level of correct system engagement was achieved based exclusively on implicit cues in users' energy level during communication.
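The paper reports accuracies rather than a decision rule; as a toy illustration of the underlying idea, an utterance can be flagged as system-directed when its amplitude rises above the speaker's own running baseline (the margin and smoothing constants below are arbitrary placeholders, not values from the study):

class EnergyEngagementDetector:
    """Flags an utterance as system-directed when its level exceeds the
    speaker's running baseline by a margin. margin_db and alpha are
    illustrative placeholders."""

    def __init__(self, margin_db=1.5, alpha=0.9):
        self.margin_db = margin_db
        self.alpha = alpha           # smoothing for the per-speaker baseline
        self.baseline_db = None

    def observe(self, utterance_db):
        if self.baseline_db is None:
            self.baseline_db = utterance_db   # first utterance sets the baseline
            return False
        engaged = utterance_db > self.baseline_db + self.margin_db
        # track the speaker's habitual level with an exponential moving average
        self.baseline_db = (self.alpha * self.baseline_db
                            + (1.0 - self.alpha) * utterance_db)
        return engaged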
{"title":"Implicit user-adaptive system engagement in speech, pen and multimodal interfaces","authors":"S. Oviatt","doi":"10.1109/ASRU.2007.4430162","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430162","url":null,"abstract":"The present research contributes new empirical research, theory, and prototyping toward developing implicit user-adaptive techniques for system engagement based exclusively on speech amplitude and pen pressure. The results reveal that people will spontaneously adapt their communicative energy level reliably, substantially, and in different modalities to designate and repair an intended interlocutor in a computer-mediated group setting. Furthermore, this sole behavior can be harnessed to achieve system engagement accuracies in the 75 - 86 % range. In short, there was a high level of correct system engagement based exclusively on implicit cues in users' energy level during communication.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115103956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Analytical comparison between position specific posterior lattices and confusion networks based on words and subword units for spoken document indexing
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430193
Yi-Cheng Pan, Hung-lin Chang, Lin-Shan Lee
In this paper we analytically compare two widely accepted approaches to spoken document indexing, position-specific posterior lattices (PSPL) and confusion networks (CN), in terms of retrieval accuracy and index size. The fundamental distinctions between the two approaches in construction units, posterior probabilities, number of clusters, indexing coverage, and space requirements are discussed in detail. A new approach to approximating subword posterior probabilities in a word lattice is also incorporated into PSPL/CN to handle OOV and rare-word problems, which were unaddressed in the original PSPL and CN approaches. Extensive experiments on Chinese broadcast news segments indicate that PSPL offers higher accuracy than CN but requires much larger disk space, while subword-based PSPL turns out to be very attractive because it lowers the storage cost while offering even higher accuracy.
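At its core, a PSPL is an inverted index whose postings carry a position and a posterior probability rather than a hard occurrence; a minimal in-memory sketch of such an index and a soft unigram lookup (the data layout is an assumption; the full PSPL scoring also exploits position adjacency for n-gram matches):

from collections import defaultdict

class PSPLIndex:
    """Inverted index over (document, position) bins: add() accumulates
    the posterior probability that `unit` (a word, or a subword unit in
    the subword variant) occupies position `pos` in `doc`."""

    def __init__(self):
        self.postings = defaultdict(list)   # unit -> [(doc, pos, posterior)]

    def add(self, unit, doc, pos, posterior):
        self.postings[unit].append((doc, pos, posterior))

    def unigram_scores(self, query_units):
        """Soft hit count per document for a bag of query units."""
        scores = defaultdict(float)
        for u in query_units:
            for doc, _pos, p in self.postings[u]:
                scores[doc] += p
        return dict(scores)

The subword variant indexes character- or syllable-level units instead of words, which is what lets it recover OOV and rare-word queries at lower storage cost.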
{"title":"Analytical comparison between position specific posterior lattices and confusion networks based on words and subword units for spoken document indexing","authors":"Yi-Cheng Pan, Hung-lin Chang, Lin-Shan Lee","doi":"10.1109/ASRU.2007.4430193","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430193","url":null,"abstract":"In this paper we analytically compare the two widely accepted approaches of spoken document indexing, position specific posterior lattices (PSPL) and confusion network (CN), in terms of retrieval accuracy and index size. The fundamental distinctions between these two approaches in terms of construction units, posterior probabilities, number of clusters, indexing coverage and space requirements are discussed in detail. A new approach to approximate subword posterior probability in a word lattice is also incorporated in PSPL/CN to handle OOV/rare word problems, which were unaddressed in original PSPL and CN approaches. Extensive experimental results on Chinese broadcast news segments indicate that PSPL offers higher accuracy than CN but requiring much larger disk space, while subword-based PSPL turns out to be very attractive because it lowers the storage cost while offers even higher accuracies.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114589185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Type-II dialogue systems for information access from unstructured knowledge sources
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430170
Yi-Cheng Pan, Lin-Shan Lee
In this paper, we present a formulation and framework for a new type of dialogue system, referred to here as type-II dialogue systems. The distinguishing feature of such systems is that they access information from unstructured knowledge sources, lacking a well-organized back-end database to supply information to the user; typical example tasks include retrieval, browsing, and question answering. Mainstream dialogue systems with a well-organized back-end database are referred to as type-I systems. The functionalities of each module in type-II dialogue systems are analyzed, presented, and compared with the corresponding modules in type-I systems. A preliminary type-II dialogue system recently developed at National Taiwan University is presented at the end as a typical example.
{"title":"Type-II dialogue systems for information access from unstructured knowledge sources","authors":"Yi-Cheng Pan, Lin-Shan Lee","doi":"10.1109/ASRU.2007.4430170","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430170","url":null,"abstract":"In this paper, we present a new formulation and a new framework for a new type of dialogue system, referred to as the type-II dialogue systems in this paper. The distinct feature of such dialogue systems is their tasks of information access from unstructured knowledge sources, or the lack of a well-organized back-end database offering the information for the user. Typical example tasks of this type of dialogue systems include retrieval, browsing and question answering. The mainstream dialogue systems with a well-organized back-end database are then referred to as type-I dialogue systems here in the paper. The functionalities of each module in such type-II dialogue systems are analyzed, presented, and compared with the respective modules in type-I dialogue systems. A preliminary type-II dialogue system recently developed in National Taiwan University is also presented at the end as a typical example.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128295018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient combination of parametric spaces, models and metrics for speaker diarization
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430120
Themos Stafylakis, V. Katsouros, G. Carayannis
In this paper we present a method for combining several acoustic parametric spaces, statistical models, and distance metrics in the speaker diarization task. Focusing on the post-segmentation part of the problem, we adopt an incremental feature selection and fusion algorithm, based on the maximum entropy principle and the iterative scaling algorithm, that combines several statistical distance measures on pairs of speech chunks. This approach places the merging-of-chunks clustering process in a probabilistic framework. We also propose a decomposition of the input space according to gender, recording conditions, and chunk lengths. The algorithm produces highly competitive results compared to state-of-the-art GMM-UBM methods.
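For a binary merge/don't-merge decision, a maximum entropy model reduces to an exponential (logistic) combination of the per-pair distance features, with weights trainable by iterative scaling; a sketch of the scoring step (feature names and weights below are hypothetical):

import math

def merge_probability(features, weights):
    """Maximum-entropy combination of distance measures for a pair of
    speech chunks; `weights` would be learned by iterative scaling."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))        # p(merge | chunk pair)

# hypothetical measures and weights for one chunk pair
pair = {"bic": -12.3, "glr": 0.8, "kl2": 1.9}
lambdas = {"bic": -0.05, "glr": 1.2, "kl2": -0.4}
merge = merge_probability(pair, lambdas) > 0.5   # merge into one speaker cluster?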
{"title":"Efficient combination of parametric spaces, models and metrics for speaker diarization1","authors":"Themos Stafylakis, V. Katsouros, G. Carayannis","doi":"10.1109/ASRU.2007.4430120","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430120","url":null,"abstract":"In this paper we present a method of combining several acoustic parametric spaces, statistical models and distance metrics in speaker diarization task. Focusing our interest on the post-segmentation part of the problem, we adopt an incremental feature selection and fusion algorithm based on the Maximum Entropy Principle and Iterative Scaling Algorithm that combines several statistical distance measures on speech-chunk pairs. By this approach, we place the merging-of-chunks clustering process into a probabilistic framework. We also propose a decomposition of the input space according to gender, recording conditions and chunk lengths. The algorithm produced highly competitive results compared to GMM-UBM state-of-the-art methods.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133965292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Mandarin lecture speech transcription system for speech summarization
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430157
R. Chan, J. Zhang, Pascale Fung, Lu Cao
This paper introduces our work on Mandarin lecture speech transcription. In particular, we present our work on a small database containing only 16 hours of audio data and 0.16 M words of text data. A range of experiments were carried out to improve the acoustic model and the language model, including adapting the lecture speech data to the read speech data for acoustic modeling, and using lecture conference papers, presentation slides, and web data from similar domains for language modeling. We also study the effects of automatic segmentation, unsupervised acoustic model adaptation, and language model adaptation in our recognition system. Using a 3×RT multi-pass decoding strategy, our final system obtains 70.3% recognition accuracy. Finally, feeding the transcripts into an SVM-based summarizer yields a ROUGE-L F-measure of 66.5%.
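ROUGE-L, the summarization metric reported above, scores a candidate summary against a reference by their longest common subsequence (LCS); a compact reference implementation of the F-measure (beta controls the recall weighting; 1.0 here is a neutral choice, not necessarily the paper's setting):

def rouge_l_f(candidate, reference, beta=1.0):
    """ROUGE-L: F-measure over LCS length between two token lists."""
    m, n = len(candidate), len(reference)
    # dynamic-programming LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / m, lcs / n
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)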
{"title":"A Mandarin lecture speech transcription system for speech summarization","authors":"R. Chan, J. Zhang, Pascale Fung, Lu Cao","doi":"10.1109/ASRU.2007.4430157","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430157","url":null,"abstract":"This paper introduces our work on mandarin lecture speech transcription. In particular, we present our work on a small database, which contains only 16 hours of audio data and 0.16 M words of text data. A range of experiments have been done to improve the performances of the acoustic model and the language model, these include adapting the lecture speech data to the reading speech data for acoustic modeling and the use of lecture conference paper, power points and similar domain web data for language modeling. We also study the effects of automatic segmentation, unsupervised acoustic model adaptation and language model adaptation in our recognition system. By using a 3timesRT multiple passes decoding strategy, we obtain 70.3% accuracy performance in our final system. Finally, we apply our speech transcription system into a SVM summarizer and obtain a ROUGE-L F-measure of 66.5%.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116059809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

The GALE project: A description and an update
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430115
Jordan Cohen
Summary form only given. GALE (Global Autonomous Language Exploitation) is a DARPA program to develop and apply computer software technologies to absorb, translate, analyze, and interpret huge volumes of speech and text in multiple languages. The program has been active for two years, and the GALE contractors have been developing highly robust speech recognition, machine translation, and information delivery systems for Chinese and Arabic. Several talks in this workshop present work developed under GALE. This overview talk reviews the program goals, the technical highlights, and the technical issues remaining in the GALE project.
{"title":"The GALE project: A description and an update","authors":"Jordan Cohen","doi":"10.1109/ASRU.2007.4430115","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430115","url":null,"abstract":"Summary form only given. The GALE (global autonomous language exploitation) program is a DARPA program to develop and apply computer software technologies to absorb, translate, analyze, and interpret huge volumes of speech and text in multiple languages This program has been active for two years, and the GALE contractors have been engaged in developing highly robust speech recognition, machine translation, and information delivery systems in Chinese and Arabic. Several GALE-developed talks will be given in this workshop. This overview talk will review the program goals, the technical highlights, and the technical issues remaining in the GALE project.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123449833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Spoken document summarization using relevant information
Pub Date: 2007-12-01 | DOI: 10.1109/ASRU.2007.4430107
Yi-Ting Chen, Shih-Hsiang Lin, H. Wang, Berlin Chen
Extractive summarization typically selects indicative sentences from a document automatically, according to a target summarization ratio, and then sequences them to form a summary. In this paper, we investigate the use of information from relevant documents, retrieved from a contemporary text collection for each sentence of the spoken document to be summarized, within a probabilistic generative framework for extractive spoken document summarization. In the proposed methods, the probability of a document being generated by a sentence is modeled by a hidden Markov model (HMM), while the retrieved relevant text documents are used to estimate the HMM's parameters and the sentence's prior probability. Experiments on Chinese broadcast news compiled in Taiwan show that the new methods outperform the previous HMM approach.
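The generative score in this family of methods treats each sentence as a two-state generator of the document's words, interpolating the sentence's own unigram model with a background model; a minimal sketch (the interpolation weight lam and the paper's relevance-based parameter estimation are not reproduced here):

import math
from collections import Counter

def sentence_score(sentence, document, bg_prob, lam=0.7):
    """log P(D | S) under an interpolated unigram model: each document
    word is generated either by the sentence model (weight lam) or by a
    background model (weight 1 - lam). Higher scores -> more indicative."""
    counts = Counter(sentence)
    total = len(sentence)
    score = 0.0
    for w in document:
        p_sent = counts[w] / total if total else 0.0
        score += math.log(lam * p_sent + (1 - lam) * bg_prob.get(w, 1e-9))
    return score

def summarize(sentences, bg_prob, ratio=0.3):
    """Keep the top `ratio` of sentences by score, in original order."""
    doc = [w for s in sentences for w in s]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i], doc, bg_prob),
                    reverse=True)
    keep = set(ranked[:max(1, int(ratio * len(sentences)))])
    return [sentences[i] for i in sorted(keep)]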
{"title":"Spoken document summarization using relevant information","authors":"Yi-Ting Chen, Shih-Hsiang Lin, H. Wang, Berlin Chen","doi":"10.1109/ASRU.2007.4430107","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430107","url":null,"abstract":"Extractive summarization usually automatically selects indicative sentences from a document according to a certain target summarization ratio, and then sequences them to form a summary. In this paper, we investigate the use of information from relevant documents retrieved from a contemporary text collection for each sentence of a spoken document to be summarized in a probabilistic generative framework for extractive spoken document summarization. In the proposed methods, the probability of a document being generated by a sentence is modeled by a hidden Markov model (HMM), while the retrieved relevant text documents are used to estimate the HMM's parameters and the sentence's prior probability. The results of experiments on Chinese broadcast news compiled in Taiwan show that the new methods outperform the previous HMM approach.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122362309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}