Title: Speech quality objective assessment using neural network
Authors: Q. Fu, Kechu Yi, Mingui Sun
Venue: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.861932

This paper presents a novel method for objective assessment of speech quality based on a one-step strategy using a feedforward neural network. Currently, almost all existing methods for this assessment can be regarded as two-step strategies, requiring a distortion computation followed by a mapping from the average distortion value to the mean opinion score (MOS). Our new method combines these two steps by means of a neural network which can incorporate the perceptual properties of the human auditory system and provide an MOS estimate directly. Our theoretical analysis and experimental results suggest that this method of MOS estimation significantly outperforms the traditional methods. The correlation coefficient between the subjective test score and the objective MOS estimate reaches approximately 0.95.
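The one-step idea, mapping perceptual features straight to an MOS estimate with a single network instead of distortion-then-regression, can be sketched as below. The synthetic features, labels, network size, and training schedule are all hypothetical placeholders, not the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: perceptual feature vectors (e.g. per-frame
# auditory-spectrum distances) and their subjective MOS labels in [1, 5].
X = rng.normal(size=(200, 8))
y = 4.0 / (1.0 + np.exp(-X @ rng.normal(size=8))) + 1.0

# One hidden tanh layer, trained by plain gradient descent on squared error.
W1 = rng.normal(scale=0.1, size=(8, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=16);      b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)       # hidden activations
    return h, h @ W2 + b2          # direct MOS estimate (one step)

_, pred0 = forward(X)
loss0 = np.mean((pred0 - y) ** 2)

lr = 0.05
for _ in range(500):
    h, pred = forward(X)
    err = pred - y
    gW2 = h.T @ err / len(X); gb2 = err.mean()
    dh = np.outer(err, W2) * (1 - h ** 2)   # back-prop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

_, pred = forward(X)
loss = np.mean((pred - y) ** 2)
```

The point of the one-step strategy is that the network output itself is the MOS estimate, so no separate distortion-to-MOS mapping has to be fitted afterwards.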
Title: Projective residual vector quantization and mapped residual pooling
Authors: Ryan P. Thomas, T. Moon
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.859197

This paper points out two potential problems with residual vector quantization (RVQ): tree entanglement and non-projectiveness of the quantizer. The use of a boundary normalization mapping is proposed to pool all quantization residuals at a stage into identically shaped regions, reducing or eliminating entanglement. Also, a reconstruction codebook is proposed to eliminate the non-projectiveness. Results are presented for both random and image data.
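For readers unfamiliar with RVQ, here is a minimal two-stage version with direct-sum reconstruction, just to make the setting concrete. The codebooks are fixed random ones, not trained; the stage-2 codebook includes the zero vector so a residual can be left unquantized. This sketches the baseline scheme, not the paper's normalization or reconstruction-codebook fixes.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest(codebook, x):
    # Index of the nearest code vector for each row of x.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

X = rng.normal(size=(500, 2))
C1 = rng.normal(size=(8, 2))                              # stage-1 codebook
C2 = np.vstack([np.zeros((1, 2)),                         # zero code: "no refinement"
                0.3 * rng.normal(size=(7, 2))])           # stage-2 (residual) codebook

i1 = nearest(C1, X)
r = X - C1[i1]                 # stage-1 residuals (their shapes differ per cell:
i2 = nearest(C2, r)            # this is the entanglement issue the paper targets)
Xhat1 = C1[i1]
Xhat2 = C1[i1] + C2[i2]        # direct-sum reconstruction

e1 = np.mean((X - Xhat1) ** 2)
e2 = np.mean((X - Xhat2) ** 2)
```

Because the zero vector is available at stage 2, the second stage can never increase a point's error, so `e2 <= e1` by construction.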
Title: Implementing a high accuracy speaker-independent continuous speech recognizer on a fixed-point DSP
Authors: Y. Gong, Yu-Hung Kao
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.860202

Continuous speech recognition is a resource-intensive algorithm. Commercial dictation software requires more than 10 Mbytes of disk space to install and 32 Mbytes of RAM to run. A typical embedded system cannot afford this much RAM because of its high cost and power consumption; it also lacks a disk to store the large amount of static data (e.g. acoustic models). We have been working on the optimization of a small-vocabulary speech recognizer suitable for implementation on a 16-bit fixed-point DSP. This recognizer supports sophisticated continuous-density, tied-mixture Gaussians, parallel model combination, and a noise-robust utterance detection algorithm. The fixed-point version achieves the same performance as the floating-point version. The algorithm runs in real time on a 100 MHz, 16-bit, fixed-point Texas Instruments TMS320C5410, even for the most challenging task: continuous digit dialing with a hands-free microphone under driving conditions.
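The core discipline on a 16-bit fixed-point DSP is Q15 arithmetic: samples and weights live in [-1, 1) as 16-bit integers, products are accumulated in a wide accumulator, and the result is rescaled. The sketch below illustrates the idea in Python (the feature/weight vectors are invented, and a real C54x implementation would of course be assembly or C intrinsics, not Python).

```python
import numpy as np

rng = np.random.default_rng(6)

Q = 15  # Q15: 1 sign bit, 15 fractional bits

def to_q15(x):
    # Saturating conversion to Q15, as on a 16-bit fixed-point DSP.
    return int(np.clip(round(x * (1 << Q)), -32768, 32767))

def q15_dot(a, b):
    # Multiply-accumulate in a wide accumulator (the C54x has a 40-bit
    # accumulator for exactly this purpose), then rescale once at the end.
    acc = sum(x * y for x, y in zip(a, b))
    return acc >> Q

feat = rng.uniform(-1, 1, 16)   # e.g. normalized cepstral features (hypothetical)
wts = rng.uniform(-1, 1, 16)

fixed = q15_dot([to_q15(v) for v in feat], [to_q15(v) for v in wts]) / (1 << Q)
exact = float(feat @ wts)
```

With per-value rounding error bounded by half an LSB (about 1.5e-5), a 16-term dot product stays within roughly 1e-3 of the floating-point result, which is consistent with the authors' observation that the fixed-point recognizer matches floating-point performance.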
Title: Concatenating syllables for response generation in spoken language applications
Authors: T. Fung, H. Meng
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.859114

We describe our approach to developing a speech synthesis technique for response generation in domain-specific spoken language applications. Our approach handles two Chinese dialects: Cantonese and Putonghua. We chose the foreign exchange domain and worked with its constrained vocabulary and response expressions. The syllable is selected as our basic unit for concatenation. Each unit label includes a two-digit appendix that encodes the distinctive features of the left and right coarticulatory context. Our approach attempts to maximize the intelligibility and naturalness of the responses within the application domain. Hence the synthesized outputs compare favorably with those of a domain-independent TD-PSOLA synthesizer.
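The labeling scheme can be illustrated with a toy lookup. Everything here is hypothetical: the syllable romanizations, the context-class digits, and the fallback rule are invented for illustration, since the paper does not spell out its selection logic at this level.

```python
# Hypothetical label scheme: base syllable plus a two-digit appendix
# encoding left/right coarticulatory context classes.
def unit_label(syllable, left, right):
    return f"{syllable}_{left}{right}"

# Toy inventory of recorded units, keyed by contextual label.
inventory = {
    "nei_02": "<waveform A>",
    "nei_13": "<waveform B>",
    "wui_20": "<waveform C>",
}

def select_unit(syllable, left, right):
    # Prefer an exact contextual match; otherwise fall back to any
    # recorded unit of the same base syllable.
    exact = unit_label(syllable, left, right)
    if exact in inventory:
        return exact
    return next(k for k in inventory if k.startswith(syllable + "_"))

chosen = select_unit("nei", 1, 3)
```

Encoding the coarticulatory context directly in the unit label lets selection reduce to dictionary lookup, which suits a constrained-vocabulary domain like foreign exchange responses.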
Title: Soft GPD for minimum classification error rate training
Authors: Bertram E. Shi, K. Yao, Z. Cao
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.861803

Minimum classification error (MCE) rate training is a discriminative training method which seeks to minimize an empirical estimate of the error probability over a training set. The segmental generalized probabilistic descent (GPD) algorithm for MCE uses the log likelihood of the best path as a discriminant function to estimate the error probability. This paper shows that, by using a discriminant function similar to the auxiliary function used in EM, we can obtain a "soft" version of GPD, in the sense that information about all possible paths is retained. Its complexity is similar to that of segmental GPD, and for certain parameter values the algorithm is equivalent to segmental GPD. By modifying the misclassification measure usually used, we obtain an algorithm for embedded MCE training for continuous speech which does not require a separate N-best search to determine competing classes. Experimental results show an error rate reduction of 20% compared with maximum likelihood training.
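The standard MCE machinery behind both segmental and soft GPD is a smoothed misclassification measure: the correct class's discriminant is compared against a soft maximum over competitors, and a sigmoid turns the measure into a differentiable stand-in for the 0/1 loss. The sketch below shows that standard construction (the scores are invented; this is the generic MCE measure, not the paper's EM-style discriminant).

```python
import numpy as np

def misclassification(g, k, eta):
    """Smoothed misclassification measure d_k: negative correct-class
    score plus a soft maximum over competitors; as eta grows the soft
    maximum hardens toward the best competitor (the segmental case)."""
    g = np.asarray(g, float)
    comp = np.delete(g, k)
    return -g[k] + np.log(np.mean(np.exp(eta * comp))) / eta

def mce_loss(d, gamma=1.0):
    # Sigmoid smoothing of the 0/1 loss: ~0 when clearly correct (d << 0),
    # ~1 when clearly wrong (d >> 0).
    return 1.0 / (1.0 + np.exp(-gamma * d))

g = np.array([2.0, 1.5, -0.5])               # correct class 0 has the top score
d_soft = misclassification(g, 0, eta=1.0)    # retains all competitors
d_hard = misclassification(g, 0, eta=200.0)  # ~ -g[0] + max competitor
```

This mirrors the paper's observation that for certain parameter values the soft formulation reduces to segmental GPD: large `eta` collapses the soft maximum onto the single best competing score.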
Title: Infinitely divisible cascade analysis of network traffic data
Authors: D. Veitch, P. Abry, P. Flandrin, P. Chainais
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.861931

Infinitely divisible cascades are a model class previously introduced in the field of turbulence to describe the statistics of velocity fields. In this paper, using a wavelet reformulation of the cascades, we investigate their ability to analyze and model the scaling properties of data, and we compare their fundamental ingredients to those of other scaling model classes such as self-similar and multifractal processes. We also propose an estimation procedure for the propagator, or kernel, of the cascades. Finally, the cascade model is successfully applied to describe Internet TCP network traffic data, bringing new insights into their scaling properties and revealing a pitfall in existing techniques.
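The entry point to any wavelet-based scaling analysis is the logscale diagram: wavelet detail energy per dyadic scale, plotted in log-log, whose slope encodes the scaling exponent. The toy below uses crude Haar-style block details on a random walk as a stand-in for a traffic trace; the slope-to-Hurst relation stated in the comment (slope about 2H under this unnormalized convention) is my own back-of-envelope calibration, not the paper's estimator, which targets the far richer cascade propagator.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy self-similar signal: a random walk (Hurst exponent H = 0.5),
# standing in for a cumulative traffic trace.
x = np.cumsum(rng.normal(size=2 ** 14))

# Haar-style detail energy at dyadic scales j = 1..8: difference of
# adjacent half-block means within blocks of length 2**j.
logE = []
for j in range(1, 9):
    n = 2 ** j
    blocks = x[: len(x) // n * n].reshape(-1, n)
    half = n // 2
    d = blocks[:, :half].mean(1) - blocks[:, half:].mean(1)
    logE.append(np.log2(np.mean(d ** 2)))

# Scaling shows up as a straight line in the logscale diagram; for an
# fBm-like signal this unnormalized slope is roughly 2H (assumption).
slope, _ = np.polyfit(np.arange(1, 9), logE, 1)
```

A single straight-line fit like this is exactly what cascade models generalize: they allow the effective exponent to drift across scales, which is the behavior the authors report in TCP traffic.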
Title: Exploring permutation inconsistency in blind separation of speech signals in a reverberant environment
Authors: M. Ikram, D. Morgan
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.859141

We study and explore the limitations of methods for blind separation of a mixture of multiple speakers in a real reverberant environment. To support our results, we analyze a frequency-domain method, which achieves blind source separation (BSS) by transforming the time-domain convolutive problem into multiple short-term problems in the frequency domain. We show that treating the problem independently at different frequency bins introduces a "permutation inconsistency" problem, which becomes worse as the length of the room impulse response increases. Our studies show that the ideas proposed in the existing literature cannot effectively handle this problem, and that a satisfactory solution is still needed. We speculate that time-domain BSS techniques may also suffer from an equivalent permutation inconsistency problem when long unmixing filters are used.
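The permutation inconsistency itself is easy to demonstrate without running any ICA: if each frequency bin is separated independently, each bin's outputs come back in an arbitrary order. The toy below scrambles per-bin source envelopes and then applies one of the alignment heuristics discussed in the literature (correlating envelopes across adjacent bins). The envelopes and noise level are invented, and the clean result here is exactly what the paper argues does not survive realistic reverberation.

```python
import numpy as np

rng = np.random.default_rng(2)

T, F = 400, 32
# Hypothetical per-bin amplitude envelopes: two sources with distinct
# temporal activity patterns, shared across all frequency bins.
env = np.abs(np.stack([np.sin(np.linspace(0, 6, T)),
                       np.cos(np.linspace(0, 11, T))]))
S = env[None].repeat(F, axis=0) + 0.05 * rng.random((F, 2, T))  # (F, 2, T)

# Independent per-bin separation leaves an unknown permutation per bin.
perms = rng.integers(0, 2, size=F)  # 0 = identity, 1 = swapped
Y = np.array([s[::-1] if p else s for s, p in zip(S, perms)])

# Alignment heuristic: swap a bin whenever its envelopes correlate
# better with the previous bin's envelopes after swapping.
fixed = Y.copy()
for f in range(1, F):
    keep = sum(np.corrcoef(fixed[f - 1, i], fixed[f, i])[0, 1] for i in range(2))
    swap = sum(np.corrcoef(fixed[f - 1, i], fixed[f, 1 - i])[0, 1] for i in range(2))
    if swap > keep:
        fixed[f] = fixed[f, ::-1]
```

In this noiseless toy the chain of pairwise decisions recovers a globally consistent ordering; with long room impulse responses the per-bin envelopes decorrelate, the pairwise decisions start failing, and errors propagate across bins, which is the failure mode the paper documents.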
Title: Reconstruction of chaotic dynamics using a noise-robust embedding method
Authors: W. Yoshida, S. Ishii, Masa-aki Sato
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.861907

In this article, we discuss the reconstruction of chaotic dynamics in a partial observation situation. As a function approximator, we employ a normalized Gaussian network (NGnet), which is trained by an on-line EM algorithm. In order to deal with the partial observation, we propose a new embedding method based on smoothing filters, which we call integral embedding. The NGnet is trained to learn the dynamical system in the integral coordinate space. Experimental results show that the trained NGnet is able to reproduce a chaotic attractor that well approximates the complexity and instability of the original chaotic attractor, even when the data contain substantial noise. Compared with our previous method using delay-coordinate embedding, the new method is more robust to noise and faster in learning.
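The contrast between the two embeddings can be sketched on a scalar chaotic observation. Delay embedding stacks raw lagged samples, so every coordinate carries the full observation noise; a smoothing-filter ("integral") embedding builds coordinates from progressively filtered versions of the signal. The concrete filter below (a first-order leaky integrator applied repeatedly) is an assumption for illustration, not the authors' exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)

# Scalar observation of the (chaotic) logistic map, plus observation noise.
x = np.empty(2000)
x[0] = 0.3
for t in range(1999):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])
noisy = x + 0.05 * rng.normal(size=x.size)

def delay_embed(s, dim, tau):
    # Classical delay-coordinate embedding: [s(t), s(t+tau), ...].
    n = len(s) - (dim - 1) * tau
    return np.column_stack([s[i * tau: i * tau + n] for i in range(dim)])

def integral_embed(s, dim, alpha=0.3):
    # Sketch of an integral embedding: coordinate 0 is the observation,
    # each further coordinate is one more pass of a leaky integrator.
    coords, cur = [], s.copy()
    for _ in range(dim):
        coords.append(cur.copy())
        out = np.empty_like(cur)
        acc = 0.0
        for t, v in enumerate(cur):
            acc = (1 - alpha) * acc + alpha * v  # first-order smoother
            out[t] = acc
        cur = out
    return np.column_stack(coords)

D = delay_embed(noisy, 3, 1)
I = integral_embed(noisy, 3)
```

Each extra smoothing pass attenuates the high-frequency observation noise, which is the intuition behind the reported noise robustness: the reconstructed state lives in coordinates where the noise has already been averaged down.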
Title: An oblivious robust digital watermark technique for still images using DCT phase modulation
Authors: Faisal Alturki, R. Mersereau
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.859218

Digital watermarking is the process of secretly embedding a short sequence of information inside a digital source without changing its perceptual quality. We present a new oblivious digital watermarking method for copyright protection of still images. The technique is based on modifying the signs of a subset of low-frequency DCT coefficients. Robustness to a number of standard image processing attacks is demonstrated using the criteria of the latest Stirmark test.
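A bare-bones version of sign-based DCT embedding and blind (oblivious) detection looks like this. The keyed coefficient positions and the bit payload are invented, and real schemes add perceptual masking and redundancy for Stirmark-level robustness; this only shows why detection needs no original image.

```python
import numpy as np

rng = np.random.default_rng(4)

def dct_matrix(n):
    # Orthonormal DCT-II basis (rows are basis vectors).
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    M[0] /= np.sqrt(2)
    return M

n = 32
C = dct_matrix(n)
img = rng.random((n, n)) * 255
coef = C @ img @ C.T                      # 2-D DCT

# Hypothetical embedding: force the signs of a keyed subset of
# low-frequency coefficients to carry the watermark bits.
key = [(2, 3), (3, 1), (4, 4), (1, 5)]
bits = [1, 0, 1, 1]
for (r, c), b in zip(key, bits):
    mag = abs(coef[r, c])
    coef[r, c] = mag if b else -mag       # sign encodes the bit

marked = C.T @ coef @ C                   # inverse DCT

# Oblivious detection: re-take the DCT of the marked image alone
# and read the signs; the original image is never needed.
coef2 = C @ marked @ C.T
recovered = [int(coef2[r, c] > 0) for r, c in key]
```

Embedding in signs rather than magnitudes is what makes the detector oblivious: a sign survives moderate amplitude distortion (compression, filtering) far better than an exact magnitude does.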
Title: Bias of feedback cancellation algorithms based on direct closed loop identification
Authors: J. Hellgren, U. Forssell
Pub Date: 2000-06-05
DOI: 10.1109/ICASSP.2000.859098

An adaptive filter can be used to cancel the undesired acoustic feedback in hearing aids. The adaptive algorithm studied in this paper uses the output and input signals of the hearing aid to continuously track the acoustic feedback path. The bias of the optimal estimate under a quadratic norm is analyzed. The results show the importance of having a good model of the input signal to the hearing aid, as errors in this model introduce bias in the estimate of the feedback path.
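The baseline identification setup can be sketched with an NLMS filter estimating the feedback path from the hearing-aid output and microphone signals. The path coefficients, signal levels, and step size are invented. Note the deliberate simplification: the output here is white noise uncorrelated with the external signal, so the estimate converges without bias, whereas the paper's subject is exactly the closed-loop case where the output is correlated with the external input and a poor input-signal model makes the quadratic-norm optimum biased.

```python
import numpy as np

rng = np.random.default_rng(5)

N, L = 5000, 8
# Hypothetical acoustic feedback path (FIR, receiver -> microphone).
f = np.array([0.0, 0.3, -0.2, 0.1, 0.05, 0.0, -0.02, 0.01])
out = rng.normal(size=N)        # hearing-aid output (white, open-loop probe)
ext = 0.1 * rng.normal(size=N)  # external signal at the microphone

w = np.zeros(L)                 # adaptive estimate of the feedback path
mu = 0.2
for t in range(L, N):
    u = out[t - L:t][::-1]              # most recent output samples
    mic = ext[t] + f @ u                # microphone: external + feedback
    e = mic - w @ u                     # prediction error
    w += mu * e * u / (u @ u + 1e-8)    # NLMS update
```

With the external signal acting as uncorrelated noise, `w` settles close to `f`; in actual closed-loop operation the same update inherits a bias term proportional to the correlation between output and external signal, which is what motivates modeling the input signal.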