Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389361
Title: IPA: improved phone modelling with recurrent neural networks
Authors: T. Robinson, M. Hochberg, S. Renals
Venue: Proceedings of ICASSP '94, IEEE International Conference on Acoustics, Speech and Signal Processing

This paper describes phone modelling improvements to the hybrid connectionist-hidden Markov model speech recognition system developed at Cambridge University. These improvements are applied to phone recognition on the TIMIT task and word recognition on the Wall Street Journal (WSJ) task. A recurrent net is used to map acoustic vectors to posterior probabilities of phone classes. The maximum likelihood phone or word string is then extracted using Markov models. The paper describes three improvements: connectionist model merging; explicit presentation of acoustic context; and improved duration modelling. The first is shown to provide a significant improvement in the TIMIT phone recognition rate, and all three provide an improvement in the WSJ word recognition rate.
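In hybrid connectionist-HMM systems of this kind, the network's phone posteriors are commonly converted to scaled likelihoods by dividing by the phone priors before Viterbi decoding. A minimal sketch of that conversion step (the three-phone example and the prior values are illustrative, not taken from the paper):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors):
    """Divide per-frame phone posteriors P(q|x) by the class priors P(q),
    giving likelihoods P(x|q) up to a constant, for use in Viterbi decoding."""
    return posteriors / priors

# Illustrative 3-phone example: two frames of recurrent-net outputs.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])   # relative phone frequencies in training data
scaled = posteriors_to_scaled_likelihoods(post, priors)
```

Rare phones get boosted and frequent phones attenuated, so the decoder scores frames by acoustic evidence rather than by class frequency.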
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389481
Title: Optimal entropy constrained scalar quantization for exponential and Laplacian random variables
Authors: G. Sullivan

This paper presents solutions to the entropy-constrained scalar quantizer (ECSQ) design problem for two sources commonly encountered in image and speech compression applications: sources having exponential and Laplacian probability density functions. We obtain the optimal ECSQ either with or without an additional constraint on the number of levels in the quantizer. In contrast to prior methods, which require iterative solution of a large number of nonlinear equations, the new method needs only a single sequence of solutions to one-dimensional nonlinear equations (in some Laplacian cases, one additional two-dimensional solution is needed). As a result, the new method is orders of magnitude faster than prior ones. We also show that as the constraint on the number of levels in the quantizer is relaxed, the optimal ECSQ becomes a uniform threshold quantizer (UTQ) for exponential, but not for Laplacian sources.
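As a rough illustration of the uniform threshold quantizer (UTQ) structure that the paper identifies as the limiting optimum for exponential sources, the sketch below encodes with uniform thresholds and reconstructs at the centroid of each cell, exploiting the memorylessness of the exponential density. This is not the paper's design algorithm, only the UTQ structure itself; the step size and rate are arbitrary:

```python
import numpy as np

def utq_encode(x, step):
    """Uniform threshold quantizer: cell index for nonnegative samples."""
    return np.floor(x / step).astype(int)

def utq_decode(idx, step, lam):
    """Centroid reconstruction for an exponential(lam) source: by
    memorylessness, every cell [k*step, (k+1)*step) has the same offset
    1/lam - step/(e^(lam*step) - 1) from its lower edge."""
    return idx * step + 1.0 / lam - step / np.expm1(lam * step)

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # lam = 1 source
step = 0.5
xq = utq_decode(utq_encode(x, step), step, lam=1.0)
mse = float(np.mean((x - xq) ** 2))            # roughly step**2 / 12
```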
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389244
Title: Spectral quantization of cepstral coefficients
Authors: R. Hagen

Studies the cepstral coefficients as a suitable representation of the linear prediction filter for spectral coding purposes. Spectral coding methods in predictive speech coders are usually evaluated using the spectral distance measure. The average spectral distance, combined with a measure of the percentage of spectra with high distortion, is used to predict the perceptual quality when quantizing the prediction filter. The author shows that the spectral distance is equivalent to a squared error in the cepstral domain. Methods for spectral quantization using vector quantization of cepstral coefficients are analyzed. Better results than for quantization of line spectrum frequencies are reported both for single-stage VQ at 11-14 bits and for two-stage VQ at 18-22 bits. It is concluded that the cepstral coefficients are the right representation for LPC spectral coding purposes.
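The cepstral coefficients of an LPC filter can be computed directly from the prediction coefficients with the standard recursion. A sketch, assuming the common convention H(z) = 1 / (1 - sum_k a_k z^{-k}); sign conventions differ between texts, and this is background, not code from the paper:

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients of the LPC model H(z) = 1/(1 - sum_k a[k] z^-k),
    via the recursion c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}
    (with a_n = 0 for n beyond the model order)."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        cn = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            cn += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(cn)
    return c

# Single-pole check: for H(z) = 1/(1 - 0.5 z^-1), c_n = 0.5**n / n.
c = lpc_to_cepstrum([0.5], n_ceps=4)
```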
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389330
Title: Segmentation of speech using speaker identification
Authors: L. Wilcox, Francine R. Chen, Don Kimber, V. Balasubramanian

This paper describes techniques for segmentation of conversational speech based on speaker identity. Speaker segmentation is performed using Viterbi decoding on a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker sub-networks are initialized using Baum-Welch training on data labeled by speaker, and are iteratively retrained based on the previous segmentation. If data labeled by speaker is not available, agglomerative clustering is used to approximately segment the conversational speech according to speaker prior to Baum-Welch training. The distance measure for the clustering is a likelihood ratio in which speakers are modeled by Gaussian distributions. The distance between merged segments is recomputed at each stage of the clustering, and a duration model is used to bias the likelihood ratio. Segmentation accuracy using agglomerative clustering initialization matches accuracy using initialization with speaker labeled data.
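The likelihood-ratio distance used for the clustering can be sketched for scalar features and single Gaussians: the distance is large when one shared model fits the two segments much worse than separate per-segment models. The one-dimensional features, sample sizes, and speaker means below are illustrative simplifications of the paper's setup:

```python
import numpy as np

def gauss_loglik(x):
    """Maximized log-likelihood of samples under a Gaussian fitted to them."""
    return -0.5 * len(x) * (np.log(2 * np.pi * np.var(x)) + 1)

def glr_distance(x, y):
    """Generalized likelihood-ratio distance between two segments: how much
    worse a single shared Gaussian fits than separate per-segment Gaussians.
    Always nonnegative; near zero for same-speaker segments."""
    return gauss_loglik(x) + gauss_loglik(y) - gauss_loglik(np.concatenate([x, y]))

rng = np.random.default_rng(0)
seg_a = rng.normal(0.0, 1.0, 200)   # "speaker A"
seg_b = rng.normal(0.0, 1.0, 200)   # same distribution as A
seg_c = rng.normal(5.0, 1.0, 200)   # a very different "speaker"
```

Agglomerative clustering would repeatedly merge the pair of segments with the smallest such distance.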
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389465
Title: A novel tree-structured video coder
Authors: F. D. Natale, G. Desoli, D. Giusto

A novel approach to video coding at very low bit rates is presented. It differs significantly from most previous approaches in that it uses a spline-like interpolation scheme in a spatiotemporal domain. This operator is applied to a non-uniform 3D grid (built on sets of consecutive frames) so as to allocate the information adaptively. The proposed method allows full exploitation of intra/inter-frame correlations and gives good objective and visual quality of the reconstructed sequences.
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389800
Title: On the equivalence between Gamma and Laguerre filters
Authors: T. E. O. Silva

Proves the equivalence between the Gamma and Laguerre filters. Applying the optimality conditions for Gamma filters, which are easy to obtain, the author arrives at the optimality conditions for Laguerre filters. Curiously, these conditions are the same as those of a truncated Laguerre series approximation, which corresponds to the use of an impulse as the input of the Laguerre filter. The author illustrates these results with an example and also investigates the relative merits of both structures in an adaptive filter setup.
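A Gamma filter is a tap line of identical leaky integrators; for mu = 1 it degenerates to an ordinary FIR tapped delay line, and the Laguerre filter differs by an all-pass normalization of the stages. A minimal sketch of the standard Gamma structure (background, not code from the paper):

```python
def gamma_filter(x, w, mu):
    """Gamma filter: g_0[n] = x[n] and, for k >= 1,
    g_k[n] = (1 - mu) * g_k[n-1] + mu * g_{k-1}[n-1];
    the output is the weighted sum of the taps, y[n] = sum_k w[k] * g_k[n]."""
    K = len(w)
    g = [0.0] * K            # g[0] = current input, g[1:] = integrator states
    y = []
    for xn in x:
        g_prev = g[:]        # states at time n-1
        g[0] = xn
        for k in range(1, K):
            g[k] = (1 - mu) * g_prev[k] + mu * g_prev[k - 1]
        y.append(sum(wk * gk for wk, gk in zip(w, g)))
    return y

# With mu = 1 each stage is a pure unit delay, i.e. an FIR filter.
impulse = [1.0, 0.0, 0.0, 0.0]
y = gamma_filter(impulse, w=[1.0, 2.0, 3.0], mu=1.0)
```

For 0 < mu < 1 the memory depth of the tap line stretches beyond its order, which is the point of the structure.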
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389510
Title: Statistical analysis of the median based multi-shell order-statistics filters
Authors: J. J. Li, A. Ramsingh

The multi-shell median filters have been shown to be effective in preserving image details as well as in suppressing impulsive noise. In this paper, a statistical analysis of a general class of median-based multi-shell order-statistics filters is presented. Using statistical threshold decomposition together with a tri-tree structure, the statistical properties of the filters are derived. Based on these results, a 2-D nonlinear filter that offers a good compromise between noise attenuation and detail preservation can be obtained for various applications.
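As background for the analysis, a plain 3x3 median filter shows the basic impulse-suppressing operation that multi-shell variants build on; the paper's actual multi-shell structure and tri-tree decomposition are not reproduced here:

```python
import numpy as np

def median3x3(img):
    """Plain 3x3 median filter; border pixels are left unchanged."""
    out = img.copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = np.median(img[i - 1:i + 2, j - 1:j + 2])
    return out

# A single impulse on a flat background is removed entirely:
img = np.full((5, 5), 10.0)
img[2, 2] = 255.0
clean = median3x3(img)
```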
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389867
Title: Spectrum reuse using transmitting antenna arrays with feedback
Authors: D. Gerlach, A. Paulraj

Currently, a central base station communicates simultaneously with several mobile users by allocating a separate time or frequency channel for each mobile to prevent undesired crosstalk. However, each time or frequency channel may be reused among several mobiles by means of an antenna array at the base station that points a separate beam at each user. The downlink beamformer would normally operate in an "open loop" mode, in which the base steers a mainlobe in the direction of each mobile. Such a system may operate effectively in a free-space environment with no multipath; in the presence of scattering, open loop methods will not perform adequately. A new "closed loop" technique is presented in which each mobile user feeds back to the base estimates of the received signal amplitudes. Using feedback, the base station can achieve precision beamforming, resulting in lower crosstalk and improved signal separation even in strongly scattering environments.
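One simple way to see what accurate channel knowledge buys the base station: with each user's channel vector known (e.g. from feedback), zero-forcing transmit weights can null crosstalk exactly. The 3-element array and the channel values below are hypothetical, and zero-forcing is one illustrative choice rather than the paper's method:

```python
import numpy as np

# Hypothetical downlink: 3 base-station antennas, 2 users. Row i of H is
# user i's channel vector as estimated from that user's feedback.
H = np.array([[1.0 + 0.2j, 0.5 - 0.1j, 0.3 + 0.4j],
              [0.2 - 0.3j, 1.1 + 0.0j, 0.6 + 0.2j]])

# Zero-forcing transmit weights: column i of W is the beamforming vector
# for user i, chosen so each user hears its own signal with unit gain
# and zero crosstalk from the other user's signal.
W = np.linalg.pinv(H)        # shape (3, 2)

coupling = H @ W             # ideally the 2x2 identity matrix
```

Open-loop steering toward a nominal direction cannot achieve this cancellation when scattering makes the true channel differ from the assumed one, which is the motivation for feedback.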
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.390039
Title: Channel equalization with perceptrons: an information-theoretic approach
Authors: T. Adalı, M. Sönmez

We formulate adaptive channel equalization as a conditional probability distribution learning problem. The conditional probability density function of the transmitted signal given the received signal is parametrized by a sigmoidal perceptron. In this framework, we use the relative entropy (Kullback-Leibler distance) between the true and the estimated distributions as the cost function to be minimized. The true probabilities are approximated by their stochastic estimators, resulting in a stochastic relative entropy cost function. This function is well-formed in the sense of Wittner and Denker (1988); therefore gradient descent on this cost function is guaranteed to find a solution. The consistency and asymptotic normality of this learning scheme are shown via maximum partial likelihood estimation of logistic models. As a practical example, we demonstrate that the resulting algorithm successfully equalizes multipath channels.
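The scheme can be sketched as a sigmoidal perceptron on a short window of received samples, trained by gradient descent on the cross-entropy (the sample estimate of the relative-entropy cost). The channel model, window length, learning rate, and iteration count below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
bits = rng.integers(0, 2, n)
s = 2.0 * bits - 1.0                      # BPSK symbols in {-1, +1}
# Illustrative dispersive channel: current symbol plus 0.4x the previous
# symbol plus Gaussian noise.
r = s + 0.4 * np.concatenate(([0.0], s[:-1])) + 0.1 * rng.standard_normal(n)

# Features: a 3-tap window of received samples around each symbol
# (np.roll wraps at the ends, so the first and last samples are dropped).
X = np.stack([np.roll(r, -1), r, np.roll(r, 1)], axis=1)[1:-1]
y = bits[1:-1]

# Sigmoidal perceptron trained by gradient descent on the cross-entropy,
# the sample estimate of the relative-entropy cost function.
w, b = np.zeros(3), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= X.T @ (p - y) / len(y)
    b -= float(np.mean(p - y))

accuracy = float(np.mean((p > 0.5) == y))
```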
Pub Date: 1994-04-19 | DOI: 10.1109/ICASSP.1994.389348
Title: The voice across Japan database - the Japanese language contribution to Polyphone
Authors: Thomas Staples, J. Picone, Nozomi Arai

Texas Instruments' Voice Across Japan (VAJ) database, modeled after the highly successful Voice Across America project, consists of a wide range of speech material including digit strings, yes/no questions, and phonetically rich read sentences. The data is being collected over long-distance telephone lines through an analog telephone interface. The target size is 14 items per speaker across 10,000 speakers. Greater emphasis is being placed on the collection of phonetically rich read sentence data. Four randomly selected sentences are included in each session: one from the 512-sentence ATR PB set, and three from a 10,000-sentence set developed specifically for this project. This latter sentence set, designed to maximize the triphone coverage of the database, is described. The VAJ database is planned to be included in the Linguistic Data Consortium's (LDC) Polyphone (multi-language) database.
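A sentence set that maximizes triphone coverage is often built greedily: repeatedly pick the sentence that contributes the most unseen triphones. The toy phone-string corpus below is hypothetical, and the paper's actual selection procedure is not specified here; this is just one common way to realize the stated design goal:

```python
def triphones(sentence):
    """All overlapping 3-phone sequences in a space-separated phone string."""
    phones = sentence.split()
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}

def greedy_select(corpus, k):
    """Greedily choose k sentences maximizing cumulative triphone coverage."""
    covered, chosen = set(), []
    pool = list(corpus)
    for _ in range(k):
        best = max(pool, key=lambda s: len(triphones(s) - covered))
        pool.remove(best)
        chosen.append(best)
        covered |= triphones(best)
    return chosen, covered

# Toy corpus of phone strings; "a b c" is redundant once "a b c d" is chosen.
corpus = ["a b c d", "a b c", "d e f g"]
chosen, covered = greedy_select(corpus, k=2)
```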