Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820799
Danwei Cai, Weicheng Cai, Zhidong Ni, Ming Li
In this paper, we apply Locality Sensitive Discriminant Analysis (LSDA) to a speaker verification system for intersession variability compensation. Unlike LDA, which fails to discover the local geometrical structure of the data manifold, LSDA finds a projection that maximizes the margin between i-vectors from different speakers in each local area. Since the number of samples per class varies over a wide range, we improve LSDA by using an adaptive number of nearest neighbors in each class and modifying the corresponding within- and between-class weight matrices, so that each class carries equal weight in LSDA's objective function. Experiments were carried out on the NIST 2010 speaker recognition evaluation (SRE) extended condition 5 female task; the results show that the proposed adaptive-k-nearest-neighbor LSDA method significantly improves on the conventional i-vector/PLDA baseline, with an 18% relative cost reduction and a 28% relative equal error rate reduction.
Title: Locality sensitive discriminant analysis for speaker verification
Published in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
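As a rough illustration of the locality-sensitive idea, the sketch below builds k-nearest-neighbor within- and between-class graphs and solves a generalized eigenproblem for a discriminant projection. The objective is a simplified stand-in for the paper's LSDA formulation, the k is fixed rather than adaptive (the adaptive per-class k is the paper's contribution), and all parameter names are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lsda_projection(X, y, k=3, dim=2, alpha=0.5):
    """Sketch of a locality-sensitive discriminant projection.

    Builds within-class (Ww) and between-class (Wb) k-NN affinity
    graphs, then solves a generalized eigenproblem combining the
    between-class Laplacian and the within-class affinity -- a
    simplified, hypothetical form of the LSDA objective.
    """
    n = X.shape[0]
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)      # pairwise sq. distances
    Ww = np.zeros((n, n)); Wb = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(d2[i])
        same = [j for j in order if j != i and y[j] == y[i]][:k]
        diff = [j for j in order if y[j] != y[i]][:k]
        Ww[i, same] = 1.0; Wb[i, diff] = 1.0
    Ww = np.maximum(Ww, Ww.T); Wb = np.maximum(Wb, Wb.T)  # symmetrize graphs
    Dw = np.diag(Ww.sum(1)); Db = np.diag(Wb.sum(1))
    Lb = Db - Wb                                        # between-class Laplacian
    A = X.T @ (alpha * Lb + (1 - alpha) * Ww) @ X
    B = X.T @ Dw @ X + 1e-6 * np.eye(X.shape[1])        # regularize for stability
    vals, vecs = eigh(A, B)
    return vecs[:, ::-1][:, :dim]                       # top eigenvectors
```

An adaptive variant would choose k per class (e.g. proportional to class size) before building the graphs, which is what equalizes each class's contribution to the objective.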
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820681
Mizuki Murayama, Daisuke Oguro, H. Kikuchi, H. Huttunen, Yo-Sung Ho, Jaeho Shin
A divergence-based similarity measure between two color images, built on the Jensen-Shannon divergence, is presented to quantify color-distribution similarity. Subjective assessment experiments were conducted to obtain mean opinion scores (MOS) for the test images, and the divergence similarity was found to correlate with the MOS values at a statistically significant level.
Title: Color-distribution similarity by information theoretic divergence for color images
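One way to realize a Jensen-Shannon-based color-distribution similarity is sketched below. The 1 − JSD mapping and the 8-bin joint RGB histograms are illustrative assumptions, not necessarily the paper's exact definition; with base-2 logarithms the JSD is bounded by 1, so the score lands in [0, 1].

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def color_similarity(img1, img2, bins=8):
    """Similarity in [0, 1] from the JSD of joint RGB histograms.

    The 1 - JSD mapping is a hypothetical choice for illustration.
    """
    h1, _ = np.histogramdd(img1.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
    h2, _ = np.histogramdd(img2.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
    return 1.0 - js_divergence(h1.ravel(), h2.ravel())
```

Identical images score 1; images with disjoint color distributions approach 0.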
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820865
Mu Yang, Li Su, Yi-Hsuan Yang
A musical chord is usually described by its root note and chord type. While a substantial amount of work in music information retrieval (MIR) has addressed automatic chord recognition, the role of the root note in this task has seldom received specific attention. In this paper, we present a new approach, with empirical studies, that improves chord recognition accuracy by properly highlighting root-note information. At the signal level, we combine spectral features with features derived from the cepstrum to improve the identification of low pitches, which usually correspond to root notes. At the model level, we propose a neural-network-based multi-task learning framework that jointly considers chord recognition and root note recognition during training. We find that the improved accuracy can be attributed to better information about the sub-harmonics of the notes and to the emphasis on root notes in recognizing chords.
Title: Highlighting root notes in chord recognition using cepstral features and multi-task learning
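The signal-level idea can be sketched as follows: the cepstrum, the inverse DFT of the log-magnitude spectrum, turns the regular harmonic spacing of a pitched note into a peak at the pitch period, which helps locate low (root) pitches. Frame and FFT sizes here are illustrative, not the paper's settings.

```python
import numpy as np

def log_spectrum_and_cepstrum(frame, n_fft=1024, n_ceps=120):
    """Compute a log-magnitude spectrum and its (real) cepstrum.

    The cepstrum exposes periodic harmonic spacing as a peak at the
    pitch period (in samples), complementing plain spectral features.
    """
    win = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * win, n_fft))
    log_spec = np.log(spec + 1e-10)
    ceps = np.fft.irfft(log_spec, n_fft)[:n_ceps]   # keep low-quefrency part
    return log_spec, ceps
```

For a harmonic tone with fundamental f0 sampled at fs, the cepstral peak sits near quefrency fs/f0 samples.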
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820710
Jaemoon Lim, Minhyeok Heo, Chulwoo Lee, Chang-Su Kim
We propose a novel noisy low-light image enhancement algorithm based on structure-texture-noise (STN) decomposition. We split an input image into structure, texture, and noise components and enhance the structure and texture components separately. Specifically, we first enhance the contrast of the structure image by extending a 2D histogram-based image enhancement scheme according to the characteristics of low-light images. Then, we reconstruct the texture image by retrieving texture components from the noise image and enhance it by exploiting the perceptual response of the human visual system. Experimental results demonstrate that the proposed STN algorithm sharpens texture and enhances contrast more effectively than conventional algorithms, while removing noise without artifacts.
Title: Enhancement of noisy low-light images via structure-texture-noise decomposition
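A minimal three-way split in the spirit of the STN decomposition can be sketched as below. The Gaussian base layer, median-filtered detail layer, and their residual are illustrative stand-ins for the paper's actual decomposition operators; by construction the three components sum back to the input.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def stn_decompose(img, sigma=3.0, noise_win=3):
    """Rough structure-texture-noise split of a grayscale image.

    Structure: Gaussian-smoothed base layer.  Texture: median-filtered
    detail layer.  Noise: the remaining residual.  Filter choices and
    parameters are hypothetical, not the paper's.
    """
    structure = gaussian_filter(img.astype(float), sigma)
    detail = img - structure
    texture = median_filter(detail, size=noise_win)  # suppress impulsive noise
    noise = detail - texture
    return structure, texture, noise
```

Each component can then be processed separately (contrast stretch on structure, perceptual boost on texture) and the enhanced layers recombined.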
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820703
Zhen Wei, Zhizheng Wu, Lei Xie
Predicting articulatory movements from speech or text has potential benefits for many speech-related applications. Many approaches have been proposed for the acoustic-to-articulatory inversion problem, which has received far more attention than predicting articulatory movements from text. In this paper, we investigate the feasibility of using a deep neural network (DNN) for articulatory movement prediction from text. To improve prediction performance, we combine full-context features and state and phone information with stacked bottleneck features, which provide wide linguistic context, as the network input. On the MNGU0 data set, our DNN approach achieves a root mean squared error (RMSE) of 0.7370 mm, the lowest RMSE reported in the literature. We also confirm the effectiveness of stacked bottleneck features, which capture important contextual information.
Title: Predicting articulatory movement from text using deep architecture with stacked bottleneck features
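The "stacking" step that gives the network wide temporal context can be sketched as follows: each frame's input is the concatenation of the frame itself and several neighbors on each side. The window size is an assumed value, not the paper's configuration.

```python
import numpy as np

def stack_context(feats, context=5):
    """Stack neighboring frames to form wide-context inputs.

    For each of the T frames, concatenate it with `context` frames on
    each side (edges padded by repetition), yielding an array of shape
    (T, (2*context + 1) * D).  This mirrors how stacked bottleneck
    features supply wide linguistic context to the DNN.
    """
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    idx = np.arange(-context, context + 1)
    return np.stack([padded[t + context + idx].ravel() for t in range(T)])
```

In a bottleneck setup, `feats` would be the bottleneck-layer activations of a first-pass network, and the stacked output feeds the second-pass regression network.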
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820705
Wei Zhou, Chang Yan, Henglu Wei, Guanwen Zhang, Ai Qing, Xin Zhou
Variable transform block (TB) sizes cause high computational complexity at the HEVC encoder. In this paper, a fast residual quad-tree (RQT) structure decision method is proposed to reduce the number of candidate transform sizes. The proposed method uses spatial and temporal correlation information from neighboring blocks to predict the depth of the current RQT. In addition, an efficient all-zero block (AZB) detection approach is designed to accelerate transform and quantization. Finally, a scheme based on the number of nonzero DCT coefficients (NNZ) is integrated into the proposed method to terminate the recursive RQT mode decision process early. Experimental results show that the proposed method reduces the computational complexity of the RQT structure decision by 70% on average, with negligible BD-BR and BD-PSNR changes of 1.13% and −0.048 dB, respectively.
Title: Fast RQT structure decision method for HEVC
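The AZB idea can be illustrated with a simple check: transform the residual block and test whether every coefficient would quantize to zero, in which case transform/quantization work for that block can be skipped. The dead-zone threshold here is an illustrative simplification; real HEVC AZB detection uses tighter, standard-specific bounds.

```python
import numpy as np
from scipy.fft import dctn

def is_all_zero_block(residual, qstep):
    """All-zero-block (AZB) test for a residual block.

    Applies a 2-D DCT and checks whether every coefficient falls
    inside the quantizer dead zone (illustrative threshold: the
    quantization step itself).  A true AZB lets the encoder skip
    the full transform/quantization path for this block.
    """
    coeffs = dctn(residual.astype(float), norm="ortho")
    return bool(np.all(np.abs(coeffs) < qstep))
```

A complementary NNZ count on the surviving coefficients can then drive the early termination of the recursive RQT depth search.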
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820804
Tao Long, J. Benesty, Jingdong Chen
Noise reduction has long been an active research topic in signal processing, and many algorithms have been developed over the last four decades. These algorithms have proved successful, to some degree, in improving the signal-to-noise ratio (SNR) and speech quality. However, one problem is common to all of them: the volume of the enhanced signal after noise reduction is often perceived as lower than that of the original signal, particularly when the SNR is low. In this paper, we develop two constrained Wiener gains and filters for noise reduction in the short-time Fourier transform (STFT) domain. They are derived by minimizing the mean-squared error (MSE) between the clean speech and its estimate, subject to the constraint that the sum of the variances of the filtered speech and the residual noise equals the variance of the noisy observation.
Title: Constrained Wiener gains and filters for single-channel and multichannel noise reduction
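For reference, the unconstrained baseline the paper builds on is the classical per-bin Wiener gain in the STFT domain, sketched below. The paper's constrained variants additionally rescale the gain so that the output variance matches the noisy input's (countering the perceived volume loss); only the standard gain is shown here, and the spectral floor is an assumed practical touch.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=0.05):
    """Classical single-channel Wiener gain per STFT bin.

    H = phi_x / (phi_x + phi_v), applied elementwise to the noisy
    STFT.  The floor limits over-suppression (musical noise); it is
    an illustrative addition, not part of the paper's derivation.
    """
    gain = speech_psd / (speech_psd + noise_psd)
    return np.maximum(gain, floor)
```

With PSD estimates per time-frequency bin, the enhanced STFT is simply `gain * noisy_stft`.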
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820892
Shunsuke Yamaki, Ryo Suzuki, M. Kawamata, M. Yoshizawa
This paper presents a statistical analysis of phase-only correlation functions between two signals with stochastic phase spectra. We derive the expectation and variance of the phase-only correlation function, treating the phase spectra of the two input signals as bivariate random variables. As a result, we express the expectation and variance in terms of the joint characteristic functions of the bivariate probability density function of the phase spectra.
Title: Statistical analysis of phase-only correlation functions between two signals with stochastic bivariate phase-spectra
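The object of the paper's analysis, the phase-only correlation (POC) function, is sketched below: normalize the cross-power spectrum to unit magnitude so that only phase information remains, then inverse-transform. For a pure delay the POC is a sharp peak at the lag, which is why its statistical behavior under random phase spectra matters.

```python
import numpy as np

def phase_only_correlation(f, g, eps=1e-12):
    """Phase-only correlation (POC) of two 1-D signals.

    Divides the cross-power spectrum by its magnitude (keeping only
    phase) before the inverse FFT; the paper derives the moments of
    this function when the phase spectra are random.
    """
    F, G = np.fft.fft(f), np.fft.fft(g)
    cross = F * np.conj(G)
    return np.real(np.fft.ifft(cross / (np.abs(cross) + eps)))
```

If `f` is `g` delayed by d samples, the POC peaks (with value near 1) at index d.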
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820728
Yuichi Tanaka, S. Yagyu, Akie Sakiyama, Masaki Onuki
We propose a method for calculating deformed pixel positions for mesh-based image retargeting. Image retargeting is a sophisticated resizing method that yields images of acceptable quality even when the target aspect ratio differs from the original. It often employs a mesh-based approach in which pixels are nodes of a graph and relationships between pixels are represented as its edges. In this paper, we reformulate the pixel position deformation of image retargeting as spectral graph filtering, using a graph signal processing approach. We validate the method on several image retargeting examples with appropriately designed filter kernels in the graph spectral domain.
Title: Mesh-based image retargeting with spectral graph filtering
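The spectral-graph-filtering machinery can be shown on a toy graph: eigendecompose the graph Laplacian, apply a kernel to the eigenvalues, and transform the signal back. The path graph and the exponential low-pass kernel here are illustrative choices, not the paper's image-mesh graph or its designed kernels.

```python
import numpy as np

def path_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.diag(A.sum(1)) - A

def graph_spectral_filter(signal, L, kernel):
    """Filter a graph signal in the graph spectral domain.

    Projects the signal onto the Laplacian's eigenbasis, scales each
    component by kernel(eigenvalue), and reconstructs -- the same
    operation the paper applies to pixel positions on an image mesh.
    """
    lam, U = np.linalg.eigh(L)
    return U @ (kernel(lam) * (U.T @ signal))
```

A constant signal lies in the Laplacian's zero-eigenvalue subspace, so any kernel with kernel(0) = 1 passes it through unchanged.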
Pub Date: 2016-12-01 | DOI: 10.1109/APSIPA.2016.7820721
Young-Sun Joo, Won-Suk Jun, Hong-Goo Kang
This paper proposes a cascading deep neural network (DNN) structure for speech synthesis that consists of text-to-bottleneck (TTB) and bottleneck-to-speech (BTS) models. Unlike the conventional single-network structure, which requires a large database to learn the complicated mapping between linguistic and acoustic features, the proposed structure is effective even when the available training data are limited. The bottleneck feature used in the proposed approach represents the characteristics of the linguistic features and the average acoustic features of several speakers. It is therefore more efficient to learn a mapping from bottleneck to acoustic features than to learn a direct mapping from linguistic to acoustic features. Experimental results show that the learning capability of the proposed structure is much higher than that of conventional structures, and objective and subjective listening tests verify its superiority.
Title: Efficient deep neural networks for speech synthesis using bottleneck features
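Structurally, the cascade is just two networks chained through a narrow bottleneck representation, as in the sketch below. The layer sizes, random weights, and tanh activations are assumptions for illustration; a real system trains both stages on data.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random-weight MLP layers (illustrative; a real system trains these)."""
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Forward pass: tanh on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.tanh(x)
    return x

# Cascade: text-to-bottleneck (TTB) then bottleneck-to-speech (BTS).
# All dimensions are assumed, not the paper's architecture.
ttb = mlp([300, 256, 64])     # linguistic features -> 64-dim bottleneck
bts = mlp([64, 256, 180])     # bottleneck -> acoustic features
linguistic = rng.normal(size=(10, 300))
acoustic = forward(bts, forward(ttb, linguistic))
```

The 64-dimensional intermediate output is the bottleneck feature: the TTB stage only has to reach this compact, speaker-averaged target, and the BTS stage maps it on to acoustic features.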