Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9746763
Fan Zhang, Mei Tu, Song Liu, Jinyao Yan
To improve the performance of Automatic Speech Recognition (ASR), it is common to deploy an error correction module at the post-processing stage to correct recognition errors. In this paper, we propose 1) an error correction model that takes both contextual and phonetic information into account through two channels; and 2) a self-supervised learning method for the model. First, an error region detection model detects the error regions of the ASR output. Then, we perform dual-channel feature extraction for the error regions: one channel extracts their contextual information with a pre-trained language model, while the other builds their phonetic information. At the training stage, we construct error patterns at the phoneme level, which simplifies data annotation and allows us to leverage large amounts of unlabeled data to train our model in a self-supervised manner. Experimental results on different test sets demonstrate the effectiveness and robustness of our model.
{"title":"ASR Error Correction with Dual-Channel Self-Supervised Learning","authors":"Fan Zhang, Mei Tu, Song Liu, Jinyao Yan","doi":"10.1109/icassp43922.2022.9746763","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746763","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9747039
Farah Cherfaoui, H. Kadri, L. Ralaivola
The Nyström method, an efficient technique for approximating Gram matrices, builds upon a small subset of the data called landmarks, whose choice impacts the quality of the approximated Gram matrix. Various sampling methods have been proposed in the literature to choose such a subset, some of which are based on ridge leverage scores and come with good theoretical and practical results. Nevertheless, direct computation of ridge leverage scores has a Θ(n³) computation cost, where n is the number of data points, which is prohibitive when n is large. To tackle this problem, we propose a Θ(n) divide-and-conquer (DAC) method to approximate ridge leverage scores, and we provide theoretical guarantees and empirical results regarding their ability to blend with the Nyström approximation strategy. Our experimental results show that the proposed approximate leverage score sampling scheme achieves a good trade-off between predictive performance and running time.
{"title":"Scalable Ridge Leverage Score Sampling for the Nyström Method","authors":"Farah Cherfaoui, H. Kadri, L. Ralaivola","doi":"10.1109/icassp43922.2022.9747039","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747039","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
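As a point of reference for the Nyström abstract above, the sketch below computes exact ridge leverage scores, i.e. the Θ(n³) baseline the abstract calls prohibitive, and uses them to sample landmarks. It is not the authors' divide-and-conquer method. The RBF kernel, the regularization value `lam`, the landmark count, and sampling without replacement are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared Euclidean distances, then a Gaussian kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def ridge_leverage_scores(K, lam):
    # Exact scores: diagonal of (K + n*lam*I)^{-1} K, which equals the
    # diagonal of K (K + n*lam*I)^{-1}. The linear solve costs Theta(n^3).
    n = K.shape[0]
    return np.diag(np.linalg.solve(K + n * lam * np.eye(n), K)).copy()

def nystrom(K, landmarks):
    # Nystrom approximation K ~ C W^+ C^T with C = K[:, S] and W = K[S, S].
    C = K[:, landmarks]
    W = K[np.ix_(landmarks, landmarks)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic data (assumption)
K = rbf_kernel(X, gamma=0.5)

scores = ridge_leverage_scores(K, lam=1e-3)
probs = scores / scores.sum()            # sampling distribution over points
landmarks = rng.choice(len(X), size=40, replace=False, p=probs)

K_hat = nystrom(K, landmarks)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative Frobenius error: {rel_err:.3f}")
```

The call to `np.linalg.solve` on the full n × n matrix is the step that an approximate scheme would need to avoid for large n.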
Pub Date : 2022-05-23 DOI: 10.1109/ICASSP43922.2022.9746441
Zhong Zhang, Tao Jiang, Weiyong Yu
A reconfigurable intelligent surface (RIS) is capable of intelligently manipulating the phases of the incident electromagnetic wave to improve the wireless propagation environment between the base station (BS) and the users. This paper addresses the joint user scheduling, RIS configuration, and BS beamforming problem in an RIS-assisted downlink network with limited pilot overhead. We show that graph neural networks (GNNs) with permutation invariance and equivariance properties can be used to appropriately schedule users and to design RIS configurations that achieve high overall throughput while accounting for fairness among the users. Compared to the conventional methodology of first estimating the channels and then optimizing the user schedule, RIS configuration, and beamformers, this paper shows that an optimized user schedule can be obtained directly from a very short set of pilots using a GNN, the RIS configuration can then be optimized using a second GNN, and finally the BS beamformers can be designed based on the overall effective channel. Numerical results show that the proposed approach utilizes received pilots more efficiently than the conventional channel-estimation-based approach.
{"title":"User Scheduling Using Graph Neural Networks for Reconfigurable Intelligent Surface Assisted Multiuser Downlink Communications","authors":"Zhong Zhang, Tao Jiang, Weiyong Yu","doi":"10.1109/ICASSP43922.2022.9746441","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9746441","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9746081
S. Sulis, D. Mary, L. Bigot
We propose a numerical methodology for detecting periodicities in unknown colored noise and for evaluating the ‘significance levels’ (p-values) of the test statistics. The procedure assumes and leverages the existence of a set of time series obtained under the null hypothesis (a null training sample, NTS) and possibly complementary side information. The test statistic is computed from a standardized periodogram, which is the pointwise division of the periodogram of the series under test by an averaged periodogram obtained from the NTS. The procedure provides accurate p-value estimates through a dedicated Monte Carlo procedure. While the methodology is general, our application here is exoplanet detection. The proposed methods are benchmarked on astrophysical data.
{"title":"Semi-Supervised Standardized Detection of Periodic Signals with Application to Exoplanet Detection","authors":"S. Sulis, D. Mary, L. Bigot","doi":"10.1109/icassp43922.2022.9746081","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746081","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
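The standardized periodogram described in the abstract above can be sketched in a few lines. This is a simplified illustration, not the authors' procedure: the AR(1) noise generator, the choice of the largest standardized peak as test statistic, and the Monte Carlo loop that redraws from a known noise model are all assumptions made here so that the example is self-contained. In the abstract's setting the noise is unknown and only the NTS is available.

```python
import numpy as np

def periodogram(x):
    # Squared DFT magnitude at the positive frequencies, scaled by 1/N.
    return np.abs(np.fft.rfft(x)[1:]) ** 2 / len(x)

def standardized_max(x, nts):
    # Pointwise ratio of the test periodogram to the NTS-averaged
    # periodogram; the statistic used here is the largest ratio.
    avg = np.mean([periodogram(s) for s in nts], axis=0)
    return np.max(periodogram(x) / avg)

def ar1(n, phi, rng):
    # Hypothetical colored-noise model, used only for this illustration.
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

rng = np.random.default_rng(1)
n, n_nts = 256, 20
nts = [ar1(n, 0.8, rng) for _ in range(n_nts)]       # null training sample

# Series under test: colored noise plus a sinusoid at 0.25 cycles/sample.
t = np.arange(n)
x = ar1(n, 0.8, rng) + 1.5 * np.sin(2 * np.pi * 0.25 * t)
stat = standardized_max(x, nts)

# Monte Carlo null distribution: redraw both the test series and the NTS.
null = np.array([
    standardized_max(ar1(n, 0.8, rng), [ar1(n, 0.8, rng) for _ in range(n_nts)])
    for _ in range(200)
])
p_value = (1 + np.sum(null >= stat)) / (1 + len(null))
print(f"statistic = {stat:.1f}, estimated p-value = {p_value:.4f}")
```

Dividing by the averaged NTS periodogram flattens the noise spectrum, so a single threshold can be applied across all frequencies even though the noise is colored.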
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9747652
Bowen Li, Xin Yao, Chao Li, Youneng Bao, Fanyang Meng, Yongsheng Liang
Recently, learned image compression methods have shown outstanding rate-distortion performance compared to traditional frameworks. Although considerable progress has been made in learned image compression, the computation cost remains high. To address this problem, we propose AdderIC, which utilizes adder neural networks (AdderNet) to construct an image compression framework. Based on the characteristics of image compression, we introduce several strategies to improve the performance of AdderNet in this field. Specifically, the Haar wavelet transform is adopted so that AdderIC learns high-frequency information efficiently. In addition, implicit deconvolution with a kernel size of 1 is applied after each adder layer to reduce spatial redundancies. Moreover, we develop a novel Adder-ID-PixelShuffle cascade upsampling structure to remove checkerboard artifacts. Experiments demonstrate that our AdderIC model largely outperforms conventional AdderNet when applied to image compression and achieves rate-distortion performance comparable to that of its CNN baseline, with reductions of about 80% in multiplication FLOPs and 30% in energy consumption.
{"title":"AdderIC: Towards Low Computation Cost Image Compression","authors":"Bowen Li, Xin Yao, Chao Li, Youneng Bao, Fanyang Meng, Yongsheng Liang","doi":"10.1109/icassp43922.2022.9747652","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747652","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
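The AdderIC abstract above mentions the Haar wavelet transform without saying how it is wired into the network. As background only, here is a minimal sketch of one level of the orthonormal 2-D Haar transform and its inverse; the function names are hypothetical and the sub-band naming is one common convention among several.

```python
import numpy as np

def haar2d(img):
    # One level of the orthonormal 2-D Haar transform on an even-sized
    # image: each 2x2 block yields one coefficient per sub-band.
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # local average (low frequency)
    lh = (a - b + c - d) / 2.0   # difference across columns
    hl = (a + b - c - d) / 2.0   # difference across rows
    hh = (a - b - c + d) / 2.0   # diagonal difference
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    # Exact inverse of haar2d: rebuild each 2x2 block from its coefficients.
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))        # stand-in image (assumption)
ll, lh, hl, hh = haar2d(img)
rec = ihaar2d(ll, lh, hl, hh)
print(np.allclose(rec, img))
```

The three detail sub-bands isolate high-frequency content explicitly, and the transform itself needs only additions, subtractions, and a fixed scaling.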
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9746640
Lin Wang, Wanqian Zhang, Dayan Wu, Pingting Hong, Bo Li
Person re-identification (ReID) aims at retrieving images of the same person across non-overlapping camera views. Prior works focus on either fully supervised or unsupervised ReID settings and achieve remarkable performance. In real scenarios, however, the major annotation cost comes from matching identity classes across camera views, which leads to the Intra-Camera Supervised (ICS) ReID problem. In this work, we propose a Prototype-based Inter-camera ReID (PIRID) method, which tackles the ICS setting through the lens of prototype learning. Specifically, we first introduce intra-camera learning with non-parametric classifiers to separately generate discriminative features within each camera view. Moreover, inter-camera prototype learning provides prototypes as the representatives of each class in the common space, making the learned features camera-agnostic. Experiments conducted on three benchmarks, i.e., Market-1501, DukeMTMC-ReID, and MSMT17, show the superiority of our method.
{"title":"Prototype-Based Inter-Camera Learning for Person Re-Identification","authors":"Lin Wang, Wanqian Zhang, Dayan Wu, Pingting Hong, Bo Li","doi":"10.1109/icassp43922.2022.9746640","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746640","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9747150
Eyal Fishel Ben-Knaan, Yonina C. Eldar, Nir Shlezinger
The ongoing pandemic and the necessity of frequent testing have spurred a growing interest in pooled testing. Conventional methods for recovery from pooled tests are based on group testing or compressed sensing tools, which rely on simplistic modeling of the pooling process and may not be reliable in the presence of complex and noisy measurement procedures and highly infected populations. In this work, we propose a strategy for pooled testing designed for noisy settings, which bypasses the need for a tractable acquisition model. This is achieved by combining deep learning, for implicitly learning the measurement relationship from data, with factor graph inference, which exploits the structured, known pooling pattern. Learned factor graphs provide a quantitative readout corresponding to the infection severity, as opposed to group testing, which only detects the presence of infection. The proposed scheme is shown to achieve improved robustness to noise compared with previous approaches and to provide reliable estimates in highly infected populations.
{"title":"Recovery of Noisy Pooled Tests via Learned Factor Graphs with Application to COVID-19 Testing","authors":"Eyal Fishel Ben-Knaan, Yonina C. Eldar, Nir Shlezinger","doi":"10.1109/icassp43922.2022.9747150","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747150","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9747309
David Schenck, Katja Lübbe, Minh Trinh-Hoang, M. Pesavento
The Partial Relaxation framework has recently been introduced to address the Direction-of-Arrival (DOA) estimation problem [1]–[3]. DOA estimators under the Partial Relaxation (PR) framework are computationally efficient while preserving excellent DOA estimation accuracy. This is achieved by keeping the structure of the signal from the desired direction unchanged while relaxing the structure of the signals from the remaining undesired directions. This type of relaxation allows closed-form estimates to be computed for the undesired signal part and improves the accuracy of the DOA estimates compared to conventional spectral-search methods such as MUSIC. Following an approach similar to that of [4], the PR framework is combined with the Orthogonal Least Squares (OLS) technique of [5]. A novel DOA estimator is proposed that is based on Partially-Relaxed Weighted Subspace Fitting (PR-WSF), in which the DOAs are estimated iteratively. One DOA is estimated per iteration, while accounting for both the signal contributions under the previously determined DOAs, with full signal structure, and the remaining DOAs, with relaxed structure. Moreover, an efficient implementation of the Partially-Relaxed Orthogonal Least Squares Weighted Subspace Fitting (PR-OLS-WSF) method is proposed whose computational cost is similar to that of the MUSIC algorithm. Simulation results show that the proposed PR-OLS-WSF estimator provides excellent performance, especially in difficult scenarios with low Signal-to-Noise Ratio (SNR) and closely spaced sources.
{"title":"Partially Relaxed Orthogonal Least Squares Weighted Subspace Fitting Direction-of-Arrival Estimation","authors":"David Schenck, Katja Lübbe, Minh Trinh-Hoang, M. Pesavento","doi":"10.1109/icassp43922.2022.9747309","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747309","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/ICASSP43922.2022.9746744
Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao
This paper proposes ProsodySpeech, a novel prosody model that enhances encoder-decoder neural Text-To-Speech (TTS) to generate highly expressive and personalized speech even with very limited training data. First, we use a Prosody Extractor built from a large speech corpus with various speakers to generate a set of prosody exemplars from multiple reference speeches, in which Mutual Information based Style content separation (MIST) is adopted to alleviate the "content leakage" problem. Second, we use a Prosody Distributor to make a soft selection of appropriate prosody exemplars at the phone level with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of the text encoder, together with an additional phone-level pitch feature to enrich the prosody. We apply this method to two tasks: highly expressive multi-style/emotion TTS and few-shot personalized TTS. The experiments show that the proposed model outperforms the baseline FastSpeech 2 + GST, with significant improvements in similarity and style expression.
{"title":"Prosodyspeech: Towards Advanced Prosody Model for Neural Text-to-Speech","authors":"Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao","doi":"10.1109/ICASSP43922.2022.9746744","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9746744","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
Pub Date : 2022-05-23 DOI: 10.1109/icassp43922.2022.9746552
S. F. Seyyedsalehi, H. Rabiee
In this paper, we propose a novel hierarchical Bayesian model for the sparse regression problem arising in semi-supervised hyperspectral unmixing, which assumes that the signal recorded in each hyperspectral pixel is a linear combination of members of the spectral library contaminated by additive Gaussian noise. To effectively utilize the spatial correlation between neighboring pixels during the unmixing process, we exploit a Markov random field to simultaneously group pixels into clusters associated with regions of homogeneous mixtures in a natural scene. We assume that the sparse fractional abundances of the members of a cluster are generated from an exponential distribution with the same rate parameter. We show that our method is able to detect unconnected regions that have similar mixtures. Experiments on synthetic and real hyperspectral images confirm the superiority of the proposed method compared to alternatives.
{"title":"Improving Joint Sparse Hyperspectral Unmixing by Simultaneously Clustering Pixels According To Their Mixtures","authors":"S. F. Seyyedsalehi, H. Rabiee","doi":"10.1109/icassp43922.2022.9746552","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746552","journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)"},"publicationDate":"2022-05-23"}
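The generative assumptions stated in the hyperspectral unmixing abstract above (linear mixing of library spectra, additive Gaussian noise, exponentially distributed abundances whose rate is shared within a cluster) can be simulated as follows. This is a hedged sketch of the data model only, not the authors' inference method. The random spectral library, the use of one rate per cluster and library member, the specific rate values, and drawing cluster labels independently instead of from a Markov random field are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
bands, members, pixels, clusters = 50, 8, 100, 3

# Stand-in spectral library: one column per library member (assumption).
A = rng.uniform(0.0, 1.0, size=(bands, members))

# Cluster label per pixel, drawn independently here; the abstract instead
# places a Markov random field over the labels to encourage spatial grouping.
z = rng.integers(0, clusters, size=pixels)

# One exponential rate per (cluster, member). A large rate pushes that
# abundance toward zero, so each cluster has only a few active members.
rates = np.full((clusters, members), 200.0)
for c in range(clusters):
    active = rng.choice(members, size=2, replace=False)
    rates[c, active] = 2.0

# Abundances: pixels in the same cluster share the same rate parameters.
X = rng.exponential(scale=1.0 / rates[z])            # shape (pixels, members)

# Observed pixels: linear mixture of library spectra plus Gaussian noise.
Y = X @ A.T + 0.01 * rng.normal(size=(pixels, bands))
print(X.shape, Y.shape)
```

Because the rates are indexed by cluster and not by position, two spatially separate regions assigned to the same cluster share the same mixture statistics, which matches the abstract's remark about detecting unconnected regions with similar mixtures.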