Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8683867
Qing Shen, Wei Liu, Li Wang, Yin Liu
The target localization problem for distributed sensor array networks where a sub-array is placed at each receiver is studied, and under the compressive sensing (CS) framework, a group sparsity based two-dimensional localization method is proposed. Instead of fusing the separately estimated angles of arrival (AOAs), it processes the information collected by all the receivers simultaneously to form the final target locations. Simulation results show that the proposed localization method provides a significant performance improvement compared with the commonly used maximum likelihood estimator (MLE).
Title: Group Sparsity Based Target Localization for Distributed Sensor Array Networks. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4190-4194.
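The group-sparsity idea in the abstract above can be illustrated with a minimal sketch. It assumes a matrix whose rows are candidate grid locations and whose columns are receivers, so that a target at one location activates a whole row. The function names and the row/column layout are assumptions made for illustration; this is the standard mixed l2,1 norm and its proximal operator, not the authors' solver.

```python
import math

def l21_norm(X):
    """Sum over rows (candidate locations) of the l2 norm taken across receivers."""
    return sum(math.sqrt(sum(abs(v) ** 2 for v in row)) for row in X)

def group_soft_threshold(X, tau):
    """Proximal operator of tau * l21_norm: shrink each row jointly toward zero.

    Rows whose l2 norm is below tau are zeroed as a group, which is what
    forces all receivers to agree on the same few active locations.
    """
    out = []
    for row in X:
        norm = math.sqrt(sum(abs(v) ** 2 for v in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out
```

Applied to a matrix with one strong row and one weak row, the weak row is removed entirely while the strong row is only scaled down.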
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8683844
Yu Zhu, F. J. Garcia, A. Marques, Santiago Segarra
We study the problem of jointly estimating several network processes that are driven by the same input, recasting it as one of blind identification of a bank of graph filters. More precisely, we consider the observation of several graph signals (i.e., signals defined on the nodes of a graph), and we model each of these signals as the output of a different network process (represented by a graph filter) defined on a common known graph and driven by a common unknown input. Our goal is to recover the specifications of every network process by observing only the outputs. Since every process shares the same input, the estimation problems are coupled, and a joint inference method is proposed. We study two scenarios: one where the orders of the filters are known, and one where they are not. For the former case we propose a least-squares approach and provide conditions for recovery. For the latter case, we put forth a sparse recovery algorithm with theoretical guarantees. Finally, we illustrate the proposed methods via numerical experiments.
Title: Estimation of Network Processes via Blind Graph Multi-filter Identification. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5451-5455.
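The observation model described in the abstract above, several graph filters on a common graph driven by one shared input, can be sketched as follows. The sketch assumes the usual polynomial form of a graph filter, y = sum_k h[k] S^k x, where S is a graph-shift operator such as the adjacency matrix. The names `S`, `h`, `x`, and `filter_bank` are illustrative assumptions, and only the forward model is shown, not the blind identification step.

```python
def mat_vec(S, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]

def graph_filter(S, h, x):
    """Apply the polynomial graph filter sum_k h[k] * S^k to the signal x."""
    y = [0.0] * len(x)
    shifted = list(x)  # holds S^k x, starting from k = 0
    for coeff in h:
        y = [yi + coeff * si for yi, si in zip(y, shifted)]
        shifted = mat_vec(S, shifted)
    return y

def filter_bank(S, filters, x):
    """Outputs of several filters on the same graph, driven by the same input."""
    return [graph_filter(S, h, x) for h in filters]
```

For example, on a three-node path graph with `S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]` and input `x = [1.0, 0.0, 0.0]`, calling `filter_bank(S, [[1.0, 0.5], [0.0, 0.0, 1.0]], x)` produces two observed signals that share the same hidden input.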
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8682802
Tao Lei, Qi Zhang, Dinghua Xue, Tao Chen, H. Meng, A. Nandi
In this paper, we propose a novel approach based on a symmetric fully convolutional network within pyramid pooling (FCN-PP) for landslide mapping (LM). The proposed approach has three advantages. Firstly, this approach is automatic and insensitive to noise because multivariate morphological reconstruction (MMR) is used for image preprocessing. Secondly, it is able to take into account features from multiple convolutional layers and efficiently explore the context of images, which leads to a good tradeoff between a wider receptive field and the use of context. Finally, the selected pyramid pooling module addresses the drawback of single-scale pooling employed by convolutional neural networks (CNN), fully convolutional networks (FCN), U-Net, etc. Experimental results show that the proposed FCN-PP is effective for LM, and it outperforms state-of-the-art approaches in terms of four metrics: Precision, Recall, F-score, and Accuracy.
Title: End-to-end Change Detection Using a Symmetric Fully Convolutional Network for Landslide Mapping. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3027-3031.
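The contrast between single-scale and pyramid pooling mentioned in the abstract above can be made concrete with a small sketch. It assumes a single-channel feature map stored as a list of rows and average pooling over 1x1, 2x2, and 4x4 grids. The bin sizes and function names are assumptions for illustration. Pyramid pooling modules in segmentation networks typically also upsample the pooled maps and concatenate them with the input features, which is omitted here, so this is not the FCN-PP module itself.

```python
import math

def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2-D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

def pyramid_pool(fmap, levels=(1, 2, 4)):
    """Concatenate pooled grids from several scales into one flat descriptor."""
    features = []
    for bins in levels:
        for row in adaptive_avg_pool(fmap, bins):
            features.extend(row)
    return features
```

The coarse level summarizes global context while the finer levels keep local detail, which is the multi-scale behavior that single-scale pooling lacks.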
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8682770
Chunhui Lu, Pengyuan Zhang, Yonghong Yan
Predicting prosodic boundaries from input text plays an important role in Chinese text-to-speech (TTS) systems, as it directly influences the naturalness and intelligibility of synthesized speech. In this paper, we propose to combine self-attention with multitask learning for prosodic boundary prediction. Self-attention is used to capture the dependency between two arbitrary characters in the input sentence, while multitask learning models the relationships between prosodic boundaries and lexicon words by setting word segmentation as an auxiliary task. The proposed method can generate prosodic boundary labels directly from Chinese characters and complete the whole process end-to-end. Experimental results show the effectiveness of our proposed model and indicate that the performance can be further improved by pretraining the model with extra word segmentation data.
Title: Self-attention Based Prosodic Boundary Prediction for Chinese Speech Synthesis. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7035-7039.
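The way self-attention relates any two positions in a sentence, as described in the abstract above, can be sketched with plain scaled dot-product attention. This minimal version assumes that queries, keys, and values are all the raw input vectors, with no learned projections, no multiple heads, and no positional encoding. Those simplifications and the function names are assumptions, so this is not the model used in the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    X is a list of d-dimensional vectors, one per character position.
    Returns the attended vectors and the attention weight matrix.
    """
    d = len(X[0])
    weights = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights.append(softmax(scores))
    output = [[sum(w * v[c] for w, v in zip(row, X)) for c in range(d)]
              for row in weights]
    return output, weights
```

Each row of the weight matrix sums to one and spans every position in the sequence, so the dependency between two characters does not depend on how far apart they are.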
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8682305
T. Ferreira, W. Martins, Markus V. S. Lima, P. Diniz
Set-membership affine projection (SM-AP) adaptive filters have been increasingly employed in the context of online data-selective learning. A key aspect of their good performance in terms of both convergence speed and steady-state mean-squared error is the choice of the so-called constraint vector. Optimal constraint vectors were recently proposed relying on convex optimization tools, which might sometimes lead to a prohibitive computational burden. This paper proposes a convex combination of simpler constraint vectors whose performance closely approaches that of the optimal solution while using far fewer computations. Some illustrative examples confirm that the sub-optimal solution closely matches the performance of the optimal one.
Title: Convex Combination of Constraint Vectors for Set-membership Affine Projection Algorithms. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4858-4862.
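The data-selective behavior mentioned in the abstract above can be illustrated with the set-membership NLMS update, the simplest member of the set-membership family, which uses a single data vector per update. This is a swapped-in, simpler technique: it does not implement the SM-AP constraint vectors or their convex combination. The names `gamma_bar` (error bound) and `eps` (regularizer) are assumptions for illustration.

```python
def sm_nlms_step(w, x, d, gamma_bar, eps=1e-12):
    """One set-membership NLMS step.

    w: current coefficients, x: input vector, d: desired sample,
    gamma_bar: bound on the acceptable error magnitude.
    Returns (new_w, updated), where updated tells whether the data was used.
    """
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    if abs(e) <= gamma_bar:
        # The error is already within the bound: skip the update entirely.
        return list(w), False
    mu = 1.0 - gamma_bar / abs(e)            # data-dependent step size
    energy = sum(xi * xi for xi in x) + eps  # regularized input energy
    new_w = [wi + mu * e * xi / energy for wi, xi in zip(w, x)]
    return new_w, True
```

After an update the a posteriori error magnitude equals `gamma_bar` (up to the regularizer), and inputs that already satisfy the bound cost no coefficient update, which is where the computational savings of data-selective filters come from.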
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8682832
Z. Wang, Xi Xiao, Guangwu Hu, Yao Yao, Dianyan Zhang, Zhendong Peng, Qing Li, Shutao Xia
Reinforcement learning is a framework for making sequential decisions. Combining it with deep neural networks further improves the ability of this framework. Convolutional neural networks make it possible to make sequential decisions directly from raw pixel information and allow reinforcement learning to achieve satisfactory performance in a series of tasks. However, convolutional neural networks still have their own limitations in representing geometric patterns and long-term dependencies that occur consistently in state inputs. To tackle this limitation, we propose a self-attention architecture to augment the original network. It provides a better balance between the ability to model long-range dependencies and computational efficiency. Experiments on Atari games illustrate that the self-attention structure is significantly effective for function approximation in deep reinforcement learning.
Title: Non-local Self-attention Structure for Function Approximation in Deep Reinforcement Learning. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3042-3046.
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8683767
R. Luke, D. McAlpine
A novel approach to multi-microphone acoustic source localisation based on spiking neural networks is presented. We demonstrate that a two-microphone system connected to a spiking neural network can be used to localise acoustic sources based purely on inter-microphone timing differences, with no need for manually configured delay lines. A two-sensor example is provided which includes 1) a front end which converts the acoustic signal to a series of spikes, 2) a hidden layer of spiking neurons, and 3) an output layer of spiking neurons which represents the location of the acoustic source. We present details on training the network and an evaluation of its performance in quiet and noisy conditions. The system is trained on two locations, and we show that the lateralisation accuracy is 100% when presented with previously unseen data in quiet conditions. We also demonstrate that the network generalises to modulation rates and background noise on which it was not trained.
Title: A Spiking Neural Network Approach to Auditory Source Lateralisation. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1488-1492.
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8683742
Gen Li, Jingkai Yan, Yuantao Gu
Compressed sensing seeks to recover an unknown sparse vector from undersampled measurements. Since its introduction, an enormous body of work on compressed sensing has developed efficient algorithms for sparse signal recovery. The restricted isometry property (RIP) has become the dominant tool used for the analysis of exact reconstruction from seemingly undersampled measurements. Although the upper bound of the RIP constant has been studied extensively, as far as we know, the corresponding result for the lower bound is missing. In this work, we present the first tight lower bound for the RIP constant, filling this gap. The lower bound is of the same order as the upper bound for the RIP constant. Moreover, we show by numerical simulations that our lower bound is close to the upper bound. Our bound on the RIP constant provides, for the first time, an information-theoretic lower bound on the sampling rate, which is the essential question for practitioners.
Title: Information Theoretic Lower Bound of Restricted Isometry Property Constant. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5297-5301.
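For reference, the RIP constant discussed in the abstract above is usually defined as follows. The symbols (a measurement matrix A with m rows and n columns, sparsity level s, constant delta_s) follow the standard textbook convention and are assumptions here, since the abstract does not fix any notation.

```latex
% Standard definition: \delta_s is the smallest \delta \ge 0 such that
% the inequality holds for every s-sparse vector x.
\[
  (1 - \delta_s)\,\lVert x \rVert_2^2
  \;\le\; \lVert A x \rVert_2^2
  \;\le\; (1 + \delta_s)\,\lVert x \rVert_2^2
  \qquad \text{for all } x \in \mathbb{R}^n \text{ with } \lVert x \rVert_0 \le s,
\]
% where A \in \mathbb{R}^{m \times n} is the measurement matrix and m < n.
```

A small delta_s means that A nearly preserves the length of every s-sparse vector, so upper and lower bounds on delta_s as a function of m, n, and s translate into statements about how many measurements are needed.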
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8683152
Yousef El-Laham, P. Djurić, M. Bugallo
Adaptive importance sampling (AIS) methods are a family of algorithms which can be used to approximate Bayesian posterior distributions. Many AIS algorithms exist in the literature, where the differences arise in the manner by which the proposal distribution is adapted at each iteration. The adaptive population importance sampler (APIS), for example, deterministically samples from a mixture distribution and uses the local information given by the samples and weights to adapt the location parameter of each proposal. The update rules by nature are heuristic, but effective, especially in the case that the target posterior is multimodal. In this work, we introduce a novel AIS scheme which incorporates modern techniques in stochastic optimization to improve the methodology for higher-dimensional posterior inference. More specifically, we derive update rules for the parameters of each proposal by means of deterministic mixture sampling and show that the method outperforms other state-of-the-art approaches in high-dimensional scenarios.
Title: A Variational Adaptive Population Importance Sampler. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5052-5056.
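The deterministic mixture sampling mentioned in the abstract above can be sketched for a one-dimensional target. The sketch assumes Gaussian proposals with a shared, fixed standard deviation and an unnormalized log-target supplied by the caller. All names are illustrative assumptions. Only the sampling and weighting step is shown; the adaptation of the proposal parameters, which is the contribution of the paper, is not reproduced.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def dm_importance_sampling(log_target, means, sigma, n_per_proposal, rng):
    """Draw the same number of samples from every proposal and weight each
    sample against the whole equal-weight mixture (deterministic mixture weights)."""
    samples, weights = [], []
    for mu in means:
        for _ in range(n_per_proposal):
            x = rng.gauss(mu, sigma)
            mixture = sum(gauss_pdf(x, m, sigma) for m in means) / len(means)
            samples.append(x)
            weights.append(math.exp(log_target(x)) / mixture)
    return samples, weights

def self_normalized_mean(samples, weights):
    """Self-normalized importance sampling estimate of the posterior mean."""
    total = sum(weights)
    return sum(w * x for x, w in zip(samples, weights)) / total
```

A usage sketch: with `log_target = lambda x: -0.5 * (x - 1.0) ** 2`, proposals centred at `[-2.0, 0.0, 2.0, 4.0]` with `sigma = 1.5`, and `rng = random.Random(0)`, the self-normalized estimate should land close to the true mean of 1. Using the mixture in the denominator, rather than the single proposal that generated each sample, keeps the weights bounded wherever any proposal covers the target.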
Pub Date: 2019-05-12; DOI: 10.1109/ICASSP.2019.8682801
Tzu-Wei Sung, Jun-You Liu, Hung-yi Lee, Lin-Shan Lee
Speech-to-text translation (ST) refers to transforming audio in a source language into text in a target language. Mainstream solutions for such tasks cascade automatic speech recognition with machine translation, for which transcriptions of the source language are needed in training. End-to-end approaches for ST tasks have been investigated not only because of technical interest, such as achieving a globally optimized solution, but also because ST is needed for the many source languages worldwide that have no written form. In this paper, we propose a new end-to-end ST framework with two decoders to handle the relatively deeper relationships between the source language audio and target language text. The first-pass decoder generates some useful latent representations, and the second-pass decoder then integrates the output of both the encoder and the first-pass decoder to generate the text translation in the target language. Only paired source language audio and target language text are used in training. Preliminary experiments on several language pairs showed improved performance and offered some initial analysis.
Title: Towards End-to-end Speech-to-text Translation with Two-pass Decoding. Venue: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7175-7179.