An Improved DOA Estimator Based on Partial Relaxation Approach
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8462295
Minh Trinh-Hoang, M. Viberg, M. Pesavento
In the partial relaxation approach, at each desired direction the manifold structure of the remaining interfering signals impinging on the sensor array is relaxed, which yields closed-form estimates of the interference parameters. Adopting this approach, this paper proposes a new estimator based on an unconstrained covariance fitting problem. To obtain the null-spectra efficiently, an iterative rooting scheme based on rational function approximation is applied. Simulation results show that the proposed estimator outperforms the classical and other partial relaxation methods, especially for a low number of snapshots, irrespective of the specific structure of the sensor array, while maintaining a reasonable computational cost.
{"title":"An Improved Doa Estimator Based on Partial Relaxation Approach","authors":"Minh Trinh-Hoang, M. Viberg, M. Pesavento","doi":"10.1109/ICASSP.2018.8462295","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462295","url":null,"abstract":"In the partial relaxation approach, at each desired direction, the manifold structure of the remaining interfering signals impinging on the sensor array is relaxed, which results in closed form estimates for the interference parameters. By adopting this approach, in this paper, a new estimator based on the unconstrained covariance fitting problem is proposed. To obtain the null-spectra efficiently, an iterative rooting scheme based on the rational function approximation is applied. Simulation results show that the performance of the proposed estimator is superior to the classical and other partial relaxation methods, especially in the case of low number of snapshots, irrespectively of any specific structure of the sensor array while maintaining a reasonable computational cost.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"28 1","pages":"3246-3250"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90724756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time-Varying Delay Estimation Using Common Local All-Pass Filters with Application to Surface Electromyography
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461390
Christopher Gilliam, Adrian Bingham, T. Blu, B. Jelfs
Estimation of conduction velocity (CV) is an important task in the analysis of surface electromyography (sEMG). The problem can be framed as estimation of a time-varying delay (TVD) between electrode recordings. In this paper, we present an algorithm that incorporates information from multiple electrodes into a single TVD estimation. The algorithm uses a common all-pass filter to relate two groups of signals at a local level. We also address a current limitation of CV estimators by providing an automated way of identifying the innervation zone from a set of electrode recordings, thus allowing the entire array to be incorporated into the estimation. We validate the algorithm on both synthetic and real sEMG data, with results showing that the proposed algorithm is both robust and accurate.
{"title":"Time-Varying Delay Estimation Using Common Local All-Pass Filters with Application to Surface Electromyography","authors":"Christopher Gilliam, Adrian Bingham, T. Blu, B. Jelfs","doi":"10.1109/ICASSP.2018.8461390","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461390","url":null,"abstract":"Estimation of conduction velocity (CV) is an important task in the analysis of surface electromyography (sEMG). The problem can be framed as estimation of a time-varying delay (TVD) between electrode recordings. In this paper we present an algorithm which incorporates information from multiple electrodes into a single TVD estimation. The algorithm uses a common all-pass filter to relate two groups of signals at a local level. We also address a current limitation of CV estimators by providing an automated way of identifying the innervation zone from a set of electrode recordings, thus allowing incorporation of the entire array into the estimation. We validate the algorithm on both synthetic and real sEMG data with results showing the proposed algorithm is both robust and accurate.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"8 1","pages":"841-845"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79272251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Deeper Look at Gaussian Mixture Model Based Anti-Spoofing Systems
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461467
Bhusan Chettri, Bob L. Sturm
A “replay attack” involves replaying pre-recorded speech of an enrolled speaker to bypass an automatic speaker verification system. The 2017 ASVspoof Challenge focused on this kind of attack. In this paper, we describe our evaluation work after this challenge. First, we study the effectiveness of Gaussian Mixture Model (GMM) systems using six different hand-crafted features for detecting a replay attack. Second, we take a deeper look at these GMM systems and perform a frame-level analysis of log-likelihoods. Our analysis shows how system performance can depend on a simple class-dependent cue in the dataset: initial silence frames of zeros appear in the genuine signals but are missing in the spoofed versions. Third, we show how we can fool these systems using this cue. For example, the equal error rate (EER) of one GMM system rises dramatically from 14.82 to 44.44 when we add the cue to the evaluation data. Finally, we explore whether this problem can be mitigated by pre-processing the 2017 ASVspoof Challenge dataset.
{"title":"A Deeper Look at Gaussian Mixture Model Based Anti-Spoofing Systems","authors":"Bhusan Chettri, Bob L. Sturm","doi":"10.1109/ICASSP.2018.8461467","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461467","url":null,"abstract":"A “replay attack” involves replaying pre-recorded speech of an enrolled speaker to bypass an automatic speaker verification system. The 2017 ASVspoof Challenge focused on this kind of attack. In this paper, we describe our evaluation work after this challenge. First, we study the effectiveness of Gaussian Mixture Model (GMM) systems using six different hand-crafted features for detecting a replay attack. Second, we take a deeper look at these GMM systems and perform a frame-level analysis of log likelihoods. Our analysis shows how system performance can depend on a simple class-dependent cue in the dataset: initial silence frames of zeros appear in the genuine signals but missing in the spoofed version. Third, we show how we can fool these systems using this cue. For example, we find the equal error rate (EER) of one GMM system dramatically rises from 14.82 to 44.44 when we add the cue to the evaluation data. Finally, we explore whether this problem can be mitigated by pre-processing the 2017 ASV spoof Challenge dataset.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"45 1","pages":"5159-5163"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77784942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Coding of Non-Negative Factorization Parameters with Application to Informed Source Separation
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8462584
Max Bläser, Christian Rohlfing, Yingbo Gao, M. Wien
Informed source separation (ISS) uses source separation to extract audio objects from their downmix, given some pre-computed parameters. In recent years, non-negative tensor factorization (NTF) has proven to be a good choice for compressing audio objects at the encoding stage. At the decoding stage, these parameters are used to separate the downmix with Wiener filtering. The quantized NTF parameters have to be encoded into a bitstream prior to transmission. In this paper, we propose to use context-based adaptive binary arithmetic coding (CABAC) for this task. CABAC is widely used in the video coding community and exploits local signal statistics. We adapt CABAC to the task of NTF-based ISS and show that our contribution outperforms reference coding methods.
{"title":"Adaptive Coding of Non-Negative Factorization Parameters with Application to Informed Source Separation","authors":"Max Bläser, Christian Rohlfing, Yingbo Gao, M. Wien","doi":"10.1109/ICASSP.2018.8462584","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462584","url":null,"abstract":"Informed source separation (ISS) uses source separation for extracting audio objects out of their downmix given some pre-computed parameters. In recent years, non-negative tensor factorization (NTF) has proven to be a good choice for compressing audio objects at an encoding stage. At the decoding stage, these parameters are used to separate the downmix with Wiener-filtering. The quantized NTF parameters have to be encoded to a bit stream prior to transmission. In this paper, we propose to use context-based adaptive binary arithmetic coding (CABAC) for this task. CABAC is widely used in the video coding community and exploits local signal statistics. We adapt CABAC to the task of NTF-based ISS and show that our contribution outperforms reference coding methods.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"48 1","pages":"751-755"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87567219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hybrid Neural Network Based on the Duplex Model of Pitch Perception for Singing Melody Extraction
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461483
Hsin Chou, Ming-Tso Chen, T. Chi
In this paper, we build a hybrid neural network (NN) for singing melody extraction from polyphonic music by imitating human pitch perception. In human hearing, there are two pitch perception models, the spectral model and the temporal model, according to whether harmonics are resolved or not. Here, we first use NNs to implement the individual models and evaluate their performance on the task of singing melody extraction. Then, we combine the NNs into a composite NN that simulates the duplex model, in which the temporal model complements the spectral model's pitch perception from unresolved harmonics. Simulation results show the proposed composite NN outperforms other conventional methods in singing melody extraction.
{"title":"A Hybrid Neural Network Based on the Duplex Model of Pitch Perception for Singing Melody Extraction","authors":"Hsin Chou, Ming-Tso Chen, T. Chi","doi":"10.1109/ICASSP.2018.8461483","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461483","url":null,"abstract":"In this paper, we build up a hybrid neural network (NN) for singing melody extraction from polyphonic music by imitating human pitch perception. For human hearing, there are two pitch perception models, the spectral model and the temporal model, in accordance with whether harmonics are resolved or not. Here, we first use NNs to implement individual models and evaluate their performance in the task of singing melody extraction. Then, we combine the NNs to constitute the composite NN to simulate the duplex model, which complements the pitch perception from unresolved harmonics of the spectral model using the temporal model. Simulation results show the proposed composite NN outperforms other conventional methods in singing melody extraction.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"119 1","pages":"381-385"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80407476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A 203 FPS VLSI Architecture of Improved Dense Trajectories for Real-Time Human Action Recognition
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461988
Zhi-Yi Lin, Jia-Lin Chen, Liang-Gee Chen
This paper introduces an architecture with high throughput, low on-chip memory, and efficient data access for Improved Dense Trajectories (iDT) as a video representation for real-time action recognition. The iDT feature can capture long-term motion cues better than any existing deep feature, which makes it crucial in state-of-the-art action recognition systems. Our architecture design has three major features: low-bandwidth frame-wise feature extraction, a low on-chip-memory architecture for point tracking, and a two-stage trajectory-pruning architecture for low bandwidth. Using TSMC 40 nm technology, the chip area is 3.1 mm² and the on-chip memory is 40.8 kB. The chip supports 320×240 video at a throughput of 203 fps at 215 MHz, an 81.2× speedup over a CPU implementation. At the same operating frequency, it can also provide feature extraction for six 320×240 windows of higher-resolution video at a throughput of 34 fps.
{"title":"A 203 FPS VLSI Architecture of Improved Dense Trajectories for Real-Time Human Action Recognition","authors":"Zhi-Yi Lin, Jia-Lin Chen, Liang-Gee Chen","doi":"10.1109/ICASSP.2018.8461988","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461988","url":null,"abstract":"This paper introduces architecture with high throughput, low on-chip memory, and efficient data access for Improved Dense Trajectories (iDT) as video representations for realtime action recognition. The iDT feature can capture longterm motion cues better than any existing deep feature, which makes it crucial in state-of-the-art action recognition systems. There are three major features in our architecture design, including a low bandwidth frame-wise feature extraction, low on-chip memory architecture for point tracking, and two-stage trajectory pruning architecture for low bandwidth. Using TSMC 40nm technology, our chip area is 3.1 mm2, and the size of on-chip memory is 40.8 kB. The chip can support videos in resolution of 320×240 with a throughput of 203 fps under 215 MHz, which is a 81.2 times speedup compared with CPU. Under the same operating frequency, it can also provide feature extraction for six windows of size 320 × 240 in higher resolution videos with a throughput of 34 fps.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"33 1","pages":"1115-1119"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80143035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Generative Auditory Model Embedded Neural Network for Speech Processing
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8462690
Yu-Wen Lo, Yih-Liang Shen, Y. Liao, T. Chi
Before the era of the neural network (NN), features extracted from auditory models had been applied to various speech applications and demonstrated to be more robust against noise than conventional speech-processing features. What is the role of auditory models in the current NN era? Are they obsolete? To answer this question, we construct an NN with an embedded generative auditory model to process speech signals. The generative auditory model consists of two stages: spectrum estimation on a logarithmic frequency axis by the cochlea, and spectral-temporal analysis in the modulation domain by the auditory cortex. The NN is evaluated on a simple speaker identification task. Experimental results show that the auditory-model-embedded NN is more robust against noise, especially in low-SNR conditions, than a randomly initialized NN in speaker identification.
{"title":"A Generative Auditory Model Embedded Neural Network for Speech Processing","authors":"Yu-Wen Lo, Yih-Liang Shen, Y. Liao, T. Chi","doi":"10.1109/ICASSP.2018.8462690","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462690","url":null,"abstract":"Before the era of the neural network (NN), features extracted from auditory models have been applied to various speech applications and been demonstrated more robust against noise than conventional speech-processing features. What's the role of auditory models in the current NN era? Are they obsolete? To answer this question, we construct a NN with a generative auditory model embedded to process speech signals. The generative auditory model consists of two stages, the stage of spectrum estimation in the logarithmic-frequency axis by the cochlea and the stage of spectral-temporal analysis in the modulation domain by the auditory cortex. The NN is evaluated in a simple speaker identification task. Experiment results show that the auditory model embedded NN is still more robust against noise, especially in low SNR conditions, than the randomly-initialized NN in speaker identification.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"30 1","pages":"5179-5183"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78073383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Asynchronous Power Iteration: A Graph Signal Perspective
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461872
Oguzhan Teke, P. Vaidyanathan
This paper considers an autonomous network in which the nodes communicate only with their neighbors at random time instances, repeatedly and independently. Polynomial graph filters studied in the context of graph signal processing are inadequate for analyzing signals on this type of network, because the basic shift on a graph requires all nodes to communicate at the same time, which cannot be assumed in an autonomous setting. To analyze such networks, this paper studies an asynchronous power iteration that updates the values of only a subset of nodes. The paper further reveals the close connection between asynchronous updates and the notion of smooth signals on the graph, and shows that a cascade of random asynchronous updates smooths out any arbitrary signal on the graph.
{"title":"The Asynchronous Power Iteration: A Graph Signal Perspective","authors":"Oguzhan Teke, P. Vaidyanathan","doi":"10.1109/ICASSP.2018.8461872","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461872","url":null,"abstract":"This paper considers an autonomous network in which the nodes communicate only with their neighbors at random time instances, repeatedly and independently. Polynomial graph filters studied in the context of graph signal processing are inadequate to analyze signals on this type of networks. This is due to the fact that the basic shift on a graph requires all the nodes to communicate at the same time, which cannot be assumed in an autonomous setting. In order to analyze these type of networks, this paper studies an asynchronous power iteration that updates the values of only a subset of nodes. This paper further reveals the close connection between asynchronous updates and the notion of smooth signals on the graph. The paper also shows that a cascade of random asynchronous updates smooths out any arbitrary signal on the graph.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"30 2","pages":"4059-4063"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72631210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Scale Object Detection with Feature Fusion and Region Objectness Network
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8461523
W. Guan, Yuexian Zou, Xiaoqun Zhou
Though tremendous progress has been made in object detection thanks to deep convolutional networks, one of the remaining challenges is multi-scale object detection (MOD). To improve performance on the MOD task, we take the Faster region-based CNN (Faster R-CNN) framework and work on two specific problems: obtaining more accurate localization for small objects and eliminating background region proposals when many small objects exist. Specifically, a feature fusion module is introduced that jointly utilizes the highly abstracted semantic knowledge captured in higher layers and the detail information captured in lower layers to generate fine-resolution feature maps, so that small objects can be localized more accurately. Besides, a novel Region Objectness Network is developed for generating effective proposals that are more likely to cover the target objects. Extensive experiments have been conducted on the UA-DETRAC car dataset, as well as on a self-built bird dataset (BSBDV 2017) collected from the Shenzhen Bay coastal wetland, demonstrating the competitive performance and comparable detection speed of the proposed method.
{"title":"Multi-Scale Object Detection with Feature Fusion and Region Objectness Network","authors":"W. Guan, Yuexian Zou, Xiaoqun Zhou","doi":"10.1109/ICASSP.2018.8461523","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461523","url":null,"abstract":"Though tremendous progresses have been made in object detection due to the deep convolutional networks, one of the remaining challenges is the multi-scale object detection(MOD). To improve the performance of MOD task, we take Faster region-based CNN (Faster R-CNN) framework and work on two specific problems: get more accurate localization for small objects and eliminate background region proposals, when there are many small objects exist. Specifically, a feature fusion module is introduced which jointly utilize the high-abstracted semantic knowledge captured in higher layer and details information captured in the lower layer to generate a fine resolution feature maps. As a result, the small objects can be localized more accurately. Besides, a novel Region Objectness Network is developed for generating effective proposals which are more likely to cover the target objects. Extensive experiments have been conducted over UA-DETRAC car datasets, as well as a self-built bird dataset (BSBDV 2017) collected from Shenzhen Bay coastal wetland, which demonstrate the competitive performance and the comparable detection speed of our proposed method.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"2596-2600"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90483137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training
Pub Date: 2018-04-22 | DOI: 10.1109/ICASSP.2018.8462505
Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, Bo Xu
In this paper, we propose a speaker-independent multi-speaker monaural speech separation system (CBLDNN-GAT) based on a convolutional, bidirectional long short-term memory, deep feedforward neural network (CBLDNN) with generative adversarial training (GAT). Our system aims at obtaining better speech quality instead of only minimizing a mean square error (MSE). In the initial phase, we utilize log-mel filterbank and pitch features to warm up our CBLDNN in a multi-task manner, so that information that contributes to separating speech and improving speech quality is integrated into the model. We execute GAT throughout training, which makes the separated speech indistinguishable from real speech. We evaluate CBLDNN-GAT on the WSJ0-2mix dataset. The experimental results show that the proposed model achieves an 11.0 dB signal-to-distortion ratio (SDR) improvement, a new state-of-the-art result.
{"title":"CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training","authors":"Chenxing Li, Lei Zhu, Shuang Xu, Peng Gao, Bo Xu","doi":"10.1109/ICASSP.2018.8462505","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462505","url":null,"abstract":"In this paper, we propose a speaker-independent multi-speaker monaural speech separation system (CBLDNN-GAT) based on convolutional, bidirectional long short-term memory, deep feedforward neural network (CBLDNN) with generative adversarial training (GAT). Our system aims at obtaining better speech quality instead of only minimizing a mean square error (MSE). In the initial phase, we utilize log-mel filterbank and pitch features to warm up our CBLDNN in a multi-task manner. Thus, the information that contributes to separating speech and improving speech quality is integrated into the model. We execute GAT throughout the training, which makes the separated speech indistinguishable from the real one. We evaluate CBLDNN-GAT on WSJ0-2mix dataset. The experimental results show that the proposed model achieves 11.0d-B signal-to-distortion ratio (SDR) improvement, which is the new state-of-the-art result.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"36 1","pages":"711-715"},"PeriodicalIF":0.0,"publicationDate":"2018-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87784806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}