Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287552
Simon W. McKnight, Aidan O. T. Hogg, P. Naylor
Evaluation of speaker segmentation and diarization normally makes use of forgiveness collars around ground truth speaker segment boundaries, such that estimated speaker segment boundaries within such collars are considered completely correct. This paper shows that the popular recent approach of removing forgiveness collars from speaker diarization evaluation tools can unfairly penalize speaker diarization systems that correctly estimate speaker segment boundaries. The uncertainty in identifying the start and/or end of a particular phoneme means that the ground truth segmentation is not perfectly accurate, and even trained human listeners are unable to identify phoneme boundaries with full consistency. This research analyses the phoneme dependence of this uncertainty, and shows that it depends on (i) whether the phoneme being detected is at the start or end of an utterance and (ii) what the phoneme is, so that the use of a uniform forgiveness collar is inadequate. This analysis is expected to point the way towards more indicative and repeatable assessment of the performance of speaker diarization systems.
Title: Analysis of Phonetic Dependence of Segmentation Errors in Speaker Diarization. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 381-385.
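As a rough illustration of the forgiveness collar described in the abstract above, the following sketch counts an estimated boundary as correct when it lies within a collar of the nearest ground truth boundary. The function name `boundary_hits`, the timestamps, and the collar widths are hypothetical and chosen only for illustration; this is not the scoring tool evaluated in the paper.

```python
from bisect import bisect_left


def boundary_hits(reference, estimated, collar):
    """Count estimated boundaries lying within `collar` seconds of a reference boundary."""
    ref = sorted(reference)
    hits = 0
    for t in estimated:
        i = bisect_left(ref, t)
        # The nearest reference boundary is either just before or just after t.
        neighbours = ref[max(i - 1, 0):i + 1]
        if neighbours and min(abs(t - r) for r in neighbours) <= collar:
            hits += 1
    return hits


# Hypothetical boundary times in seconds.
reference = [1.00, 2.50, 4.00]
estimated = [1.10, 2.90, 4.02]

for collar in (0.0, 0.25, 0.5):
    print(collar, boundary_hits(reference, estimated, collar))
```

With a zero collar, none of the estimates count as correct, even the one that is only 20 ms off, which is the kind of penalty the abstract argues can be unfair when the ground truth itself is uncertain.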
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287489
M. Atashi, Parvin Malekzadeh, Mohammad Salimibeni, Zohreh Hajiakhondi-Meybodi, K. Plataniotis, Arash Mohammadi
The Internet of Things (IoT) has penetrated different aspects of modern life, and smart sensors enabled with Bluetooth Low Energy (BLE) are increasingly deployed within our surrounding indoor environments. BLE-based localization is typically performed based on the Received Signal Strength Indicator (RSSI), which suffers from different drawbacks due to its significant fluctuations. In this paper, we focus on a multiple-model estimation framework for analyzing and addressing the effects of the orientation of a BLE-enabled device on indoor localization accuracy. The fusion unit of the proposed method merges the orientation estimated from RSSI values with the heading estimated by Inertial Measurement Unit (IMU) sensors to gain higher accuracy in orientation classification. In contrast to existing RSSI-based solutions that use a single path-loss model, the proposed framework consists of eight orientation-matched path-loss models coupled with a multi-sensor, data-driven classification model that estimates the orientation of a hand-held device with a high accuracy of 99%. By estimating the orientation, we can mitigate the effect of orientation on the RSSI values and consequently improve RSSI-based distance estimates. In particular, the proposed data-driven and multiple-model framework is constructed based on over 10 million RSSI values and IMU sensor data collected via an implemented LBS platform.
Title: Orientation-Matched Multiple Modeling for RSSI-based Indoor Localization via BLE Sensors. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 1702-1706.
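To make the idea of orientation-matched path-loss models concrete, here is a minimal sketch that selects a path-loss model by orientation class and inverts it to obtain a distance. It assumes the common log-distance model, RSSI(d) = RSSI(1 m) - 10 n log10(d), which the abstract does not confirm, and the parameter values and the name `rssi_to_distance` are hypothetical. Only two of the eight orientation classes are shown.

```python
# Hypothetical per-orientation parameters: (RSSI at 1 m in dBm, path-loss exponent).
# The paper uses eight orientation-matched models; two made-up entries are shown here.
PATH_LOSS_MODELS = {
    0: (-60.0, 2.0),
    180: (-68.0, 2.4),
}


def rssi_to_distance(rssi_dbm, orientation_deg):
    """Invert an assumed log-distance path-loss model for the given orientation class."""
    rssi_at_1m, exponent = PATH_LOSS_MODELS[orientation_deg]
    return 10.0 ** ((rssi_at_1m - rssi_dbm) / (10.0 * exponent))


# The same RSSI reading maps to different distances depending on the orientation.
print(rssi_to_distance(-80.0, 0))
print(rssi_to_distance(-80.0, 180))
```

In a full pipeline the orientation class would come from the classifier described in the abstract; here it is passed in directly.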
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287396
Erik Månsson, Maria Sandsten
The matched window reassigned spectrogram relocates all signal energy of an oscillating transient to its time and frequency locations, resulting in a sharp peak in the time-frequency plane. However, previous research has shown that the method may result in split energy peaks for close components and at high noise levels, and the peak energy is then erroneously estimated. Using novel knowledge of the statistics of the method when subjected to noise, we propose a novel method, the smoothed reassigned spectrogram, for obtaining a stable and accurate measure of the signal energy from the peak value, with retained resolution properties. We also suggest a simple set of rules to enhance the reassigned spectrogram and speed up its calculation. Simulations are performed to verify the accuracy, and an application example on radar data is shown.
Title: The Smoothed Reassigned Spectrogram for Robust Energy Estimation. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 2210-2214.
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287548
Juan Sebastián Gómez Cañón, Estefanía Cano, P. Herrera, E. Gómez
In this study, we address emotion recognition using unsupervised feature learning from speech data, and test its transferability to music. Our approach is to pre-train models using speech in English and Mandarin, and then fine-tune them with excerpts of music labeled with categories of emotion. Our initial hypothesis is that features automatically learned from speech should be transferable to music. Namely, we expect the intra-linguistic setting (e.g., pre-training on speech in English and fine-tuning on music in English) should result in improved performance over the cross-linguistic setting (e.g., pre-training on speech in English and fine-tuning on music in Mandarin). Our results confirm previous research on cross-domain transferability, and encourage research towards language-sensitive Music Emotion Recognition (MER) models.
Title: Transfer learning from speech to music: towards language-sensitive emotion recognition models. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 136-140.
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287586
A. Bhandari, Matthias Beckmann, F. Krahmer
In this paper, we introduce the Modulo Radon Transform (MRT) which is complemented by an inversion algorithm. The MRT generalizes the conventional Radon Transform and is obtained via computing modulo of the line integral of a two-dimensional function at a given angle. Since the modulo operation has an aliasing effect on the range of a function, the recorded MRT sinograms are always bounded, thus avoiding information loss arising from saturation or clipping effects. This paves a new pathway for imaging applications such as high dynamic range tomography, a topic that is in its early stages of development. By capitalizing on the recent results on Unlimited Sensing architecture, we prove that the Modulo Radon Transform can be inverted when the resultant (discrete/continuous) measurements map to a band-limited function. Thus, the MRT leads to new possibilities for both conceptualization of inversion algorithms as well as development of new hardware, for instance, for single-shot high dynamic range tomography.
Title: The Modulo Radon Transform and its Inversion. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 770-774.
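The folding effect described in the abstract above can be sketched in a few lines. The example assumes a centered modulo that maps values into [-lam, lam), and it approximates the line integrals at a single angle by row sums of a small made-up image. Both choices, along with the names `fold` and `row_sums`, are illustrative assumptions rather than the paper's definitions.

```python
def fold(value, lam):
    """Centered modulo (assumed form): map `value` into the interval [-lam, lam)."""
    return (value + lam) % (2.0 * lam) - lam


def row_sums(image):
    """Crude discrete line integrals along horizontal lines (a single projection angle)."""
    return [sum(row) for row in image]


# Hypothetical image whose middle row would saturate a sensor limited to [-1, 1).
image = [
    [0.2, 0.3, 0.1],
    [1.0, 1.5, 2.0],
    [0.0, 0.0, 0.0],
]

projection = row_sums(image)
folded = [fold(p, 1.0) for p in projection]
print(projection)
print(folded)
```

Every folded value stays inside [-1, 1), so nothing is clipped. Recovering the unfolded projection from the folded one is the inversion problem the paper addresses, and it is not attempted here.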
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287525
Julien Gérard, J. Tomasik, C. Morisseau, Arpad Rimmel, G. Vieillard
There are numerous formats that represent the micro-Doppler signature. Our goal is to determine which one is best suited to classifying small UAVs (Unmanned Aerial Vehicles) with Deep Learning. To achieve this goal, we compare drone classification results obtained with the different micro-Doppler signatures for a given neural network. This comparison has been performed on data obtained during a radar measurement campaign. We evaluate the classification performance as a function of the different use conditions we identified. According to the experiments conducted, the recommended format is a spectrum computed from long observations, as its classification results are better for most criteria.
Title: Micro-Doppler Signal Representation for Drone Classification by Deep Learning. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 1561-1565.
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287668
Filip Wen-Fwu Tsai, Alireza M. Javid, S. Chatterjee
For prediction of a non-negative target signal using a non-negative input, we design a feed-forward neural network to achieve a better performance than a non-negative matrix factorization (NMF) algorithm. We provide a mathematical relation between the neural network and NMF. The architecture of the neural network is built on a property of rectified-linear-unit (ReLU) activation function and a convex optimization layer-wise training approach. For an illustrative example, we choose a speech enhancement application where a clean speech spectrum is estimated from a noisy spectrum.
Title: Design of a Non-negative Neural Network to Improve on NMF. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 461-465.
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287509
Karen J. Uribe-Murcia, J. Andrade-Lucio, Y. Shmaliy, Yuan Xu
This paper develops an unbiased finite impulse response (UFIR) filtering algorithm for networked systems where uncertain delays and packet dropouts can happen due to measurement failures and unreliable communication. The binary Bernoulli distribution with known delay probability is used to model the randomly arriving measurements. A novel representation of the stochastic model is presented for FIR-type filter structures. To compensate for packet dropouts and improve the estimation accuracy when a message arrives with no data, a predictive algorithm is used. An advantage of the UFIR filtering approach is demonstrated by comparing the mean square errors with the Kalman and H∞ filters under the same conditions. Experimental verifications are provided based on GPS vehicle tracking.
Title: Unbiased FIR Filtering under Bernoulli-Distributed Binary Randomly Delayed and Missing Data. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 2408-2412.
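The Bernoulli arrival model mentioned in the abstract above can be illustrated with a small simulation. The sketch assumes that each measurement is, with a known probability, replaced by the previous one (a one-step delay). The abstract does not spell out the delay model, so this form, the function name `delayed_observations`, and the sample values are assumptions for illustration only.

```python
import random


def delayed_observations(measurements, delay_prob, seed=0):
    """Simulate Bernoulli-distributed one-step delays (assumed model, not the paper's)."""
    rng = random.Random(seed)
    observed = []
    for k, z in enumerate(measurements):
        # A binary Bernoulli variable decides whether the current sample arrives on time.
        delayed = k > 0 and rng.random() < delay_prob
        observed.append(measurements[k - 1] if delayed else z)
    return observed


# Hypothetical position measurements.
track = [1.0, 2.0, 3.0, 4.0, 5.0]
print(delayed_observations(track, delay_prob=0.0))  # every sample arrives on time
print(delayed_observations(track, delay_prob=1.0))  # every sample after the first is one step late
print(delayed_observations(track, delay_prob=0.3))  # a random mix of the two
```

A filter operating on such data sees a mixture of current and stale measurements, which is the situation the proposed UFIR algorithm is designed to handle.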
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287814
Roberto López-Valcarce, Marcos Martínez-Cotelo
We investigate the design of hybrid precoders and combiners for a millimeter wave (mmWave) point-to-point bidirectional link in which both nodes transmit and receive simultaneously and on the same carrier frequency. In such a full-duplex configuration, mitigation of self-interference (SI) becomes critical. Large antenna arrays provide an opportunity for spatial SI suppression in mmWave. We assume a phase-shifter-based, fully connected architecture for the analog part of the precoder and combiner. The proposed design, which aims at cancelling SI in the analog domain to avoid frontend saturation, significantly improves on the performance of previous approaches.
Title: Full-Duplex mmWave Communication with Hybrid Precoding and Combining. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 1752-1756.
Pub Date: 2021-01-24 DOI: 10.23919/Eusipco47968.2020.9287224
C. Schymura, Tsubasa Ochiai, Marc Delcroix, K. Kinoshita, T. Nakatani, S. Araki, D. Kolossa
Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model. These types of models have been successfully applied to problems in natural language processing and automatic speech recognition. In this work, a multi-channel audio signal is encoded to a latent representation, which is subsequently decoded to a sequence of estimated directions-of-arrival. Here, attention captures temporal dependencies in the audio signal by focusing on specific frames that are relevant for estimating the activity and direction-of-arrival of sound events at the current time-step. The framework is evaluated on three publicly available datasets for sound event localization. It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
Title: Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization. Published in: 2020 28th European Signal Processing Conference (EUSIPCO), pp. 231-235.