
Latest publications from the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Face recognition in real-world images
Xavier Fontaine, R. Achanta, S. Süsstrunk
Face recognition systems are designed to handle well-aligned images captured under controlled conditions. However, real-world images present varying orientations, expressions, and illumination conditions, and traditional face recognition algorithms perform poorly on such images. In this paper we present a face recognition method adapted to real-world conditions that can be trained using very few training examples and is computationally efficient. Our method performs a novel alignment process followed by classification using sparse representation techniques. We report recognition rates on a difficult dataset of real-world faces, on which we significantly outperform state-of-the-art methods.
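As a rough illustration of the classification stage described above, the sketch below performs sparse-representation classification over a dictionary of training faces: the probe is coded as a sparse combination of training images and assigned to the class with the smallest reconstruction residual. The alignment step, dataset, and regularization value are placeholders, not the authors' pipeline.

```python
# Minimal sparse-representation classification (SRC) sketch; illustrative only.
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(train_faces, train_labels, test_face, alpha=0.01):
    """Classify one (already aligned) face by its sparse code over the training set.

    train_faces : (n_samples, n_pixels) array, one vectorized face per row
    train_labels: (n_samples,) array of class ids
    test_face   : (n_pixels,) vectorized probe face
    """
    # Dictionary columns are the L2-normalized training faces.
    D = train_faces.T / (np.linalg.norm(train_faces, axis=1) + 1e-12)
    # Sparse code: argmin ||D x - y||_2^2 + alpha * ||x||_1
    coder = Lasso(alpha=alpha, max_iter=5000, fit_intercept=False)
    coder.fit(D, test_face)
    x = coder.coef_
    # Assign the label whose training faces best reconstruct the probe.
    residuals = {}
    for c in np.unique(train_labels):
        mask = (train_labels == c)
        residuals[c] = np.linalg.norm(test_face - D[:, mask] @ x[mask])
    return min(residuals, key=residuals.get)
```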
{"title":"Face recognition in real-world images","authors":"Xavier Fontaine, R. Achanta, S. Süsstrunk","doi":"10.1109/ICASSP.2017.7952403","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952403","url":null,"abstract":"Face recognition systems are designed to handle well-aligned images captured under controlled situations. However real-world images present varying orientations, expressions, and illumination conditions. Traditional face recognition algorithms perform poorly on such images. In this paper we present a method for face recognition adapted to real-world conditions that can be trained using very few training examples and is computationally efficient. Our method consists of performing a novel alignment process followed by classification using sparse representation techniques. We present our recognition rates on a difficult dataset that represents real-world faces where we significantly outperform state-of-the-art methods.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128957156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Patch-based multiple view image denoising with occlusion handling
Shiwei Zhou, Y. Hu, Hongrui Jiang
A novel patch-based multi-view image denoising algorithm is proposed. The method leverages the structure of 3D focus image stacks to exploit the self-similarity and image redundancy inherent in multiple-view images. A depth-guided adaptive window and a dynamic view selection criterion are then developed to aid proper selection of the most consistent patches for multi-view image denoising. Extensive experiments have been performed; compared with state-of-the-art image denoising algorithms, the proposed algorithm demonstrates a significant performance advantage.
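The toy sketch below isolates the patch-selection idea: among candidate patches gathered from other views, keep only the ones most consistent with the reference patch before averaging. The keep ratio and weighting are illustrative choices, not the paper's exact depth-guided criterion.

```python
# Consistency-based patch selection and weighted averaging; illustrative only.
import numpy as np

def denoise_patch(ref_patch, candidate_patches, keep_ratio=0.5, h=10.0):
    """ref_patch: (p, p); candidate_patches: (n, p, p) patches from other views."""
    diffs = np.array([np.mean((ref_patch - c) ** 2) for c in candidate_patches])
    # Keep the most consistent candidates (smallest mean squared difference).
    order = np.argsort(diffs)
    keep = order[: max(1, int(keep_ratio * len(order)))]
    # Weight the kept patches by their similarity to the reference.
    w = np.exp(-diffs[keep] / (h ** 2))
    w /= w.sum()
    return np.tensordot(w, candidate_patches[keep], axes=1)
```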
{"title":"Patch-based multiple view image denoising with occlusion handling","authors":"Shiwei Zhou, Y. Hu, Hongrui Jiang","doi":"10.1109/ICASSP.2017.7952463","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952463","url":null,"abstract":"A novel patch-based multi-view image denoising algorithm is proposed. This method leverages the 3D focus image stacks structure to exploit self-similarity and image redundancy inherent in multiple view images. Then a depth-guided adaptive window and dynamic view selection criterion is developed to aid proper selection of most consistent patches for the multi-view image denoising. Extensive experiments have been performed. Comparing the outcomes against those of state of the art image denoising algorithms, our proposed algorithm demonstrates significant performance advantage.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114611982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Affect recognition from lip articulations
Rizwan Sadiq, E. Erzin
Lips deliver visually active clues for speech articulation. Affective states define how humans articulate speech; hence, they also change the articulation of lip motion. In this paper, we investigate the effect of phonetic classes on affect recognition from lip articulations. The affect recognition problem is formalized over discrete activation, valence and dominance attributes. We use the symmetric Kullback-Leibler divergence (KLD) to rate phonetic classes with larger discrimination across different affective states. We perform experimental evaluations using the IEMOCAP database. Our results demonstrate that lip articulations over a set of discriminative phonetic classes improve affect recognition performance, attaining 3-class recognition rates for the activation, valence and dominance (AVD) attributes of 72.16%, 46.44% and 64.92%, respectively.
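A minimal sketch of the symmetric KLD score used to rank phonetic classes is given below; purely for illustration, each class's feature distribution under low/high affect is modeled as a one-dimensional Gaussian, which is an assumption rather than the paper's exact setup.

```python
# Symmetric Kullback-Leibler divergence between two 1-D Gaussians; illustrative only.
import numpy as np

def symmetric_kld_gaussian(x_low, x_high):
    """x_low, x_high: feature samples of one phonetic class under low/high affect."""
    m1, v1 = np.mean(x_low), np.var(x_low) + 1e-9
    m2, v2 = np.mean(x_high), np.var(x_high) + 1e-9
    kl_12 = 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl_21 = 0.5 * (np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1.0)
    return kl_12 + kl_21  # larger value -> more discriminative phonetic class
```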
{"title":"Affect recognition from lip articulations","authors":"Rizwan Sadiq, E. Erzin","doi":"10.1109/ICASSP.2017.7952593","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952593","url":null,"abstract":"Lips deliver visually active clues for speech articulation. Affective states define how humans articulate speech; hence, they also change articulation of lip motion. In this paper, we investigate effect of phonetic classes for affect recognition from lip articulations. The affect recognition problem is formalized in discrete activation, valence and dominance attributes. We use the symmetric KullbackLeibler divergence (KLD) to rate phonetic classes with larger discrimination across different affective states. We perform experimental evaluations using the IEMOCAP database. Our results demonstrate that lip articulations over a set of discriminative phonetic classes improves the affect recognition performance, and attains 3-class recognition rates for the activation, valence and dominance (AVD) attributes as 72.16%, 46.44% and 64.92%, respectively.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115331219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Particle PHD filter based multi-target tracking using discriminative group-structured dictionary learning
Zeyu Fu, P. Feng, S. M. Naqvi, J. Chambers
Structured sparse representation has recently been found to achieve better efficiency and robustness in exploiting the target appearance model in tracking systems, using both holistic and local information. Therefore, to better discriminate multiple targets from their background simultaneously, we propose a novel video-based multi-target tracking system that combines the particle probability hypothesis density (PHD) filter with discriminative group-structured dictionary learning. The discriminative dictionary with group structure, learned by a hierarchical K-means clustering algorithm, implicitly associates the dictionary atoms with group labels while enforcing target candidates from the same group (class) to share the same structured sparsity pattern. Furthermore, we propose a new joint likelihood calculation that relates the discriminative sparse codes to a maximum-voting technique to enhance the particle PHD update step. Experimental results on two publicly available benchmark video sequences confirm the improved performance of our proposed method over other state-of-the-art techniques in video-based multi-target tracking.
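As a loose illustration of the group-structured dictionary idea, the sketch below builds class-labeled atoms by clustering each class's descriptors; a single K-means pass per class stands in for the paper's hierarchical K-means, and all sizes are assumptions.

```python
# Toy group-structured dictionary: atoms are per-class cluster centers; illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def group_dictionary(features, labels, atoms_per_class=16, seed=0):
    """features: (n, d) target descriptors; labels: (n,) class ids."""
    atoms, atom_labels = [], []
    for c in np.unique(labels):
        X = features[labels == c]
        k = min(atoms_per_class, len(X))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        atoms.append(km.cluster_centers_)
        atom_labels.extend([c] * k)
    D = np.vstack(atoms)                                    # (n_atoms, d) dictionary
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12   # unit-norm atoms
    return D, np.array(atom_labels)                         # atoms with their group labels
```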
结构化稀疏表示在具有整体和局部信息的跟踪系统中具有更好的效率和鲁棒性。因此,为了更好地同时识别多目标及其背景,我们提出了一种基于视频的多目标跟踪系统,该系统将粒子概率假设密度(PHD)滤波器与判别组结构字典学习相结合。通过分层K-means聚类算法学习的具有组结构的判别字典隐式地将字典原子与组标签关联起来,同时强制来自同一组(类)的目标候选对象共享相同的结构化稀疏模式。此外,我们提出了一种新的联合似然计算方法,将区别稀疏码与最大投票技术相结合,以提高粒子PHD更新的步长。在两个公开可用的基准视频序列上的实验结果证实了我们提出的方法在基于视频的多目标跟踪方面的性能优于其他最先进的技术。
{"title":"Particle PHD filter based multi-target tracking using discriminative group-structured dictionary learning","authors":"Zeyu Fu, P. Feng, S. M. Naqvi, J. Chambers","doi":"10.1109/ICASSP.2017.7952983","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952983","url":null,"abstract":"Structured sparse representation has been recently found to achieve better efficiency and robustness in exploiting the target appearance model in tracking systems with both holistic and local information. Therefore, to better simultaneously discriminate multi-targets from their background, we propose a novel video-based multi-target tracking system that combines the particle probability hypothesis density (PHD) filter with discriminative group-structured dictionary learning. The discriminative dictionary with group structure learned by the hierarchical K-means clustering algorithm implicitly associates the dictionary atoms with the group labels, simultaneously enforcing the target candidates from the same group (class) to share the same structured sparsity pattern. Furthermore, we propose a new joint likelihood calculation by relating the discriminative sparse codes with the maximum voting technique to enhance the particle PHD updating step. Experimental results on two publicly available benchmark video sequences confirm the improved performance of our proposed method over other state-of-the-art techniques in video-based multi-target tracking.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116044764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Respiratory airflow estimation from lung sounds based on regression
Elmar Messner, Martin Hagmüller, P. Swatek, F. Smolle-Jüttner, F. Pernkopf
The aim of this work is the estimation of respiratory flow from lung sound recordings, i.e. acoustic airflow estimation. With a 16-channel lung sound recording device, we simultaneously record the respiratory flow and the lung sounds on the posterior chest of six lung-healthy subjects in the supine position. For the recordings of four selected sensor positions, we extract linear frequency cepstral coefficient (LFCC) features and map them onto the airflow signal. We use multivariate polynomial regression to fit the features to the airflow signal. Compared to most previous approaches, the proposed method uses lung sounds instead of trachea sounds. Furthermore, our method estimates the airflow without prior knowledge of the respiratory phase, i.e. no additional algorithm for phase detection is required. Another benefit is the avoidance of time-consuming calibration. In experiments, we evaluate the proposed method for various selections of sensor positions in terms of the mean squared error (MSE) between estimated and actual airflow. Moreover, we show the accuracy of the method for frame-based breathing-phase detection.
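A minimal sketch of the regression step is shown below: frame-level cepstral features are mapped to the airflow signal with multivariate polynomial regression. The LFCC front end and the data arrays (train_lfcc, train_flow, test_lfcc, test_flow) are assumed to be produced elsewhere.

```python
# Multivariate polynomial regression from cepstral features to airflow; illustrative only.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_airflow_regressor(lfcc_frames, airflow_frames, degree=2):
    """lfcc_frames: (n_frames, n_coeffs); airflow_frames: (n_frames,) target flow."""
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(lfcc_frames, airflow_frames)
    return model

# Hypothetical usage:
# model = fit_airflow_regressor(train_lfcc, train_flow)
# mse = np.mean((model.predict(test_lfcc) - test_flow) ** 2)
```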
{"title":"Respiratory airflow estimation from lung sounds based on regression","authors":"Elmar Messner, Martin Hagmüller, P. Swatek, F. Smolle-Jüttner, F. Pernkopf","doi":"10.1109/ICASSP.2017.7952331","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952331","url":null,"abstract":"The aim of this work is the estimation of respiratory flow from lung sound recordings, i.e. acoustic airflow estimation. With a 16-channel lung sound recording device, we simultaneously record the respiratory flow and the lung sounds on the posterior chest from six lung-healthy subjects in supine position. For the recordings of four selected sensor positions, we extract linear frequency cepstral coefficient (LFCC) features and map these on the airflow signal. We use multivariate polynomial regression to fit the features to the airflow signal. Compared to most of the previous approaches, the proposed method uses lung sounds instead of trachea sounds. Furthermore, our method masters the estimation of the airflow without prior knowledge of the respiratory phase, i.e. no additional algorithm for phase detection is required. Another benefit is the avoidance of time-consuming calibration. In experiments, we evaluate the proposed method for various selections of sensor positions in terms of mean squared error (MSE) between estimated and actual airflow. Moreover, we show the accuracy of the method regarding a frame-based breathing-phase detection.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"147 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126307337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
A PLLR and multi-stage Staircase Regression framework for speech-based emotion prediction
Zhaocheng Huang, J. Epps
Continuous prediction of dimensional emotions (e.g. arousal and valence) has recently attracted increasing research interest. When processing emotional speech signals, phonetic features have rarely been used, due to the assumption that phonetic variability is a confounding factor that degrades emotion recognition/prediction performance. In this paper, instead of eliminating phonetic variability, we investigate whether Phone Log-Likelihood Ratio (PLLR) features can be used to index arousal and valence in a pairwise low/high framework. A multi-stage Staircase Regression (SR) framework that enables fusion at three different stages is also investigated. Results on the RECOLA database show that PLLR outperforms eGeMAPS features for arousal and valence. Interestingly, long-term averaged PLLR proved to be more robust and emotionally informative than local frame-level PLLR, which contains more phoneme-specific information. Within the multi-stage SR framework, PLLR yielded 8.2% and 11.6% relative improvements in concordance correlation coefficient (CCC) for arousal and valence respectively, showing great promise for including phonetic features in emotion prediction systems.
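The sketch below shows the usual way PLLR features are obtained from frame-level phone posteriors, together with the long-term average mentioned above; the phone recognizer producing the posteriors is assumed, and this is not necessarily the authors' exact configuration.

```python
# Phone Log-Likelihood Ratio (PLLR) features from phone posteriors; illustrative only.
import numpy as np

def pllr_features(phone_posteriors, eps=1e-6):
    """phone_posteriors: (n_frames, n_phones) posterior probabilities, rows sum to 1."""
    p = np.clip(phone_posteriors, eps, 1.0 - eps)
    pllr = np.log(p / (1.0 - p))            # frame-level PLLR
    utterance_pllr = pllr.mean(axis=0)      # long-term average per utterance
    return pllr, utterance_pllr
```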
{"title":"A PLLR and multi-stage Staircase Regression framework for speech-based emotion prediction","authors":"Zhaocheng Huang, J. Epps","doi":"10.1109/ICASSP.2017.7953137","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7953137","url":null,"abstract":"Continuous prediction of dimensional emotions (e.g. arousal and valence) has attracted increasing research interest recently. When processing emotional speech signals, phonetic features have been rarely used due to the assumption that phonetic variability is a confounding factor that degrades emotion recognition/prediction performance. In this paper, instead of eliminating phonetic variability, we investigated whether Phone Log-Likelihood Ratio (PLLR) features could be used to index arousal and valence in a pairwise low/high framework. A multi-stage staircase regression (SR) framework which enables fusion at three different stages is also investigated. Results on the RECOLA database show that PLLR outperforms EGEMAPS features for arousal and valence. Interestingly, long-term averaged PLLR proved to be more robust and emotionally informative than local frame-level PLLR, which contains more phoneme-specific information. Within the multistage SR framework, PLLR yielded an 8.2% and 11.6% relative improvement in CCC for arousal and valence respectively, showing great promise for including phonetic features in emotion prediction systems.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115506717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Biologically inspired speech emotion recognition
Reza Lotfidereshgi, P. Gournay
Conventional feature-based classification methods do not apply well to the automatic recognition of speech emotions, mostly because the precise set of spectral and prosodic features required to identify the emotional state of a speaker has not yet been determined. This paper presents a method that operates directly on the speech signal, thus avoiding the problematic step of feature extraction. Furthermore, this method combines the strengths of the classical source-filter model of human speech production with those of the recently introduced liquid state machine (LSM), a biologically inspired spiking neural network (SNN). The source and vocal tract components of the speech signal are first separated and converted into perceptually relevant spectral representations. These representations are then processed separately by two reservoirs of neurons. The output of each reservoir is reduced in dimensionality and fed to a final classifier. The method is shown to provide very good classification performance on the Berlin Database of Emotional Speech (Emo-DB), and appears to be a promising framework for efficiently solving many other problems in speech processing.
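A minimal sketch of the initial source/vocal-tract split is given below: per analysis frame, LPC coefficients model the vocal-tract filter and the inverse-filtered residual approximates the excitation source. The spiking-reservoir and classification stages are not shown, and the frame handling and LPC order are illustrative.

```python
# Per-frame source-filter separation via LPC inverse filtering; illustrative only.
import numpy as np
import librosa
from scipy.signal import lfilter

def source_filter_split(frame, lpc_order=16):
    """frame: 1-D array of speech samples (one analysis frame)."""
    a = librosa.lpc(frame, order=lpc_order)   # [1, a1, ..., ap] vocal-tract model
    residual = lfilter(a, [1.0], frame)       # prediction error ~ excitation source
    return a, residual
```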
{"title":"Biologically inspired speech emotion recognition","authors":"Reza Lotfidereshgi, P. Gournay","doi":"10.1109/ICASSP.2017.7953135","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7953135","url":null,"abstract":"Conventional feature-based classification methods do not apply well to automatic recognition of speech emotions, mostly because the precise set of spectral and prosodic features that is required to identify the emotional state of a speaker has not been determined yet. This paper presents a method that operates directly on the speech signal, thus avoiding the problematic step of feature extraction. Furthermore, this method combines the strengths of the classical source-filter model of human speech production with those of the recently introduced liquid state machine (LSM), a biologically-inspired spiking neural network (SNN). The source and vocal tract components of the speech signal are first separated and converted into perceptually relevant spectral representations. These representations are then processed separately by two reservoirs of neurons. The output of each reservoir is reduced in dimensionality and fed to a final classifier. This method is shown to provide very good classification performance on the Berlin Database of Emotional Speech (Emo-DB). This seems a very promising framework for solving efficiently many other problems in speech processing.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115301322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 37
Automatic speech emotion recognition using recurrent neural networks with local attention
Seyedmahdad Mirsamadi, Emad Barsoum, Cha Zhang
Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
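As a rough sketch of the attention-based temporal pooling idea, the code below scores each frame-level RNN output with a parameter vector and forms a softmax-weighted utterance representation; the vector u would normally be trained jointly with the network and is a placeholder here, as are the array shapes.

```python
# Attention-weighted pooling of frame-level features into an utterance vector; illustrative only.
import numpy as np

def attention_pool(frame_feats, u):
    """frame_feats: (T, d) RNN outputs over T frames; u: (d,) attention parameter vector."""
    scores = frame_feats @ u                 # one scalar score per frame
    weights = np.exp(scores - scores.max())  # numerically stable softmax over time
    weights /= weights.sum()
    return weights @ frame_feats             # (d,) utterance-level representation
```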
{"title":"Automatic speech emotion recognition using recurrent neural networks with local attention","authors":"Seyedmahdad Mirsamadi, Emad Barsoum, Cha Zhang","doi":"10.1109/ICASSP.2017.7952552","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952552","url":null,"abstract":"Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128664871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 511
Infrasonic scene fingerprinting for authenticating speaker location
K. Aono, S. Chakrabartty, T. Yamasaki
Ambient infrasound, with frequency content well below 20 Hz, is known to carry robust navigation cues that can be exploited to authenticate the location of a speaker. Unfortunately, many mobile devices such as smartphones have been optimized to work in the human auditory range, thereby suppressing information in the infrasonic region. In this paper, we show that these ultra-low-frequency cues can still be extracted from a standard smartphone recording by using acceleration-based cepstral features. To validate our claim, we collected smartphone recordings from more than 30 different scenes and used the cues for scene fingerprinting. We report scene recognition rates in excess of 90%, and a feature-set analysis reveals the importance of the infrasonic signatures in achieving this state-of-the-art recognition performance.
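The sketch below illustrates one plausible way to compute cepstral features restricted to the infrasonic band of a recording: power spectrum, band limiting to below 20 Hz, log, then a DCT. The windowing, band limit, and coefficient count are illustrative and not necessarily the paper's acceleration-based front end.

```python
# Cepstral features over the infrasonic band of one long analysis frame; illustrative only.
import numpy as np
from scipy.fft import rfft, dct

def infrasonic_cepstrum(frame, sr, n_ceps=13, f_max=20.0):
    """frame: 1-D samples; sr: sample rate in Hz. Use long frames so the band below
    f_max contains enough spectral bins."""
    spectrum = np.abs(rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.linspace(0.0, sr / 2.0, len(spectrum))
    band = spectrum[freqs <= f_max]              # keep the infrasonic band only
    log_band = np.log(band + 1e-12)
    return dct(log_band, type=2, norm='ortho')[:n_ceps]
```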
{"title":"Infrasonic scene fingerprinting for authenticating speaker location","authors":"K. Aono, S. Chakrabartty, T. Yamasaki","doi":"10.1109/ICASSP.2017.7952178","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952178","url":null,"abstract":"Ambient infrasound with frequency ranges well below 20 Hz is known to carry robust navigation cues that can be exploited to authenticate the location of a speaker. Unfortunately, many of the mobile devices like smartphones have been optimized to work in the human auditory range, thereby suppressing information in the infrasonic region. In this paper, we show that these ultra-low frequency cues can still be extracted from a standard smartphone recording by using acceleration-based cepstral features. To validate our claim, we have collected smartphone recordings from more than 30 different scenes and used the cues for scene fingerprinting. We report scene recognition rates in excess of 90% and a feature set analysis reveals the importance of the infrasonic signatures towards achieving the state-of-the-art recognition performance.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Surrounding adaptive tone mapping in displayed images under ambient light
Lu Wang, Cheolkon Jung
In this paper, we propose surrounding-adaptive tone mapping for displayed images under ambient light. Under strong ambient light, images displayed on a screen are perceived as dark by the human eye, especially in dark regions. We address this ambient-light problem in mobile devices through brightness enhancement and adaptive tone mapping. First, we perform brightness compensation in dark regions using the Bartleson-Breneman equation, which describes the lightness effect on an image under different surrounding illuminations. Then, we perform adaptive tone mapping to reproduce the whole image under various ambient light conditions. The adaptive tone mapping combines human visual characteristics with a tone mapping operation that considers the influence of ambient light. Experimental results demonstrate that the proposed method significantly enhances the readability of displayed images under different surrounding light conditions.
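As a toy sketch of the overall flow, the code below boosts dark regions more aggressively as ambient light grows and then applies a simple global tone curve; the gain shape and the lux-to-strength mapping are stand-ins for the Bartleson-Breneman-based compensation in the paper, and lux_ref is a hypothetical reference level.

```python
# Ambient-light-dependent dark-region boost plus a simple adaptive tone curve; illustrative only.
import numpy as np

def ambient_tone_map(img, ambient_lux, lux_ref=500.0):
    """img: float array in [0, 1]; ambient_lux: measured ambient illumination."""
    strength = np.clip(ambient_lux / lux_ref, 0.0, 2.0)   # stronger light -> more boost
    # Brightness compensation: a gain that is largest for dark pixels.
    gain = 1.0 + strength * (1.0 - img)
    boosted = np.clip(img * gain, 0.0, 1.0)
    # Adaptive tone curve: a gamma that brightens the image as ambient light grows.
    gamma = 1.0 / (1.0 + 0.5 * strength)
    return boosted ** gamma
```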
{"title":"Surrounding adaptive tone mapping in displayed images under ambient light","authors":"Lu Wang, Cheolkon Jung","doi":"10.1109/ICASSP.2017.7952505","DOIUrl":"https://doi.org/10.1109/ICASSP.2017.7952505","url":null,"abstract":"In this paper, we propose surrounding adaptive tone mapping in displayed images under ambient light. Under strong ambient light, the displayed images on the screen are darkly perceived by human eyes, especially in dark regions. We deal with the ambient light problem in mobile devices by brightness enhancement and adaptive tone mapping. First, we perform brightness compensation in dark regions using Bartleson-Breneman equation which represents lightness effect on the image under different surrounding illuminations. Then, we perform adaptive tone mapping to reproduce the whole image under various ambient light conditions. Adaptive tone mapping combines human visual characteristics with a tone mapping operation considering ambient light influence. Experimental results demonstrate that the proposed method significantly enhances the readability of displayed images under different surrounding light conditions.","PeriodicalId":118243,"journal":{"name":"2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114166618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5