Latest publications: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Non-parallel voice conversion using i-vector PLDA: towards unifying speaker verification and transformation
T. Kinnunen, Lauri Juvela, P. Alku, J. Yamagishi
Text-independent speaker verification (recognizing speakers regardless of content) and non-parallel voice conversion (transforming voice identities without requiring content-matched training utterances) are related problems. We adopt the i-vector method for voice conversion. An i-vector is a fixed-dimensional representation of a speech utterance that enables treating voice conversion in the utterance domain, as opposed to the frame domain. The high dimensionality (800) and small number of training utterances (24) necessitate using prior information about the speakers. We adopt probabilistic linear discriminant analysis (PLDA) for voice conversion. The proposed approach requires neither parallel utterances, transcriptions, nor time-alignment procedures at any stage.
DOI: 10.1109/ICASSP.2017.7953215
Citations: 70
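The utterance-domain view described above can be illustrated with a toy sketch: each utterance is summarized by one fixed-dimensional vector, and whole utterances are compared directly rather than frame by frame. This is a minimal sketch with synthetic vectors; actual PLDA scoring uses learned within- and between-speaker covariances, so the plain cosine scoring here is only a simplified stand-in.

```python
import numpy as np

def cosine_score(v1, v2):
    """Similarity between two fixed-dimensional utterance embeddings."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

rng = np.random.default_rng(0)
dim = 800  # the i-vector dimensionality reported in the abstract
speaker_mean = rng.normal(size=dim)                # hypothetical speaker factor
utt_a = speaker_mean + 0.1 * rng.normal(size=dim)  # two utterances of one speaker
utt_b = speaker_mean + 0.1 * rng.normal(size=dim)
utt_c = rng.normal(size=dim)                       # an unrelated speaker

same = cosine_score(utt_a, utt_b)  # high: shared speaker factor dominates
diff = cosine_score(utt_a, utt_c)  # near zero: independent vectors
```

Because comparison happens on one vector per utterance, no frame alignment between source and target utterances is ever needed.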
LDA-based context dependent recurrent neural network language model using document-based topic distribution of words
Md. Akmal Haidar, M. Kurimo
Adding context information into recurrent neural network language models (RNNLMs) has been investigated recently to improve the effectiveness of RNNLM learning. Conventionally, a fast approximate topic representation for a block of words is obtained from the corpus-based topic distribution of words under a latent Dirichlet allocation (LDA) model, and is then updated for each subsequent word using an exponential decay. However, words can represent different topics in different documents. In this paper, we form a document-based distribution over topics for each word using the LDA model and apply it in the computation of fast approximate exponentially decaying features. We show experimental results on the well-known Penn Treebank corpus and find that our approach outperforms the conventional LDA-based context RNNLM approach. Moreover, we carried out speech recognition experiments on the Wall Street Journal corpus and achieved word error rate (WER) improvements over the other approach.
DOI: 10.1109/ICASSP.2017.7953254
Citations: 5
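The exponentially decaying topic feature described above can be sketched directly: the context vector is a decayed running mixture of each incoming word's topic distribution. The two-topic word distributions and decay constant below are illustrative, not values from the paper.

```python
import numpy as np

def decayed_topic_features(words, word_topic, gamma=0.9):
    """Exponentially decaying context feature over per-word topic distributions.

    word_topic: dict mapping a word to its topic distribution (hypothetical toy
    values here; the paper derives these per document from an LDA model).
    """
    n_topics = len(next(iter(word_topic.values())))
    f = np.zeros(n_topics)
    history = []
    for w in words:
        # decay the old context and mix in the current word's topics
        f = gamma * f + (1.0 - gamma) * np.asarray(word_topic[w])
        history.append(f.copy())
    return history

# toy 2-topic distributions (finance vs. music), purely illustrative
word_topic = {"stock": [0.9, 0.1], "market": [0.8, 0.2], "guitar": [0.1, 0.9]}
feats = decayed_topic_features(["stock", "market", "guitar"], word_topic)
```

After two finance words the first topic dominates the context feature; a single later word shifts it only gradually, which is the smoothing behaviour the decay is meant to provide.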
Privacy preserving encrypted phonetic search of speech data
C. Glackin, G. Chollet, Nazim Dugan, Nigel Cannings, J. Wall, Shahzaib Tahir, I. G. Ray, M. Rajarajan
This paper presents a strategy for enabling speech recognition to be performed in the cloud whilst preserving the privacy of users. The approach advocates a demarcation of responsibilities between the client and server-side components for performing the speech recognition task. On the client-side resides the acoustic model, which symbolically encodes the audio and encrypts the data before uploading to the server. The server-side then employs searchable encryption to enable the phonetic search of the speech content. Some preliminary results for speech encoding and searchable encryption are presented.
DOI: 10.1109/ICASSP.2017.7953391
Citations: 31
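The client/server split above can be sketched with deterministic HMAC tokens over phoneme trigrams: the client encodes audio to phoneme symbols and uploads only keyed tokens, so the server can match a query without ever seeing the underlying phonemes. Real searchable-encryption schemes are considerably more sophisticated; the key, phoneme strings, and trigram choice below are illustrative assumptions.

```python
import hmac
import hashlib

KEY = b"client-secret-key"  # held by the client only (illustrative)

def token(phonemes):
    """Deterministic searchable token for one phoneme trigram."""
    return hmac.new(KEY, " ".join(phonemes).encode(), hashlib.sha256).hexdigest()

def index_utterance(phoneme_seq):
    """Client side: reduce an utterance to a set of opaque trigram tokens."""
    return {token(phoneme_seq[i:i + 3]) for i in range(len(phoneme_seq) - 2)}

def server_search(index, query_tokens):
    """Server side: token-set intersection, with no access to the phonemes."""
    return bool(index & query_tokens)

index = index_utterance(["HH", "AH", "L", "OW", "W", "ER", "L", "D"])
query = index_utterance(["HH", "AH", "L"])  # phonetic query fragment
```

Deterministic tokens leak matching patterns, which is exactly the trade-off searchable encryption research tries to tighten; the sketch only shows the division of responsibilities.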
Mood detection from daily conversational speech using denoising autoencoder and LSTM
Kun-Yi Huang, Chung-Hsien Wu, Ming-Hsiang Su, Hsiang-Chi Fu
In current studies, an extended subjective self-report method is generally used for measuring emotions. Even though it is commonly accepted that speech emotion perceived by the listener is close to the intended emotion conveyed by the speaker, research has indicated that there still remains a mismatch between them. In addition, the individuals with different personalities generally have different emotion expressions. Based on the investigation, in this study, a support vector machine (SVM)-based emotion model is first developed to detect perceived emotion from daily conversational speech. Then, a denoising autoencoder (DAE) is used to construct an emotion conversion model to characterize the relationship between the perceived emotion and the expressed emotion of the subject for a specific personality. Finally, a long short-term memory (LSTM)-based mood model is constructed to model the temporal fluctuation of speech emotions for mood detection. Experimental results show that the proposed method achieved a detection accuracy of 64.5%, improving by 5.0% compared to the HMM-based method.
DOI: 10.1109/ICASSP.2017.7953133
Citations: 13
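The temporal stage of the pipeline, modelling how emotion fluctuates across utterances for mood detection, relies on an LSTM. A minimal numpy forward pass of a single LSTM cell over a toy sequence of utterance-level emotion features (random weights and dimensions, purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, params):
    """One LSTM step over an utterance-level emotion feature vector x."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wg, Ug, bg = params
    i = sigmoid(Wi @ x + Ui @ h + bi)   # input gate
    f = sigmoid(Wf @ x + Uf @ h + bf)   # forget gate
    o = sigmoid(Wo @ x + Uo @ h + bo)   # output gate
    g = np.tanh(Wg @ x + Ug @ h + bg)   # candidate cell state
    c = f * c + i * g                   # carry long-term mood context
    h = o * np.tanh(c)                  # per-utterance hidden state
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 4, 3  # toy sizes: 4-dim emotion feature, 3-dim hidden state
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_h, d_in), (d_h, d_h), d_h] * 4]
h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # five utterances from one day
    h, c = lstm_step(x, h, c, params)
```

The final hidden state summarizes the day's emotion trajectory and would feed a mood classifier in the full system.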
Decorrelation for audio object coding
L. Villemoes, T. Hirvonen, H. Purnhagen
Object-based representations of audio content are increasingly used in entertainment systems to deliver immersive and personalized experiences. Efficient storage and transmission of such content can be achieved by joint object coding algorithms that convey a reduced number of downmix signals together with parametric side information that enables object reconstruction in the decoder. This paper presents an approach to improve the performance of joint object coding by adding one or more decorrelators to the decoding process. Listening test results illustrate the performance as a function of the number of decorrelators. The method is adopted as part of the Dolby AC-4 system standardized by ETSI.
DOI: 10.1109/ICASSP.2017.7952247
Citations: 8
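The abstract describes the decorrelators only at a high level, so the sketch below uses a Schroeder all-pass filter, a standard decorrelator building block: it leaves the magnitude spectrum of the downmix untouched while altering the phase response, yielding an output with low correlation to its input. The delay and gain values are illustrative assumptions, not AC-4 parameters.

```python
import numpy as np

def allpass_decorrelator(x, delay=113, g=0.5):
    """Schroeder all-pass filter: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay].

    The magnitude response is flat, so the output keeps the downmix
    spectrum while its waveform is decorrelated from the input.
    """
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

rng = np.random.default_rng(2)
x = rng.normal(size=8000)          # stand-in for a downmix signal
y = allpass_decorrelator(x)
corr = np.corrcoef(x, y)[0, 1]     # low lag-zero cross-correlation
```

Mixing such decorrelated copies into the reconstruction restores inter-object incoherence that a small number of downmix channels alone cannot carry.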
Data analysis as a web service: A case study using IoT sensor data
Alireza Ahrabian, Ş. Kolozali, Shirin Enshaeifar, C. C. Took, P. Barnaghi
The advent of the Internet of Things has resulted in the development of infrastructure for capturing and storing data from domains ranging from smart devices (e.g. smartphones) to smart cities. This data is often publicly available and has enabled a wider range of data consumers to utilise such data sets for applications ranging from scientific experimentation to enhancing commercial activity for businesses. Accordingly, this has created a need for data analysis tools that are both simple to use and effective for a given data set. To this end, we introduce data analysis tools as a web service, enabling the data consumer to process data over the internet with a simple HTTP request. By providing such tools as a web service, we demonstrate the potential of such a system to aid both the advanced and the novice data consumer. Furthermore, this work provides a use-case example of the proposed tool on publicly available data extracted from the smart-city CityPulse IoT project.
DOI: 10.1109/ICASSP.2017.7953308
Citations: 15
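The "simple HTTP request" workflow can be sketched with the Python standard library: a toy analysis function exposed through an HTTP handler that accepts posted JSON sensor readings and returns summary statistics. The endpoint, port, and payload format are assumptions for illustration, not the paper's actual API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def analyse(payload):
    """Toy analysis: summary statistics over a list of sensor readings."""
    values = payload["values"]
    return {"count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values)}

class AnalysisHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # the data consumer POSTs JSON such as {"values": [20.5, 21.0, 19.5]}
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = json.dumps(analyse(json.loads(body))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(result)

# HTTPServer(("", 8080), AnalysisHandler).serve_forever()  # uncomment to serve
result = analyse({"values": [20.5, 21.0, 19.5]})  # direct call for illustration
```

The same pattern generalizes to richer analyses: the consumer needs only an HTTP client, while the computation and its dependencies live entirely on the server.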
Feature++: Cross dimension feature fusion for road detection
Wenli He, Guorong Cai, Zhun Zhong, Songzhi Su
Road detection is a key component of Advanced Driving Assistance Systems, providing valid free space and candidate object regions for vehicles. Mainstream road detection methods have focused on extracting discriminative features. In this paper, we propose a robust feature fusion framework, called "Feature++", which combines superpixel features with 3D features extracted from stereo images. A neural network classifier is then trained to decide whether a superpixel is a road region or not. Finally, the classification results are further refined by a conditional random field. Experiments conducted on the KITTI ROAD benchmark show that the proposed "Feature++" method outperforms most manually designed features and is comparable with state-of-the-art methods based on deep learning architectures.
DOI: 10.1109/ICASSP.2017.7952439
Citations: 1
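The cross-dimension fusion step can be sketched as plain concatenation of per-superpixel descriptors followed by a classifier. The sketch below uses toy 2-D appearance and 1-D geometry features and a tiny logistic-regression classifier standing in for the paper's neural network and CRF refinement; all feature values are synthetic.

```python
import numpy as np

def fuse_features(appearance, geometry):
    """Concatenate 2D appearance and 3D geometry descriptors of a superpixel."""
    return np.concatenate([appearance, geometry])

def train_logreg(X, y, lr=0.5, steps=500):
    """Tiny logistic-regression road/non-road classifier on fused features."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(3)
# toy data: road superpixels are darker and flat (low height); others are not
road = np.array([fuse_features(rng.normal(0.3, 0.05, 2),
                               rng.normal(0.0, 0.05, 1)) for _ in range(50)])
other = np.array([fuse_features(rng.normal(0.7, 0.05, 2),
                                rng.normal(0.5, 0.05, 1)) for _ in range(50)])
X = np.vstack([road, other])
y = np.concatenate([np.ones(50), np.zeros(50)])
w, b = train_logreg(X, y)
acc = ((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

The point of fusion is visible even in this toy: neither the appearance pair nor the geometry value alone need be decisive, but the concatenated vector gives the classifier both cues at once.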
Implementation of efficient, low power deep neural networks on next-generation intel client platforms
M. Deisher, A. Polonski
In recent years many signal processing applications involving classification, detection, and inference have enjoyed substantial accuracy improvements due to advances in deep learning. At the same time, the “Internet of Things” has become an important class of devices. Although the paradigm of local sensing and remote inference has been very successful (e.g., Apple Siri, Google Now, Microsoft Cortana, Amazon Alexa, and others) there exist many valuable applications where sensing duration is very long, the cost of communication is high, and scaling to millions or billions of devices is not practical. In such cases, local inference “at the edge” is attractive provided it can be done without compromising accuracy and within the thermal envelope and expected battery life of the edge device.
DOI: 10.1109/ICASSP.2017.8005304
Citations: 11
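The abstract does not detail the implementation, so as one representative technique for fitting inference into an edge device's power and memory envelope, the sketch below shows symmetric per-tensor int8 weight quantization, which cuts weight storage 4x relative to float32. This is a generic illustration, not a claim about the mechanism used on the Intel platforms.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for accuracy checks."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(scale=0.2, size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).max()  # bounded by scale / 2
```

Keeping the per-weight error below half a quantization step is what lets such models retain accuracy while integer arithmetic reduces energy per inference.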
Time of arrival disambiguation using the linear Radon transform
Youssef El Baba, A. Walther, Emanuël Habets
Echo labeling, the challenging task of assigning acoustic reflections to image sources, is equivalent to the highly-important disambiguation task in room geometry inference. A method using the Radon transform, an image processing tool, is proposed to address this challenge. The method relies on acoustic wavefront detection in room impulse response stacks, obtained with a uniform linear array of loudspeakers and one microphone. We show in our experiments that the proposed method can both label and detect echoes.
DOI: 10.1109/ICASSP.2017.7952127
Citations: 7
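An acoustic wavefront arriving across a uniform linear array appears as a (roughly) straight line in the channel-by-time stack of room impulse responses, and the linear Radon transform (slant stack) concentrates such a line into a single peak. A minimal discrete sketch on a synthetic stack, with illustrative slope and onset values:

```python
import numpy as np

def linear_radon(image, slopes, intercepts):
    """Slant-stack Radon transform: sum image[m, t] along t = slope*m + b."""
    n_ch, n_t = image.shape
    out = np.zeros((len(slopes), len(intercepts)))
    for i, s in enumerate(slopes):
        for j, b in enumerate(intercepts):
            t = np.round(s * np.arange(n_ch) + b).astype(int)
            valid = (t >= 0) & (t < n_t)
            out[i, j] = image[np.arange(n_ch)[valid], t[valid]].sum()
    return out

# synthetic RIR stack: 8 microphones, one wavefront with slope
# 2 samples/channel and onset at sample 10 on the first channel
stack = np.zeros((8, 64))
for m in range(8):
    stack[m, 10 + 2 * m] = 1.0

radon = linear_radon(stack, slopes=np.arange(5), intercepts=np.arange(40))
peak = np.unravel_index(np.argmax(radon), radon.shape)  # (slope, onset) bin
```

The peak's coordinates recover both the arrival slope (direction) and the onset time, which is what allows reflections to be labeled consistently across channels.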
Key frames extraction using graph modularity clustering for efficient video summarization
Hana Gharbi, S. Bahroun, M. Massaoudi, E. Zagrouba
Keyframe extraction is one of the basic procedures in video retrieval and summarization. It consists in presenting an abstract of the video with its most representative frames. This paper presents an efficient keyframe extraction approach based on local description and graph modularity clustering. The first step is to generate a set of candidate keyframes using a windowing rule in order to reduce the data to be examined. After that, interest points are detected in this set of images. Then the repeatability between each pair of images in the candidate set is computed, and these values are stored in a matrix that we call the repeatability matrix. Finally, the repeatability matrix is modelled by an oriented graph, and keyframes are selected using the graph modularity clustering principle. The experiments show that this method succeeds in extracting keyframes while preserving the salient content of the video. Further, we obtain good values in terms of precision, PSNR and compression rate.
DOI: 10.1109/ICASSP.2017.7952407
Citations: 19
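Modularity measures how much stronger the similarity is inside frame clusters than expected at random, so partitions of the repeatability graph with high modularity group frames into coherent shots, from which keyframes can be picked. A minimal sketch of Newman's modularity on a toy frame-similarity graph (undirected and synthetic here, unlike the oriented repeatability graph in the paper):

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition of a weighted similarity graph."""
    k = A.sum(axis=1)      # weighted degree of each frame
    two_m = A.sum()        # total edge weight (counted twice, undirected)
    same = np.equal.outer(labels, labels)  # pairs in the same cluster
    return float((same * (A - np.outer(k, k) / two_m)).sum() / two_m)

# toy repeatability graph: two shots of 3 mutually similar frames each,
# with one weak cross-shot similarity
A = np.zeros((6, 6))
for group in [(0, 1, 2), (3, 4, 5)]:
    for i in group:
        for j in group:
            if i != j:
                A[i, j] = 1.0
A[2, 3] = A[3, 2] = 0.2

good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))  # split along shots
bad = modularity(A, np.array([0, 1, 0, 1, 0, 1]))   # arbitrary split
```

A modularity-maximizing clustering therefore recovers the shot structure, and one representative frame per high-modularity cluster yields the summary.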