IEEE Transactions on Audio Speech and Language Processing最新文献

英文中文

Declipping of Audio Signals Using Perceptual Compressed Sensing 基于感知压缩感知的音频信号降噪

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2281570

Bruno Defraene, Naim Mansour, S. D. Hertogh, T. Waterschoot, M. Diehl, M. Moonen

The restoration of clipped audio signals, commonly known as declipping, is important to achieve an improved level of audio quality in many audio applications. In this paper, a novel declipping algorithm is presented, jointly based on the theory of compressed sensing (CS) and on well-established properties of human auditory perception. Declipping is formulated as a sparse signal recovery problem using the CS framework. By additionally exploiting knowledge of human auditory perception, a novel perceptual compressed sensing (PCS) framework is devised. A PCS-based declipping algorithm is proposed which uses $ell _{1}$-norm type reconstruction. Comparative objective and subjective evaluation experiments reveal a significant audio quality increase for the proposed PCS-based declipping algorithm compared to CS-based declipping algorithms.

在许多音频应用中，恢复被截断的音频信号(通常称为衰落)对于提高音频质量水平是很重要的。本文基于压缩感知(CS)理论和人类听觉感知特性，提出了一种新的衰落算法。使用CS框架将衰落表述为一个稀疏信号恢复问题。利用听觉感知知识，设计了一种新的感知压缩感知(PCS)框架。提出了一种基于pc的衰落算法，该算法采用$ well _{1}$范数类型重构。对比客观和主观评价实验表明，与基于cs的衰落算法相比，本文提出的基于pcs的衰落算法的音频质量有显著提高。

引用次数: 56

Understanding Effects of Subjectivity in Measuring Chord Estimation Accuracy 认识主观性对弦估计精度测量的影响

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2280218

Y. Ni, Matt McVicar, Raúl Santos-Rodríguez, T. D. Bie

To assess the performance of an automatic chord estimation system, reference annotations are indispensable. However, owing to the complexity of music and the sometimes ambiguous harmonic structure of polyphonic music, chord annotations are inherently subjective, and as a result any derived accuracy estimates will be subjective as well. In this paper, we investigate the extent of the confounding effect of subjectivity in reference annotations. Our results show that this effect is important, and they affect different types of automatic chord estimation systems in different ways. Our results have implications for research on automatic chord estimation, but also on other fields that evaluate performance by comparing against human provided annotations that are confounded by subjectivity.

为了评估自动弦估计系统的性能，参考注释是必不可少的。然而，由于音乐的复杂性和复调音乐中有时模棱两可的和声结构，和弦注释本质上是主观的，因此任何导出的准确性估计也将是主观的。本文研究了参考文献注释中主观性混淆效应的程度。我们的研究结果表明，这种影响是重要的，并且它们以不同的方式影响不同类型的自动和弦估计系统。我们的研究结果不仅对自动和弦估计的研究有启示意义，而且对其他通过与人为提供的受主观性混淆的注释进行比较来评估性能的领域也有启示意义。

引用次数: 42

A Bag of Systems Representation for Music Auto-Tagging 音乐自动标注的一种系统表示

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2279318

Katherine Ellis, E. Coviello, Antoni B. Chan, Gert R. G. Lanckriet

We present a content-based automatic tagging system for music that relies on a high-level, concise “Bag of Systems” (BoS) representation of the characteristics of a musical piece. The BoS representation leverages a rich dictionary of musical codewords, where each codeword is a generative model that captures timbral and temporal characteristics of music. Songs are represented as a BoS histogram over codewords, which allows for the use of traditional algorithms for text document retrieval to perform auto-tagging. Compared to estimating a single generative model to directly capture the musical characteristics of songs associated with a tag, the BoS approach offers the flexibility to combine different generative models at various time resolutions through the selection of the BoS codewords. Additionally, decoupling the modeling of audio characteristics from the modeling of tag-specific patterns makes BoS a more robust and rich representation of music. Experiments show that this leads to superior auto-tagging performance.

我们提出了一个基于内容的音乐自动标记系统，它依赖于一个高层次的、简洁的“系统包”(BoS)来表示音乐作品的特征。BoS表示利用了丰富的音乐码字字典，其中每个码字都是一个生成模型，可以捕获音乐的音色和时间特征。歌曲表示为代码字上的BoS直方图，这允许使用传统的文本文档检索算法来执行自动标记。与估计单个生成模型来直接捕获与标签相关的歌曲的音乐特征相比，BoS方法通过选择BoS码字，提供了在不同时间分辨率下组合不同生成模型的灵活性。此外，将音频特征的建模与标签特定模式的建模解耦，使BoS成为更健壮、更丰富的音乐表示。实验表明，这可以提高自动标记的性能。

引用次数: 22

Geometry-Based Spatial Sound Acquisition Using Distributed Microphone Arrays 基于几何的分布式麦克风阵列空间声音采集

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2280210

O. Thiergart, G. D. Galdo, Maja Taseska, Emanuël Habets

Traditional spatial sound acquisition aims at capturing a sound field with multiple microphones such that at the reproduction side a listener can perceive the sound image as it was at the recording location. Standard techniques for spatial sound acquisition usually use spaced omnidirectional microphones or coincident directional microphones. Alternatively, microphone arrays and spatial filters can be used to capture the sound field. From a geometric point of view, the perspective of the sound field is fixed when using such techniques. In this paper, a geometry-based spatial sound acquisition technique is proposed to compute virtual microphone signals that manifest a different perspective of the sound field. The proposed technique uses a parametric sound field model that is formulated in the time-frequency domain. It is assumed that each time-frequency instant of a microphone signal can be decomposed into one direct and one diffuse sound component. It is further assumed that the direct component is the response of a single isotropic point-like source (IPLS) of which the position is estimated for each time-frequency instant using distributed microphone arrays. Given the sound components and the position of the IPLS, it is possible to synthesize a signal that corresponds to a virtual microphone at an arbitrary position and with an arbitrary pick-up pattern.

传统的空间声音采集旨在用多个麦克风捕获声场，以便在再现侧听者可以感知到在录制位置的声音图像。空间声音采集的标准技术通常使用间隔全向麦克风或同步定向麦克风。另外，可以使用麦克风阵列和空间滤波器来捕获声场。从几何角度来看，使用这种技术时，声场的角度是固定的。本文提出了一种基于几何的空间声音采集技术来计算显示声场不同角度的虚拟麦克风信号。所提出的技术使用了一个参数声场模型，该模型是在时频域中制定的。假设麦克风信号的每个时频瞬间可以分解为一个直接声分量和一个扩散声分量。进一步假设直接分量是单个各向同性点源(IPLS)的响应，使用分布式麦克风阵列估计其每个时频瞬间的位置。给定声音组件和IPLS的位置，可以合成一个与任意位置的虚拟麦克风相对应的信号，并具有任意拾取模式。

{"title":"Geometry-Based Spatial Sound Acquisition Using Distributed Microphone Arrays","authors":"O. Thiergart, G. D. Galdo, Maja Taseska, Emanuël Habets","doi":"10.1109/TASL.2013.2280210","DOIUrl":"https://doi.org/10.1109/TASL.2013.2280210","url":null,"abstract":"Traditional spatial sound acquisition aims at capturing a sound field with multiple microphones such that at the reproduction side a listener can perceive the sound image as it was at the recording location. Standard techniques for spatial sound acquisition usually use spaced omnidirectional microphones or coincident directional microphones. Alternatively, microphone arrays and spatial filters can be used to capture the sound field. From a geometric point of view, the perspective of the sound field is fixed when using such techniques. In this paper, a geometry-based spatial sound acquisition technique is proposed to compute virtual microphone signals that manifest a different perspective of the sound field. The proposed technique uses a parametric sound field model that is formulated in the time-frequency domain. It is assumed that each time-frequency instant of a microphone signal can be decomposed into one direct and one diffuse sound component. It is further assumed that the direct component is the response of a single isotropic point-like source (IPLS) of which the position is estimated for each time-frequency instant using distributed microphone arrays. Given the sound components and the position of the IPLS, it is possible to synthesize a signal that corresponds to a virtual microphone at an arbitrary position and with an arbitrary pick-up pattern.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2583-2594"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2280210","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 45

Source/Filter Factorial Hidden Markov Model, With Application to Pitch and Formant Tracking 源/滤波器阶乘隐马尔可夫模型及其在基音和峰跟踪中的应用

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2277941

Jean-Louis Durrieu, J. Thiran

Tracking vocal tract formant frequencies $(f_{p})$ and estimating the fundamental frequency $(f_{0})$ are two tracking problems that have been tackled in many speech processing works, often independently, with applications to articulatory parameters estimations, speech analysis/synthesis or linguistics. Many works assume an auto-regressive (AR) model to fit the spectral envelope, hence indirectly estimating the formant tracks from the AR parameters. However, directly estimating the formant frequencies, or equivalently the poles of the AR filter, allows to further model the smoothness of the desired tracks. In this paper, we propose a Factorial Hidden Markov Model combined with a vocal source/filter model, with parameters naturally encoding the $f_{0}$ and $f_{p}$ tracks. Two algorithms are proposed, with two different strategies: first, a simplification of the underlying model, with a parameter estimation based on variational methods, and second, a sparse decomposition of the signal, based on Non-negative Matrix Factorization methodology. The results are comparable to state-of-the-art formant tracking algorithms. With the use of a complete production model, the proposed systems provide robust formant tracks which can be used in various applications. The algorithms could also be extended to deal with multiple-speaker signals.

跟踪声道形成峰频率$(f_{p})$和估计基频$(f_{0})$是许多语音处理工作中已经解决的两个跟踪问题，通常是独立的，应用于发音参数估计，语音分析/合成或语言学。许多研究假设一个自回归(AR)模型来拟合光谱包络线，从而间接地从AR参数估计形成峰轨迹。然而，直接估计形成峰频率，或等效的AR滤波器的极点，允许进一步建模所需轨道的平滑度。在本文中，我们提出了一个阶乘隐马尔可夫模型与声源/滤波器模型相结合，参数自然编码$f_{0}$和$f_{p}$音轨。提出了两种算法，采用两种不同的策略:第一种是基于变分方法的参数估计的底层模型简化，第二种是基于非负矩阵分解方法的信号稀疏分解。其结果可与最先进的峰跟踪算法相媲美。由于使用了完整的生产模型，所提出的系统提供了可用于各种应用的鲁棒阵轨迹。该算法还可以扩展到处理多扬声器信号。

{"title":"Source/Filter Factorial Hidden Markov Model, With Application to Pitch and Formant Tracking","authors":"Jean-Louis Durrieu, J. Thiran","doi":"10.1109/TASL.2013.2277941","DOIUrl":"https://doi.org/10.1109/TASL.2013.2277941","url":null,"abstract":"Tracking vocal tract formant frequencies <formula formulatype=\"inline\"> <tex Notation=\"TeX\">$(f_{p})$</tex></formula> and estimating the fundamental frequency <formula formulatype=\"inline\"><tex Notation=\"TeX\">$(f_{0})$</tex> </formula> are two tracking problems that have been tackled in many speech processing works, often independently, with applications to articulatory parameters estimations, speech analysis/synthesis or linguistics. Many works assume an auto-regressive (AR) model to fit the spectral envelope, hence indirectly estimating the formant tracks from the AR parameters. However, directly estimating the formant frequencies, or equivalently the poles of the AR filter, allows to further model the smoothness of the desired tracks. In this paper, we propose a Factorial Hidden Markov Model combined with a vocal source/filter model, with parameters naturally encoding the <formula formulatype=\"inline\"><tex Notation=\"TeX\">$f_{0}$</tex></formula> and <formula formulatype=\"inline\"> <tex Notation=\"TeX\">$f_{p}$</tex></formula> tracks. Two algorithms are proposed, with two different strategies: first, a simplification of the underlying model, with a parameter estimation based on variational methods, and second, a sparse decomposition of the signal, based on Non-negative Matrix Factorization methodology. The results are comparable to state-of-the-art formant tracking algorithms. With the use of a complete production model, the proposed systems provide robust formant tracks which can be used in various applications. The algorithms could also be extended to deal with multiple-speaker signals.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2541-2553"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2277941","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

HMM Based Intermediate Matching Kernel for Classification of Sequential Patterns of Speech Using Support Vector Machines 基于HMM的中间匹配核支持向量机语音序列模式分类

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-12-01 DOI: 10.1109/TASL.2013.2279338

A. D. Dileep, C. Sekhar

In this paper, we address the issues in the design of an intermediate matching kernel (IMK) for classification of sequential patterns using support vector machine (SVM) based classifier for tasks such as speech recognition. Specifically, we address the issues in constructing a kernel for matching sequences of feature vectors extracted from the speech signal data of utterances. The codebook based IMK and Gaussian mixture model (GMM) based IMK have been proposed earlier for matching the varying length patterns represented as sets of features vectors for tasks such as image classification and speaker recognition. These methods consider the centers of clusters and the components of GMM as the virtual feature vectors used in the design of IMK. As these methods do not use sequence information in matching the patterns, these methods are not suitable for matching sequential patterns. We propose the hidden Markov model (HMM) based IMK for matching sequential patterns of varying length. We consider two approaches to design the HMM-based IMK. In the first approach, each of the two sequences to be matched is segmented into subsequences with each subsequence aligned to a state of the HMM. Then the HMM-based IMK is constructed as a combination of state-specific GMM-based IMKs that match the subsequences aligned with the particular states of the HMM. In the second approach, the HMM-based IMK is constructed without segmenting sequences, and by matching the local feature vectors selected using the responsibility terms that account for being in a state and generating the feature vectors by a component of the GMM of that state. We study the performance of the SVM based classifiers using the proposed HMM-based IMK for recognition of isolated utterances of E-set in English alphabet and recognition of consonent–vowel segments in Hindi language.

在本文中，我们解决了在使用基于支持向量机(SVM)的分类器对语音识别等任务进行序列模式分类的中间匹配核(IMK)设计中的问题。具体来说，我们解决了从话语的语音信号数据中提取特征向量序列的核匹配问题。基于码本的IMK和基于高斯混合模型(GMM)的IMK之前已经被提出，用于匹配表示为特征向量集的不同长度模式，用于图像分类和说话人识别等任务。这些方法将聚类的中心和GMM的分量作为虚拟特征向量用于IMK的设计。由于这些方法在匹配模式时没有使用序列信息，因此不适合匹配序列模式。我们提出了基于隐马尔可夫模型(HMM)的IMK来匹配变长度的序列模式。我们考虑了两种方法来设计基于hmm的IMK。在第一种方法中，要匹配的两个序列中的每一个都被分割成子序列，每个子序列与HMM的状态对齐。然后，将基于HMM的IMK构建为与HMM的特定状态对齐的子序列相匹配的特定状态的基于gmm的IMK的组合。在第二种方法中，基于hmm的IMK是在没有分割序列的情况下构建的，通过匹配使用说明处于某种状态的责任项选择的局部特征向量，并通过该状态的GMM的一个组件生成特征向量。我们研究了基于SVM的分类器在识别英语字母中E-set的孤立发音和识别印地语中辅音-元音片段方面的性能。

{"title":"HMM Based Intermediate Matching Kernel for Classification of Sequential Patterns of Speech Using Support Vector Machines","authors":"A. D. Dileep, C. Sekhar","doi":"10.1109/TASL.2013.2279338","DOIUrl":"https://doi.org/10.1109/TASL.2013.2279338","url":null,"abstract":"In this paper, we address the issues in the design of an intermediate matching kernel (IMK) for classification of sequential patterns using support vector machine (SVM) based classifier for tasks such as speech recognition. Specifically, we address the issues in constructing a kernel for matching sequences of feature vectors extracted from the speech signal data of utterances. The codebook based IMK and Gaussian mixture model (GMM) based IMK have been proposed earlier for matching the varying length patterns represented as sets of features vectors for tasks such as image classification and speaker recognition. These methods consider the centers of clusters and the components of GMM as the virtual feature vectors used in the design of IMK. As these methods do not use sequence information in matching the patterns, these methods are not suitable for matching sequential patterns. We propose the hidden Markov model (HMM) based IMK for matching sequential patterns of varying length. We consider two approaches to design the HMM-based IMK. In the first approach, each of the two sequences to be matched is segmented into subsequences with each subsequence aligned to a state of the HMM. Then the HMM-based IMK is constructed as a combination of state-specific GMM-based IMKs that match the subsequences aligned with the particular states of the HMM. In the second approach, the HMM-based IMK is constructed without segmenting sequences, and by matching the local feature vectors selected using the responsibility terms that account for being in a state and generating the feature vectors by a component of the GMM of that state. We study the performance of the SVM based classifiers using the proposed HMM-based IMK for recognition of isolated utterances of E-set in English alphabet and recognition of consonent–vowel segments in Hindi language.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2570-2582"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2279338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62892483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 15

Improving Graph-Based Dependency Parsing Models With Dependency Language Models 用依赖语言模型改进基于图的依赖解析模型

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2273715

Min Zhang, Wenliang Chen, Xiangyu Duan, Rong Zhang

For graph-based dependency parsing, how to enrich high-order features without increasing decoding complexity is a very challenging problem. To solve this problem, this paper presents an approach to representing high-order features for graph-based dependency parsing models using a dependency language model and beam search. Firstly, we use a baseline parser to parse a large-amount of unannotated data. Then we build the dependency language model (DLM) on the auto-parsed data. A set of new features is represented based on the DLM. Finally, we integrate the DLM-based features into the parsing model during decoding by beam search. We also utilize the features in bilingual text (bitext) parsing models. The main advantages of our approach are: 1) we utilize rich high-order features defined over a view of large scope and additional large raw corpus; 2) our approach does not increase the decoding complexity. We evaluate the proposed approach on the monotext and bitext parsing tasks. In the monotext parsing task, we conduct the experiments on Chinese and English data. The experimental results show that our new parser achieves the best accuracy on the Chinese data and comparable accuracy with the best known systems on the English data. In the bitext parsing task, we conduct the experiments on a Chinese-English bilingual data and our score is the best reported so far.

对于基于图的依赖项解析，如何在不增加解码复杂度的前提下丰富高阶特征是一个非常具有挑战性的问题。为了解决这一问题，本文提出了一种利用依赖语言模型和束搜索来表示基于图的依赖解析模型的高阶特征的方法。首先，我们使用基线解析器解析大量未注释的数据。然后在自动解析数据的基础上建立依赖语言模型(DLM)。基于DLM表示了一组新特性。最后，在波束搜索解码过程中，将基于dlm的特征整合到解析模型中。我们还利用了双语文本(bitext)解析模型的特征。我们的方法的主要优点是:1)我们利用在大范围和额外的大型原始语料库的视图上定义的丰富的高阶特征;2)我们的方法不会增加解码的复杂度。我们在单文本和双文本解析任务上评估了所提出的方法。在单文本解析任务中，我们对中文和英文数据进行了实验。实验结果表明，我们的解析器在中文数据上达到了最好的准确率，在英文数据上的准确率与目前最知名的系统相当。在文本解析任务中，我们在一个中英文双语数据上进行了实验，我们的成绩是目前报道的最好的。

{"title":"Improving Graph-Based Dependency Parsing Models With Dependency Language Models","authors":"Min Zhang, Wenliang Chen, Xiangyu Duan, Rong Zhang","doi":"10.1109/TASL.2013.2273715","DOIUrl":"https://doi.org/10.1109/TASL.2013.2273715","url":null,"abstract":"For graph-based dependency parsing, how to enrich high-order features without increasing decoding complexity is a very challenging problem. To solve this problem, this paper presents an approach to representing high-order features for graph-based dependency parsing models using a dependency language model and beam search. Firstly, we use a baseline parser to parse a large-amount of unannotated data. Then we build the dependency language model (DLM) on the auto-parsed data. A set of new features is represented based on the DLM. Finally, we integrate the DLM-based features into the parsing model during decoding by beam search. We also utilize the features in bilingual text (bitext) parsing models. The main advantages of our approach are: 1) we utilize rich high-order features defined over a view of large scope and additional large raw corpus; 2) our approach does not increase the decoding complexity. We evaluate the proposed approach on the monotext and bitext parsing tasks. In the monotext parsing task, we conduct the experiments on Chinese and English data. The experimental results show that our new parser achieves the best accuracy on the Chinese data and comparable accuracy with the best known systems on the English data. In the bitext parsing task, we conduct the experiments on a Chinese-English bilingual data and our score is the best reported so far.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"14 1","pages":"2313-2323"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2273715","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Acoustic Modeling With Hierarchical Reservoirs 分层储层声学建模

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2280209

Fabian Triefenbach, A. Jalalvand, Kris Demuynck, J. Martens

Accurate acoustic modeling is an essential requirement of a state-of-the-art continuous speech recognizer. The Acoustic Model (AM) describes the relation between the observed speech signal and the non-observable sequence of phonetic units uttered by the speaker. Nowadays, most recognizers use Hidden Markov Models (HMMs) in combination with Gaussian Mixture Models (GMMs) to model the acoustics, but neural-based architectures are on the rise again. In this work, the recently introduced Reservoir Computing (RC) paradigm is used for acoustic modeling. A reservoir is a fixed - and thus non-trained - Recurrent Neural Network (RNN) that is combined with a trained linear model. This approach combines the ability of an RNN to model the recent past of the input sequence with a simple and reliable training procedure. It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.

准确的声学建模是最先进的连续语音识别器的基本要求。声学模型(AM)描述了观察到的语音信号与说话人发出的不可观察的语音单位序列之间的关系。目前，大多数识别器使用隐马尔可夫模型(hmm)结合高斯混合模型(GMMs)来建模声学，但基于神经的架构再次兴起。在这项工作中，最近引入的储层计算(RC)范式被用于声学建模。一个水库是一个固定的，因此是未经训练的循环神经网络(RNN)，它与一个训练好的线性模型相结合。这种方法结合了RNN对输入序列最近的过去进行建模的能力和简单可靠的训练过程。研究表明，简单的基于储层的AMs实现了合理的电话识别，而深层分层和双向储层架构在著名的TIMIT任务上的电话错误率(PER)为23.1%，非常具有竞争优势。

引用次数: 72

Robust Ultra-Low Latency Soft-Decision Decoding of Linear PCM Audio 线性PCM音频的鲁棒超低延迟软判决解码

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2273716

Florian Pflug, T. Fingscheidt

Applications such as professional wireless digital microphones require a transmission of practically uncoded high-quality audio with ultra-low latency on the one hand and robustness to error-prone channels on the other hand. The delay restrictions, however, prohibit the utilization of efficient block or convolutional channel codes for error protection. The contribution of this work is fourfold: We revise and summarize concisely a Bayesian framework for soft-decision audio decoding and present three novel approaches to (almost) latency-free robust decoding of uncompressed audio. Bit reliability information from the transmission channel is exploited, as well as short-term and long-term residual redundancy within the audio signal, and optionally some explicit redundancy in terms of a sample-individual block code. In all cases we utilize variants of higher-order linear prediction to compute prediction probabilities in three novel ways: Firstly by employing a serial cascade of multiple predictors, secondly by exploiting explicit redundancy in form of parity bits, and thirdly by utilizing an interpolative forward/backward prediction algorithm. The first two presented approaches work fully delayless, while the third one introduces an ultra-low algorithmic delay of just a few samples. The effectiveness of the proposed algorithms is proven in simulations with BPSK and typical digital microphone FSK modulation schemes on AWGN and bursty fading channels.

专业无线数字麦克风等应用需要传输几乎未编码的高质量音频，一方面具有超低延迟，另一方面具有对易出错通道的鲁棒性。延迟限制，然而，禁止使用有效的块或卷积信道码进行错误保护。这项工作的贡献有四个方面:我们修改并简要总结了用于软判决音频解码的贝叶斯框架，并提出了三种(几乎)无延迟的未压缩音频鲁棒解码的新方法。利用传输信道的位可靠性信息，以及音频信号中的短期和长期剩余冗余，以及可选的一些样本单个块码的显式冗余。在所有情况下，我们利用高阶线性预测的变体以三种新颖的方式计算预测概率:首先通过采用多个预测器的串行级联，其次通过利用奇偶校验位形式的显式冗余，第三通过利用内插式前向/后向预测算法。前两种方法是完全无延迟的，而第三种方法引入了只有几个样本的超低算法延迟。在AWGN和突发衰落信道上，用BPSK和典型数字传声器FSK调制方案进行了仿真，验证了算法的有效性。

{"title":"Robust Ultra-Low Latency Soft-Decision Decoding of Linear PCM Audio","authors":"Florian Pflug, T. Fingscheidt","doi":"10.1109/TASL.2013.2273716","DOIUrl":"https://doi.org/10.1109/TASL.2013.2273716","url":null,"abstract":"Applications such as professional wireless digital microphones require a transmission of practically uncoded high-quality audio with ultra-low latency on the one hand and robustness to error-prone channels on the other hand. The delay restrictions, however, prohibit the utilization of efficient block or convolutional channel codes for error protection. The contribution of this work is fourfold: We revise and summarize concisely a Bayesian framework for soft-decision audio decoding and present three novel approaches to (almost) latency-free robust decoding of uncompressed audio. Bit reliability information from the transmission channel is exploited, as well as short-term and long-term residual redundancy within the audio signal, and optionally some explicit redundancy in terms of a sample-individual block code. In all cases we utilize variants of higher-order linear prediction to compute prediction probabilities in three novel ways: Firstly by employing a serial cascade of multiple predictors, secondly by exploiting explicit redundancy in form of parity bits, and thirdly by utilizing an interpolative forward/backward prediction algorithm. The first two presented approaches work fully delayless, while the third one introduces an ultra-low algorithmic delay of just a few samples. The effectiveness of the proposed algorithms is proven in simulations with BPSK and typical digital microphone FSK modulation schemes on AWGN and bursty fading channels.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2324-2336"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2273716","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

Scalable Speech Coding for IP Networks: Beyond iLBC IP网络的可扩展语音编码:超越iLBC

IEEE Transactions on Audio Speech and Language Processing

Pub Date : 2013-11-01 DOI: 10.1109/TASL.2013.2274694

Koji Seto, T. Ogunfunmi

High quality speech at low bit rates makes code excited linear prediction (CELP) the dominant choice for a narrowband coding technique despite the susceptibility to packet loss. One of the few techniques which received attention after the introduction of CELP coding technique is the internet low bitrate codec (iLBC) because of inherent high robustness to packet loss. Addition of rate flexibility and scalability makes the iLBC an attractive choice for voice communication over IP networks. In this paper, performance improvement schemes of multi-rate iLBC and its scalable structure are proposed, and the proposed codec enhanced from the previous work is re-designed based on the subjective listening quality instead of the objective quality. In particular, perceptual weighting and the modified discrete cosine transform (MDCT) with short overlap in weighted signal domain are employed along with the improved packet loss concealment (PLC) algorithm. The subjective evaluation results show that the speech quality of the proposed codec is equivalent to that of state-of-the-art codec, G.718, under both a clean channel condition and lossy channel conditions. This result is significant considering that development of the proposed codec is still in early stage.

低比特率下的高质量语音使得编码激发线性预测(CELP)成为窄带编码技术的主要选择，尽管它容易丢包。internet低比特率编解码器(internet low bitrate codec, iLBC)由于其对丢包具有较高的鲁棒性，在引入CELP编码技术后受到关注。速率灵活性和可扩展性使iLBC成为IP网络上语音通信的一个有吸引力的选择。本文提出了多速率iLBC及其可扩展结构的性能改进方案，并基于主观聆听质量而不是客观聆听质量对所提出的编解码器进行了重新设计。该算法采用感知加权和加权信号域短重叠的改进离散余弦变换(MDCT)以及改进的丢包隐藏(PLC)算法。主观评价结果表明，无论在干净信道条件下还是在有损信道条件下，所提编解码器的语音质量都与目前最先进的G.718编解码器相当。考虑到所提出的编解码器的开发仍处于早期阶段，这一结果意义重大。

{"title":"Scalable Speech Coding for IP Networks: Beyond iLBC","authors":"Koji Seto, T. Ogunfunmi","doi":"10.1109/TASL.2013.2274694","DOIUrl":"https://doi.org/10.1109/TASL.2013.2274694","url":null,"abstract":"High quality speech at low bit rates makes code excited linear prediction (CELP) the dominant choice for a narrowband coding technique despite the susceptibility to packet loss. One of the few techniques which received attention after the introduction of CELP coding technique is the internet low bitrate codec (iLBC) because of inherent high robustness to packet loss. Addition of rate flexibility and scalability makes the iLBC an attractive choice for voice communication over IP networks. In this paper, performance improvement schemes of multi-rate iLBC and its scalable structure are proposed, and the proposed codec enhanced from the previous work is re-designed based on the subjective listening quality instead of the objective quality. In particular, perceptual weighting and the modified discrete cosine transform (MDCT) with short overlap in weighted signal domain are employed along with the improved packet loss concealment (PLC) algorithm. The subjective evaluation results show that the speech quality of the proposed codec is equivalent to that of state-of-the-art codec, G.718, under both a clean channel condition and lossy channel conditions. This result is significant considering that development of the proposed codec is still in early stage.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"2337-2345"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2274694","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62891744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

IEEE Transactions on Audio Speech and Language Processing

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀