Deep Reinforcement Learning-based Rate Adaptation for Adaptive 360-Degree Video Streaming
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683779 | pp. 4030-4034
Nuowen Kan, Junni Zou, Kexin Tang, Chenglin Li, Ning Liu, H. Xiong
In this paper, we propose a deep reinforcement learning (DRL)-based rate adaptation algorithm for adaptive 360-degree video streaming, which maximizes viewers' quality of experience (QoE) by adapting the transmitted video quality to time-varying network conditions. Specifically, to reduce the possible switching latency of the field of view (FoV), we design a new QoE metric that introduces a penalty term for large buffer occupancy. A scalable FoV method is further proposed to alleviate the combinatorial explosion of the action space in the DRL formulation. We then model the rate adaptation logic as a Markov decision process and employ the DRL-based algorithm to dynamically learn the optimal video transmission rate. Simulation results show that the proposed algorithm outperforms existing algorithms.
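To make the buffer-penalized QoE term concrete, here is a minimal sketch of a per-chunk reward of the kind such a DRL agent could optimize; the weights, the buffer threshold, and the exact functional form are illustrative assumptions, not the paper's definition.

```python
# Hypothetical per-chunk QoE reward with a buffer-occupancy penalty, in the
# spirit of the paper's metric; all weights and the threshold are assumptions.

def qoe_reward(quality, prev_quality, rebuffer_s, buffer_s,
               buffer_limit_s=4.0, w_switch=1.0, w_rebuf=4.0, w_buffer=0.5):
    """Return a scalar reward for one downloaded chunk.

    quality        perceived quality of the current chunk (e.g., bitrate utility)
    prev_quality   quality of the previous chunk (smoothness term)
    rebuffer_s     stalling time incurred by this chunk, in seconds
    buffer_s       buffer occupancy after the download, in seconds
    """
    smoothness_penalty = w_switch * abs(quality - prev_quality)
    rebuffer_penalty = w_rebuf * rebuffer_s
    # Penalize large buffers: chunks pre-fetched far ahead may miss the viewer's
    # future field of view, so holding too much buffered video is risky.
    buffer_penalty = w_buffer * max(0.0, buffer_s - buffer_limit_s)
    return quality - smoothness_penalty - rebuffer_penalty - buffer_penalty
```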
{"title":"Deep Reinforcement Learning-based Rate Adaptation for Adaptive 360-Degree Video Streaming","authors":"Nuowen Kan, Junni Zou, Kexin Tang, Chenglin Li, Ning Liu, H. Xiong","doi":"10.1109/ICASSP.2019.8683779","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683779","url":null,"abstract":"In this paper, we propose a deep reinforcement learning (DRL)-based rate adaptation algorithm for adaptive 360-degree video streaming, which is able to maximize the quality of experience of viewers by adapting the transmitted video quality to the time-varying network conditions. Specifically, to reduce the possible switching latency of the field of view (FoV), we design a new QoE metric by introducing a penalty term for the large buffer occupancy. A scalable FoV method is further proposed to alleviate the combinatorial explosion of the action space in the DRL formulation. Then, we model the rate adaptation logic as a Markov decision process and employ the DRL-based algorithm to dynamically learn the optimal video transmission rate. Simulation results show the superior performance of the proposed algorithm compared to the existing algorithms.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"75 1","pages":"4030-4034"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83823437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNN-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683016 | pp. 7025-7029
Yang Ai, Jing-Xuan Zhang, Liang Chen, Zhenhua Ling
This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, the method builds a multiple-target DNN that predicts the log amplitude spectra of natural high-bit waveforms together with the amplitude ratios between natural and distorted spectra. The log amplitude spectra of waveforms reconstructed by low-bit neural waveform generators are adopted as model input. At the generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy and are further combined with the phase spectra of the low-bit waveforms to produce the final waveforms via inverse STFT. In our experiments on WaveRNN vocoders, an 8-bit WaveRNN with spectral enhancement outperforms a 16-bit counterpart with the same model complexity in terms of the quality of the reconstructed waveforms. Moreover, the proposed spectral enhancement method also helps an 8-bit WaveRNN with reduced model complexity achieve subjective performance similar to that of a conventional 16-bit WaveRNN.
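The reconstruction step described above (enhanced amplitudes combined with low-bit phases, then inverted with an STFT) can be sketched as follows; the DNN and the ensemble decoding are omitted, and the FFT and hop sizes are assumed values rather than those used in the paper.

```python
# Minimal sketch of the final reconstruction: an enhanced amplitude spectrum is
# recombined with the phase of the low-bit waveform and inverted with ISTFT.
import numpy as np
import librosa

def reconstruct(lowbit_wave, enhanced_log_mag, n_fft=1024, hop=256):
    # Phase comes from the (distorted) waveform produced by the low-bit generator.
    spec_lowbit = librosa.stft(lowbit_wave, n_fft=n_fft, hop_length=hop)
    phase = np.angle(spec_lowbit)
    # Enhanced amplitudes predicted by the multi-target DNN (log domain -> linear).
    mag = np.exp(enhanced_log_mag)
    return librosa.istft(mag * np.exp(1j * phase), hop_length=hop)
```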
{"title":"Dnn-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization","authors":"Yang Ai, Jing-Xuan Zhang, Liang Chen, Zhenhua Ling","doi":"10.1109/ICASSP.2019.8683016","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683016","url":null,"abstract":"This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At training stage, this method builds a multiple-target DNN, which predicts log amplitude spectra of natural high-bit waveforms together with the amplitude ratios between natural and distorted spectra. Log amplitude spectra of the waveforms reconstructed by low-bit neural waveform generators are adopted as model input. At generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy, and are further combined with the phase spectra of low-bit waveforms to produce the final waveforms by inverse STFT. In our experiments on WaveRNN vocoders, an 8-bit WaveRNN with spectral enhancement outperforms a 16-bit counterpart with the same model complexity in terms of the quality of reconstructed waveforms. Besides, the proposed spectral enhancement method can also help an 8-bit WaveRNN with reduced model complexity to achieve similar subjective performance with a conventional 16-bit WaveRNN.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"7025-7029"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80747101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker Recognition for Multi-speaker Conversations Using X-vectors
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683760 | pp. 5796-5800
David Snyder, D. Garcia-Romero, Gregory Sell, A. McCree, Daniel Povey, S. Khudanpur
Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces the error rate when multiple speakers are present, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts and achieves similar results to those obtained using a well-tuned threshold.
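As a rough illustration of scoring a multi-speaker recording after diarization, the sketch below averages segment x-vectors per diarized cluster and takes the best enrollment-to-cluster similarity. Cosine scoring is used only to keep the example small (the paper's system scores with PLDA), and the function name is hypothetical.

```python
# Hedged sketch: verify an enrolled speaker against a diarized multi-speaker
# recording by scoring the enrollment x-vector against each cluster centroid.
import numpy as np

def multi_speaker_score(enroll_xvec, segment_xvecs, cluster_labels):
    """segment_xvecs: (n_segments, dim) array; cluster_labels: per-segment speaker cluster ids."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    labels = np.asarray(cluster_labels)
    scores = []
    for lab in set(cluster_labels):
        centroid = segment_xvecs[labels == lab].mean(axis=0)
        scores.append(cosine(enroll_xvec, centroid))
    # The trial score is the similarity to the closest diarized speaker.
    return max(scores)
```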
{"title":"Speaker Recognition for Multi-speaker Conversations Using X-vectors","authors":"David Snyder, D. Garcia-Romero, Gregory Sell, A. McCree, Daniel Povey, S. Khudanpur","doi":"10.1109/ICASSP.2019.8683760","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683760","url":null,"abstract":"Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"4 1","pages":"5796-5800"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81538209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682574 | pp. 436-440
Pasi Pertilä, Mikko Parviainen
The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise, so the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied to Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter non-speech components out of a speaker location likelihood function. However, the appropriate type of TF mask for this task is not obvious; moreover, the DNN should estimate the TDoA values themselves, whereas existing solutions estimate only the TF mask. To overcome these issues, we propose a direct formulation of TF masking as part of a DNN-based ASL structure. The proposed network operates in an online manner, producing estimates frame by frame, and, combined with recurrent layers, exploits the sequential progression of speaker-related TDoAs. Training with different microphone spacings allows the model to be reused for different microphone pair geometries during inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.
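A generic way to see how a speech TF mask helps TDoA estimation is to weight a GCC-PHAT cross-spectrum by the mask before searching for the peak lag. The sketch below shows that formulation; it is not the paper's network-integrated layer, but it captures the masking idea.

```python
# Mask-weighted GCC-PHAT TDoA sketch: speech-dominated TF bins contribute to the
# cross-correlation, interference-dominated bins are suppressed by the mask.
import numpy as np

def masked_gcc_phat_tdoa(X1, X2, mask, fs, max_tau=None):
    """X1, X2: complex STFTs (freq x frames) of the two mics; mask: values in [0, 1]."""
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cross *= mask                              # keep speech-dominated TF bins
    cc = np.fft.irfft(cross.mean(axis=1))      # average frames, back to lag domain
    n = len(cc)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))   # reorder to negative..positive lags
    lags = np.arange(-(n // 2), n // 2)
    if max_tau is not None:
        keep = np.abs(lags / fs) <= max_tau
        cc, lags = cc[keep], lags[keep]
    return lags[np.argmax(cc)] / fs            # TDoA estimate in seconds
```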
{"title":"Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking","authors":"Pasi Pertilä, Mikko Parviainen","doi":"10.1109/ICASSP.2019.8682574","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682574","url":null,"abstract":"The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network’s generalization capability.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"18 1","pages":"436-440"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82498885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Ensemble of Deep Recurrent Neural Networks for P-wave Detection in Electrocardiogram
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682307 | pp. 1284-1288
A. Peimankar, S. Puthusserypady
Detection of P-waves in electrocardiogram (ECG) signals is of great importance to cardiologists, as it helps them diagnose arrhythmias such as atrial fibrillation. This paper proposes an end-to-end deep learning approach for the detection of P-waves in ECG signals. Four different deep Recurrent Neural Networks (RNNs), namely Long Short-Term Memory (LSTM) networks, are used in an ensemble framework. Each of these networks is trained to extract useful features from raw ECG signals and determine the absence or presence of P-waves. The outputs of these classifiers are then combined for the final detection of P-waves. The proposed algorithm was trained and validated on a database of more than 111,000 annotated heartbeats, and the results show consistently high classification accuracy and sensitivity of around 98.48% and 97.22%, respectively.
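The fusion step can be illustrated with a short sketch that averages the per-beat P-wave posteriors of the individual recurrent models and thresholds the result; the averaging rule and the threshold are assumptions, since the paper's exact combination scheme may differ.

```python
# Illustrative ensemble fusion: average the four models' per-beat posteriors and
# threshold to decide P-wave presence. Assumed fusion rule, not the paper's.
import numpy as np

def ensemble_p_wave_decision(posteriors, threshold=0.5):
    """posteriors: array of shape (n_models, n_beats) with P(P-wave present)."""
    fused = np.mean(posteriors, axis=0)   # combine the individual classifiers
    return fused >= threshold             # boolean presence decision per beat
```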
{"title":"An Ensemble of Deep Recurrent Neural Networks for P-wave Detection in Electrocardiogram","authors":"A. Peimankar, S. Puthusserypady","doi":"10.1109/ICASSP.2019.8682307","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682307","url":null,"abstract":"Detection of P-waves in electrocardiogram (ECG) signals is of great importance to cardiologists in order to help them diagnosing arrhythmias such as atrial fibrillation. This paper proposes an end-to-end deep learning approach for detection of P-waves in ECG signals. Four different deep Recurrent Neural Networks (RNNs), namely, the Long-Short Term Memory (LSTM) are used in an ensemble framework. Each of these networks are trained to extract the useful features from raw ECG signals and determine the absence/presence of P-waves. Outputs of these classifiers are then combined for final detection of the P-waves. The proposed algorithm was trained and validated on a database which consists of more than 111000 annotated heart beats and the results show consistently high classification accuracy and sensitivity of around 98.48% and 97.22%, respectively.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"5 1","pages":"1284-1288"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82555306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR
Pub Date: 2019-05-12 | DOI: 10.1109/icassp.2019.8683859 | pp. 7085-7089
Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li
The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. Recent connectionist temporal classification (CTC)-based acoustic models offer more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable units, we also propose hybrid Character-Syllable modeling units obtained by mixing high-frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units significantly outperform well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units are the best choice for CTC-based acoustic modeling of Mandarin speech in our work, since they dramatically reduce substitution errors in the recognition results. In a 20,000-hour Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable units achieves a character error rate (CER) of 7.45%, while the well-trained DFSMN-CE-sMBR system achieves 9.49%.
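A plausible way to build such a hybrid unit inventory is to keep characters whose corpus frequency exceeds a threshold and back the remaining characters off to their syllables, as sketched below; `char_to_syllable` and `min_count` are hypothetical placeholders, not values or tools from the paper.

```python
# Hedged sketch of hybrid Character-Syllable unit construction: high-frequency
# characters stay as character units, everything else becomes a syllable unit.
from collections import Counter

def build_hybrid_units(transcripts, char_to_syllable, min_count=1000):
    """transcripts: iterable of Chinese strings; char_to_syllable: pronunciation lookup."""
    counts = Counter(ch for line in transcripts for ch in line)
    high_freq = {ch for ch, c in counts.items() if c >= min_count}

    def tokenize(line):
        # Keep frequent characters; back off rare characters to their syllables.
        return [ch if ch in high_freq else char_to_syllable.get(ch, ch) for ch in line]

    unit_inventory = sorted({u for line in transcripts for u in tokenize(line)})
    return unit_inventory, tokenize
```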
{"title":"Investigation of Modeling Units for Mandarin Speech Recognition Using Dfsmn-ctc-smbr","authors":"Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li","doi":"10.1109/icassp.2019.8683859","DOIUrl":"https://doi.org/10.1109/icassp.2019.8683859","url":null,"abstract":"The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. The recent connectionist temporal classification (CTC) based acoustic models have more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable, we also propose a hybrid Character-Syllable modeling units by mixing high frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units can significantly outperform the well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units is the best choice for CTC based acoustic modeling for Mandarin speech recognition in our work since it can dramatically reduce substitution errors in recognition results. In a 20,000 hours Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable achieves a character error rate (CER) of 7.45% while performance of the well-trained DFSMN-CE-sMBR system is 9.49%.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"65 1","pages":"7085-7089"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86494193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cascaded Point Network for 3D Hand Pose Estimation
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683356 | pp. 1982-1986
Yikun Dou, Xuguang Wang, Yuying Zhu, Xiaoming Deng, Cuixia Ma, Liang Chang, Hongan Wang
Recent PointNet-family hand pose methods offer high pose estimation performance and small model size, and obtaining effective sample points is a key problem for PointNet-family methods. In this paper, we propose a two-stage coarse-to-fine hand pose estimation method that belongs to the PointNet family and explores a new sample point strategy. In the first stage, we use the 3D coordinates and surface normals of the normalized point cloud as input to regress coarse hand joints. In the second stage, we use the hand joints from the first stage as the initial sample points to refine the hand joints. Experiments on widely used datasets demonstrate that using joints as sample points is more effective, and our method achieves top-rank performance.
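The "joints as sample points" idea in the second stage can be illustrated by gathering, for each coarse joint predicted in stage one, its k nearest cloud points as the local region passed to the refinement network. The sketch below shows only that grouping step, with `k` as an assumed hyper-parameter and the networks themselves omitted.

```python
# Sketch of stage-two sampling: group local neighborhoods of the point cloud
# around the coarse joints so the refinement network can focus on each joint.
import numpy as np

def group_points_around_joints(points, coarse_joints, k=64):
    """points: (N, 3) hand point cloud; coarse_joints: (J, 3) stage-one joint estimates.
    Returns (J, k, 3) neighborhoods expressed relative to each coarse joint."""
    groups = []
    for joint in coarse_joints:
        d = np.linalg.norm(points - joint, axis=1)
        nearest = points[np.argsort(d)[:k]]
        groups.append(nearest - joint)   # center the neighborhood on the joint
    return np.stack(groups)
```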
{"title":"Cascaded Point Network for 3D Hand Pose Estimation*","authors":"Yikun Dou, Xuguang Wang, Yuying Zhu, Xiaoming Deng, Cuixia Ma, Liang Chang, Hongan Wang","doi":"10.1109/ICASSP.2019.8683356","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683356","url":null,"abstract":"Recent PointNet-family hand pose methods have the advantages of high pose estimation performance and small model size, and it is a key problem to get effective sample points for PointNet-family methods. In this paper, we propose a two-stage coarse to fine hand pose estimation method, which belongs to PointNet-family methods and explores a new sample point strategy. In the first stage, we use 3D coordinate and surface normal of normalized point cloud as input to regress coarse hand joints. In the second stage, we use the hand joints in the first stage as the initial sample points to refine the hand joints. Experiments on widely used datasets demonstrate that using joints as sample points is more effective and our method achieves top-rank performance.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"33 1","pages":"1982-1986"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83660401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Semantic-preserving Space Using User Profile and Multimodal Media Content from Political Social Network
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682596 | pp. 3990-3994
Wei-Hao Chang, Jeng-Lin Li, Chi-Chun Lee
The use of social media in politics has dramatically changed the way campaigns are run and how elected officials interact with their constituents. Advanced algorithms are required to analyze and understand this large amount of heterogeneous social media data in order to investigate key issues in political science, such as stance and strategy. Most previous works rely on a text-as-data approach, largely ignoring the rich yet heterogeneous information in user profiles, social relationships, and multimodal media content. In this work, we propose a two-branch network that jointly maps post contents and politician profiles into the same latent space, trained with a large-margin objective that combines a cross-instance distance constraint with a within-instance semantic-preserving constraint. The proposed political embedding space can be used not only to reliably identify political spectrum and message type but also to provide an interpretable, easy-to-visualize political representation space.
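One common instantiation of a cross-instance large-margin objective for a two-branch network is a bidirectional hinge loss over in-batch mismatched pairs, sketched below. The margin value and the use of cosine similarity are assumptions, and the within-instance semantic-preserving term is omitted; this is not the paper's exact loss.

```python
# Hedged sketch: matched (post, profile) pairs should score higher than every
# mismatched in-batch pair by a margin, in both retrieval directions.
import torch
import torch.nn.functional as F

def cross_instance_margin_loss(post_emb, profile_emb, margin=0.2):
    """post_emb, profile_emb: (B, D) L2-normalized embeddings of matched pairs."""
    sim = post_emb @ profile_emb.t()                 # (B, B) similarity matrix
    pos = sim.diag()                                 # matched-pair scores
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge over mismatched pairs: post -> wrong profile, and profile -> wrong post.
    cost_p = F.relu(margin - pos.unsqueeze(1) + sim)[off_diag].mean()
    cost_q = F.relu(margin - pos.unsqueeze(0) + sim)[off_diag].mean()
    return cost_p + cost_q
```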
{"title":"Learning Semantic-preserving Space Using User Profile and Multimodal Media Content from Political Social Network","authors":"Wei-Hao Chang, Jeng-Lin Li, Chi-Chun Lee","doi":"10.1109/ICASSP.2019.8682596","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682596","url":null,"abstract":"The use of social media in politics has dramatically changed the way campaigns are run and how elected officials interact with their constituents. An advanced algorithm is required to analyze and understand this large amount of heterogeneous social media data to investigate several key issues, such as stance and strategy, in political science. Most of previous works concentrate their studies using text-as-data approach, where the rich yet heterogeneous information in the user profile, social relationship, and multimodal media content is largely ignored. In this work, we propose a two-branch network that jointly maps the post contents and politician profile into the same latent space, which is trained using a large-margin objective that combines a cross-instance distance constraint with a within-instance semantic-preserving constraint. Our proposed political embedding space can be utilized not only in reliably identifying political spectrum and message type but also in providing a political representation space for interpretable ease-of-visualization.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"42 1","pages":"3990-3994"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85860122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683240 | pp. 6700-6704
Oxana Verkholyak, D. Fedotov, Heysem Kaya, Yang Zhang, Alexey Karpov
Emotions occur in complex social interactions, and thus processing isolated utterances may not be sufficient to grasp the nature of the underlying emotional states. Dialog speech provides useful contextual information that explains nuances of emotions and their transitions. Context can be defined on different levels; this paper proposes a hierarchical context modelling approach based on an RNN-LSTM architecture, which models acoustic context at the frame level and the partner's emotional context at the dialog level. The method proves effective, together with a cross-corpus training setup and a domain adaptation technique, in a set of speaker-independent cross-validation experiments on the IEMOCAP corpus for three-level activation and valence classification. As a result, the state of the art on this corpus is advanced for both dimensions using only the acoustic modality.
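A minimal two-level sketch in this spirit is shown below: a frame-level LSTM summarizes each utterance, and a dialog-level LSTM consumes that summary together with the partner's previous emotion label. Layer sizes and the exact way partner context is injected are assumptions, not the paper's configuration.

```python
# Hedged two-level sketch: frame-level acoustic context -> utterance embedding,
# dialog-level LSTM over utterance embeddings plus the partner's emotion labels.
import torch
import torch.nn as nn

class HierarchicalEmotionModel(nn.Module):
    def __init__(self, n_feats=40, n_classes=3, hid=64):
        super().__init__()
        self.frame_lstm = nn.LSTM(n_feats, hid, batch_first=True)
        self.dialog_lstm = nn.LSTM(hid + n_classes, hid, batch_first=True)
        self.out = nn.Linear(hid, n_classes)

    def forward(self, frames, partner_labels):
        # frames: (batch, n_utts, n_frames, n_feats); partner_labels: (batch, n_utts, n_classes)
        b, u, t, f = frames.shape
        _, (h, _) = self.frame_lstm(frames.reshape(b * u, t, f))
        utt_emb = h[-1].reshape(b, u, -1)                 # per-utterance acoustic summary
        dialog_in = torch.cat([utt_emb, partner_labels], dim=-1)
        dialog_out, _ = self.dialog_lstm(dialog_in)       # dialog-level (partner) context
        return self.out(dialog_out)                       # per-utterance class logits
```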
{"title":"Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems","authors":"Oxana Verkholyak, D. Fedotov, Heysem Kaya, Yang Zhang, Alexey Karpov","doi":"10.1109/ICASSP.2019.8683240","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683240","url":null,"abstract":"Emotions occur in complex social interactions, and thus processing of isolated utterances may not be sufficient to grasp the nature of underlying emotional states. Dialog speech provides useful information about context that explains nuances of emotions and their transitions. Context can be defined on different levels; this paper proposes a hierarchical context modelling approach based on RNN-LSTM architecture, which models acoustical context on the frame level and partner’s emotional context on the dialog level. The method is proved effective together with cross-corpus training setup and domain adaptation technique in a set of speaker independent cross-validation experiments on IEMOCAP corpus for three levels of activation and valence classification. As a result, the state-of-the-art on this corpus is advanced for both dimensions using only acoustic modality.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"310 1","pages":"6700-6704"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76454143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Metrical Alignment of MIDI Performance Based on a Repetition-aware Online-adapted Grammar
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683808 | pp. 186-190
Andrew Mcleod, Eita Nakamura, Kazuyoshi Yoshii
This paper presents an improvement on an existing grammar-based method for metrical structure detection and alignment, a task which involves aligning a repeated tree structure with an input stream of musical notes. The previous method achieves state-of-the-art results, but performs poorly when it lacks training data. Annotated data of the kind it requires is not widely available, which makes this drawback of the method significant. We present a novel online learning technique to improve the grammar’s performance on unseen rhythmic patterns using a dynamically learned piece-specific grammar. The piece-specific grammar can measure the musical well-formedness of the underlying alignment without requiring any training data. It instead relies on musical repetition and self-similarity, enabling the model to recognize repeated rhythmic patterns even when a similar pattern was never seen in the training data. Using it, we see improved performance on a corpus containing only Bach compositions, as well as on a second corpus containing works from a variety of composers, indicating that the online-learned grammar helps the model generalize to unseen rhythms and styles.
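The repetition-aware component can be loosely sketched as a piece-specific count model that is updated online as measures are aligned, so previously seen rhythmic patterns become more probable; the add-alpha smoothing and the scoring form below are assumptions rather than the paper's grammar formalism.

```python
# Loose sketch of a piece-specific repetition model: patterns already observed in
# this piece score higher, even if the trained grammar never saw them.
import math
from collections import Counter

class PieceSpecificModel:
    def __init__(self, alpha=1.0):
        self.counts = Counter()
        self.total = 0
        self.alpha = alpha                    # add-alpha smoothing constant (assumed)

    def score(self, pattern, vocab_size):
        """Log-probability of a (hashable) rhythmic pattern given the counts so far."""
        p = (self.counts[pattern] + self.alpha) / (self.total + self.alpha * vocab_size)
        return math.log(p)

    def update(self, pattern):
        # Called online after each measure is aligned, so later repetitions score higher.
        self.counts[pattern] += 1
        self.total += 1
```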
{"title":"Improved Metrical Alignment of Midi Performance Based on a Repetition-aware Online-adapted Grammar","authors":"Andrew Mcleod, Eita Nakamura, Kazuyoshi Yoshii","doi":"10.1109/ICASSP.2019.8683808","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683808","url":null,"abstract":"This paper presents an improvement on an existing grammar-based method for metrical structure detection and alignment, a task which involves aligning a repeated tree structure with an input stream of musical notes. The previous method achieves state-of-the-art results, but performs poorly when it lacks training data. Data annotated as it requires is not widely available, making this drawback of the method significant. We present a novel online learning technique to improve the grammar’s performance on unseen rhythmic patterns using a dynamically learned piece-specific grammar. The piece-specific grammar can measure the musical well-formedness of the underlying alignment without requiring any training data. It instead relies on musical repetition and self-similarity, enabling the model to recognize repeated rhythmic patterns, even when a similar pattern was never seen in the training data. Using it, we see improved performance on a corpus containing only Bach compositions, as well as a second corpus containing works from a variety of composers, indicating that the online-learned grammar helps the model generalize to unseen rhythms and styles.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"46 1","pages":"186-190"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81064727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}