Deep Reinforcement Learning-based Rate Adaptation for Adaptive 360-Degree Video Streaming
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683779 | pp. 4030-4034
Nuowen Kan, Junni Zou, Kexin Tang, Chenglin Li, Ning Liu, H. Xiong
In this paper, we propose a deep reinforcement learning (DRL)-based rate adaptation algorithm for adaptive 360-degree video streaming, which maximizes viewers' quality of experience (QoE) by adapting the transmitted video quality to time-varying network conditions. Specifically, to reduce the possible switching latency of the field of view (FoV), we design a new QoE metric that introduces a penalty term for large buffer occupancy. A scalable FoV method is further proposed to alleviate the combinatorial explosion of the action space in the DRL formulation. We then model the rate adaptation logic as a Markov decision process and employ the DRL-based algorithm to dynamically learn the optimal video transmission rate. Simulation results show that the proposed algorithm outperforms existing algorithms.
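To make the buffer-penalized QoE term concrete, here is a minimal sketch of a per-chunk reward of the kind such a DRL agent could optimize; the weights, the buffer threshold, and the exact functional form are illustrative assumptions, not the paper's definition.

```python
# Hypothetical per-chunk QoE reward with a buffer-occupancy penalty, in the
# spirit of the paper's metric; all weights and the threshold are assumptions.

def qoe_reward(quality, prev_quality, rebuffer_s, buffer_s,
               buffer_limit_s=4.0, w_switch=1.0, w_rebuf=4.0, w_buffer=0.5):
    """Return a scalar reward for one downloaded chunk.

    quality        perceived quality of the current chunk (e.g., bitrate utility)
    prev_quality   quality of the previous chunk (smoothness term)
    rebuffer_s     stalling time incurred by this chunk, in seconds
    buffer_s       buffer occupancy after the download, in seconds
    """
    smoothness_penalty = w_switch * abs(quality - prev_quality)
    rebuffer_penalty = w_rebuf * rebuffer_s
    # Penalize large buffers: chunks pre-fetched far ahead may miss the viewer's
    # future field of view, so holding too much buffered video is risky.
    buffer_penalty = w_buffer * max(0.0, buffer_s - buffer_limit_s)
    return quality - smoothness_penalty - rebuffer_penalty - buffer_penalty
```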
{"title":"Deep Reinforcement Learning-based Rate Adaptation for Adaptive 360-Degree Video Streaming","authors":"Nuowen Kan, Junni Zou, Kexin Tang, Chenglin Li, Ning Liu, H. Xiong","doi":"10.1109/ICASSP.2019.8683779","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683779","url":null,"abstract":"In this paper, we propose a deep reinforcement learning (DRL)-based rate adaptation algorithm for adaptive 360-degree video streaming, which is able to maximize the quality of experience of viewers by adapting the transmitted video quality to the time-varying network conditions. Specifically, to reduce the possible switching latency of the field of view (FoV), we design a new QoE metric by introducing a penalty term for the large buffer occupancy. A scalable FoV method is further proposed to alleviate the combinatorial explosion of the action space in the DRL formulation. Then, we model the rate adaptation logic as a Markov decision process and employ the DRL-based algorithm to dynamically learn the optimal video transmission rate. Simulation results show the superior performance of the proposed algorithm compared to the existing algorithms.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"75 1","pages":"4030-4034"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83823437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DNN-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683016 | pp. 7025-7029
Yang Ai, Jing-Xuan Zhang, Liang Chen, Zhenhua Ling
This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, the method builds a multiple-target DNN that predicts the log amplitude spectra of natural high-bit waveforms together with the amplitude ratios between natural and distorted spectra. The log amplitude spectra of waveforms reconstructed by low-bit neural waveform generators are adopted as model input. At the generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy and are further combined with the phase spectra of the low-bit waveforms to produce the final waveforms via inverse STFT. In our experiments on WaveRNN vocoders, an 8-bit WaveRNN with spectral enhancement outperforms a 16-bit counterpart with the same model complexity in terms of the quality of the reconstructed waveforms. Moreover, the proposed spectral enhancement method also helps an 8-bit WaveRNN with reduced model complexity achieve subjective performance similar to that of a conventional 16-bit WaveRNN.
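The reconstruction step described above (enhanced amplitudes combined with low-bit phases, then inverted with an STFT) can be sketched as follows; the DNN and the ensemble decoding are omitted, and the FFT and hop sizes are assumed values rather than those used in the paper.

```python
# Minimal sketch of the final reconstruction: an enhanced amplitude spectrum is
# recombined with the phase of the low-bit waveform and inverted with ISTFT.
import numpy as np
import librosa

def reconstruct(lowbit_wave, enhanced_log_mag, n_fft=1024, hop=256):
    # Phase comes from the (distorted) waveform produced by the low-bit generator.
    spec_lowbit = librosa.stft(lowbit_wave, n_fft=n_fft, hop_length=hop)
    phase = np.angle(spec_lowbit)
    # Enhanced amplitudes predicted by the multi-target DNN (log domain -> linear).
    mag = np.exp(enhanced_log_mag)
    return librosa.istft(mag * np.exp(1j * phase), hop_length=hop)
```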
{"title":"Dnn-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization","authors":"Yang Ai, Jing-Xuan Zhang, Liang Chen, Zhenhua Ling","doi":"10.1109/ICASSP.2019.8683016","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683016","url":null,"abstract":"This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At training stage, this method builds a multiple-target DNN, which predicts log amplitude spectra of natural high-bit waveforms together with the amplitude ratios between natural and distorted spectra. Log amplitude spectra of the waveforms reconstructed by low-bit neural waveform generators are adopted as model input. At generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy, and are further combined with the phase spectra of low-bit waveforms to produce the final waveforms by inverse STFT. In our experiments on WaveRNN vocoders, an 8-bit WaveRNN with spectral enhancement outperforms a 16-bit counterpart with the same model complexity in terms of the quality of reconstructed waveforms. Besides, the proposed spectral enhancement method can also help an 8-bit WaveRNN with reduced model complexity to achieve similar subjective performance with a conventional 16-bit WaveRNN.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"7025-7029"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80747101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speaker Recognition for Multi-speaker Conversations Using X-vectors
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683760 | pp. 5796-5800
David Snyder, D. Garcia-Romero, Gregory Sell, A. McCree, Daniel Povey, S. Khudanpur
Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces the error rate when multiple speakers are present, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts and achieves similar results to those obtained using a well-tuned threshold.
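As a rough illustration of scoring a multi-speaker recording after diarization, the sketch below averages segment x-vectors per diarized cluster and takes the best enrollment-to-cluster similarity. Cosine scoring is used only to keep the example small (the paper's system scores with PLDA), and the function name is hypothetical.

```python
# Hedged sketch: verify an enrolled speaker against a diarized multi-speaker
# recording by scoring the enrollment x-vector against each cluster centroid.
import numpy as np

def multi_speaker_score(enroll_xvec, segment_xvecs, cluster_labels):
    """segment_xvecs: (n_segments, dim) array; cluster_labels: per-segment speaker cluster ids."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    labels = np.asarray(cluster_labels)
    scores = []
    for lab in set(cluster_labels):
        centroid = segment_xvecs[labels == lab].mean(axis=0)
        scores.append(cosine(enroll_xvec, centroid))
    # The trial score is the similarity to the closest diarized speaker.
    return max(scores)
```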
{"title":"Speaker Recognition for Multi-speaker Conversations Using X-vectors","authors":"David Snyder, D. Garcia-Romero, Gregory Sell, A. McCree, Daniel Povey, S. Khudanpur","doi":"10.1109/ICASSP.2019.8683760","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683760","url":null,"abstract":"Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"4 1","pages":"5796-5800"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81538209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682574 | pp. 436-440
Pasi Pertilä, Mikko Parviainen
The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise, so the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied to Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter non-speech components out of a speaker location likelihood function. However, the appropriate type of TF mask for this task is not obvious; moreover, the DNN should estimate the TDoA values themselves, whereas existing solutions estimate only the TF mask. To overcome these issues, we propose a direct formulation of TF masking as part of a DNN-based ASL structure. The proposed network operates in an online manner, producing estimates frame by frame, and, combined with recurrent layers, exploits the sequential progression of speaker-related TDoAs. Training with different microphone spacings allows the model to be reused for different microphone pair geometries during inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.
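A generic way to see how a speech TF mask helps TDoA estimation is to weight a GCC-PHAT cross-spectrum by the mask before searching for the peak lag. The sketch below shows that formulation; it is not the paper's network-integrated layer, but it captures the masking idea.

```python
# Mask-weighted GCC-PHAT TDoA sketch: speech-dominated TF bins contribute to the
# cross-correlation, interference-dominated bins are suppressed by the mask.
import numpy as np

def masked_gcc_phat_tdoa(X1, X2, mask, fs, max_tau=None):
    """X1, X2: complex STFTs (freq x frames) of the two mics; mask: values in [0, 1]."""
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cross *= mask                              # keep speech-dominated TF bins
    cc = np.fft.irfft(cross.mean(axis=1))      # average frames, back to lag domain
    n = len(cc)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2]))   # reorder to negative..positive lags
    lags = np.arange(-(n // 2), n // 2)
    if max_tau is not None:
        keep = np.abs(lags / fs) <= max_tau
        cc, lags = cc[keep], lags[keep]
    return lags[np.argmax(cc)] / fs            # TDoA estimate in seconds
```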
{"title":"Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking","authors":"Pasi Pertilä, Mikko Parviainen","doi":"10.1109/ICASSP.2019.8682574","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682574","url":null,"abstract":"The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network’s generalization capability.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"18 1","pages":"436-440"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82498885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Ensemble of Deep Recurrent Neural Networks for P-wave Detection in Electrocardiogram
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682307 | pp. 1284-1288
A. Peimankar, S. Puthusserypady
Detection of P-waves in electrocardiogram (ECG) signals is of great importance to cardiologists, as it helps them diagnose arrhythmias such as atrial fibrillation. This paper proposes an end-to-end deep learning approach for the detection of P-waves in ECG signals. Four different deep Recurrent Neural Networks (RNNs), namely Long Short-Term Memory (LSTM) networks, are used in an ensemble framework. Each of these networks is trained to extract useful features from raw ECG signals and determine the absence or presence of P-waves. The outputs of these classifiers are then combined for the final detection of P-waves. The proposed algorithm was trained and validated on a database of more than 111,000 annotated heartbeats, and the results show consistently high classification accuracy and sensitivity of around 98.48% and 97.22%, respectively.
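The fusion step can be illustrated with a short sketch that averages the per-beat P-wave posteriors of the individual recurrent models and thresholds the result; the averaging rule and the threshold are assumptions, since the paper's exact combination scheme may differ.

```python
# Illustrative ensemble fusion: average the four models' per-beat posteriors and
# threshold to decide P-wave presence. Assumed fusion rule, not the paper's.
import numpy as np

def ensemble_p_wave_decision(posteriors, threshold=0.5):
    """posteriors: array of shape (n_models, n_beats) with P(P-wave present)."""
    fused = np.mean(posteriors, axis=0)   # combine the individual classifiers
    return fused >= threshold             # boolean presence decision per beat
```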
{"title":"An Ensemble of Deep Recurrent Neural Networks for P-wave Detection in Electrocardiogram","authors":"A. Peimankar, S. Puthusserypady","doi":"10.1109/ICASSP.2019.8682307","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682307","url":null,"abstract":"Detection of P-waves in electrocardiogram (ECG) signals is of great importance to cardiologists in order to help them diagnosing arrhythmias such as atrial fibrillation. This paper proposes an end-to-end deep learning approach for detection of P-waves in ECG signals. Four different deep Recurrent Neural Networks (RNNs), namely, the Long-Short Term Memory (LSTM) are used in an ensemble framework. Each of these networks are trained to extract the useful features from raw ECG signals and determine the absence/presence of P-waves. Outputs of these classifiers are then combined for final detection of the P-waves. The proposed algorithm was trained and validated on a database which consists of more than 111000 annotated heart beats and the results show consistently high classification accuracy and sensitivity of around 98.48% and 97.22%, respectively.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"5 1","pages":"1284-1288"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82555306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR
Pub Date: 2019-05-12 | DOI: 10.1109/icassp.2019.8683859 | pp. 7085-7089
Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li
The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. Recent connectionist temporal classification (CTC)-based acoustic models offer more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable units, we also propose hybrid Character-Syllable modeling units obtained by mixing high-frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units significantly outperform well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units are the best choice for CTC-based acoustic modeling of Mandarin speech in our work, since they dramatically reduce substitution errors in the recognition results. In a 20,000-hour Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable units achieves a character error rate (CER) of 7.45%, while the well-trained DFSMN-CE-sMBR system achieves 9.49%.
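A plausible way to build such a hybrid unit inventory is to keep characters whose corpus frequency exceeds a threshold and back the remaining characters off to their syllables, as sketched below; `char_to_syllable` and `min_count` are hypothetical placeholders, not values or tools from the paper.

```python
# Hedged sketch of hybrid Character-Syllable unit construction: high-frequency
# characters stay as character units, everything else becomes a syllable unit.
from collections import Counter

def build_hybrid_units(transcripts, char_to_syllable, min_count=1000):
    """transcripts: iterable of Chinese strings; char_to_syllable: pronunciation lookup."""
    counts = Counter(ch for line in transcripts for ch in line)
    high_freq = {ch for ch, c in counts.items() if c >= min_count}

    def tokenize(line):
        # Keep frequent characters; back off rare characters to their syllables.
        return [ch if ch in high_freq else char_to_syllable.get(ch, ch) for ch in line]

    unit_inventory = sorted({u for line in transcripts for u in tokenize(line)})
    return unit_inventory, tokenize
```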
{"title":"Investigation of Modeling Units for Mandarin Speech Recognition Using Dfsmn-ctc-smbr","authors":"Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li","doi":"10.1109/icassp.2019.8683859","DOIUrl":"https://doi.org/10.1109/icassp.2019.8683859","url":null,"abstract":"The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. The recent connectionist temporal classification (CTC) based acoustic models have more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable, we also propose a hybrid Character-Syllable modeling units by mixing high frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units can significantly outperform the well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units is the best choice for CTC based acoustic modeling for Mandarin speech recognition in our work since it can dramatically reduce substitution errors in recognition results. In a 20,000 hours Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable achieves a character error rate (CER) of 7.45% while performance of the well-trained DFSMN-CE-sMBR system is 9.49%.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"65 1","pages":"7085-7089"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86494193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cascaded Point Network for 3D Hand Pose Estimation
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683356 | pp. 1982-1986
Yikun Dou, Xuguang Wang, Yuying Zhu, Xiaoming Deng, Cuixia Ma, Liang Chang, Hongan Wang
Recent PointNet-family hand pose methods offer high pose estimation performance and small model size, and obtaining effective sample points is a key problem for PointNet-family methods. In this paper, we propose a two-stage coarse-to-fine hand pose estimation method that belongs to the PointNet family and explores a new sample point strategy. In the first stage, we use the 3D coordinates and surface normals of the normalized point cloud as input to regress coarse hand joints. In the second stage, we use the hand joints from the first stage as the initial sample points to refine the hand joints. Experiments on widely used datasets demonstrate that using joints as sample points is more effective, and our method achieves top-rank performance.
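The "joints as sample points" idea in the second stage can be illustrated by gathering, for each coarse joint predicted in stage one, its k nearest cloud points as the local region passed to the refinement network. The sketch below shows only that grouping step, with `k` as an assumed hyper-parameter and the networks themselves omitted.

```python
# Sketch of stage-two sampling: group local neighborhoods of the point cloud
# around the coarse joints so the refinement network can focus on each joint.
import numpy as np

def group_points_around_joints(points, coarse_joints, k=64):
    """points: (N, 3) hand point cloud; coarse_joints: (J, 3) stage-one joint estimates.
    Returns (J, k, 3) neighborhoods expressed relative to each coarse joint."""
    groups = []
    for joint in coarse_joints:
        d = np.linalg.norm(points - joint, axis=1)
        nearest = points[np.argsort(d)[:k]]
        groups.append(nearest - joint)   # center the neighborhood on the joint
    return np.stack(groups)
```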
{"title":"Cascaded Point Network for 3D Hand Pose Estimation*","authors":"Yikun Dou, Xuguang Wang, Yuying Zhu, Xiaoming Deng, Cuixia Ma, Liang Chang, Hongan Wang","doi":"10.1109/ICASSP.2019.8683356","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683356","url":null,"abstract":"Recent PointNet-family hand pose methods have the advantages of high pose estimation performance and small model size, and it is a key problem to get effective sample points for PointNet-family methods. In this paper, we propose a two-stage coarse to fine hand pose estimation method, which belongs to PointNet-family methods and explores a new sample point strategy. In the first stage, we use 3D coordinate and surface normal of normalized point cloud as input to regress coarse hand joints. In the second stage, we use the hand joints in the first stage as the initial sample points to refine the hand joints. Experiments on widely used datasets demonstrate that using joints as sample points is more effective and our method achieves top-rank performance.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"33 1","pages":"1982-1986"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83660401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Semantic-preserving Space Using User Profile and Multimodal Media Content from Political Social Network
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8682596 | pp. 3990-3994
Wei-Hao Chang, Jeng-Lin Li, Chi-Chun Lee
The use of social media in politics has dramatically changed the way campaigns are run and how elected officials interact with their constituents. Advanced algorithms are required to analyze and understand this large amount of heterogeneous social media data in order to investigate key issues in political science, such as stance and strategy. Most previous works rely on a text-as-data approach, largely ignoring the rich yet heterogeneous information in user profiles, social relationships, and multimodal media content. In this work, we propose a two-branch network that jointly maps post contents and politician profiles into the same latent space, trained with a large-margin objective that combines a cross-instance distance constraint with a within-instance semantic-preserving constraint. The proposed political embedding space can be used not only to reliably identify political spectrum and message type but also to provide an interpretable, easy-to-visualize political representation space.
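One common instantiation of a cross-instance large-margin objective for a two-branch network is a bidirectional hinge loss over in-batch mismatched pairs, sketched below. The margin value and the use of cosine similarity are assumptions, and the within-instance semantic-preserving term is omitted; this is not the paper's exact loss.

```python
# Hedged sketch: matched (post, profile) pairs should score higher than every
# mismatched in-batch pair by a margin, in both retrieval directions.
import torch
import torch.nn.functional as F

def cross_instance_margin_loss(post_emb, profile_emb, margin=0.2):
    """post_emb, profile_emb: (B, D) L2-normalized embeddings of matched pairs."""
    sim = post_emb @ profile_emb.t()                 # (B, B) similarity matrix
    pos = sim.diag()                                 # matched-pair scores
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge over mismatched pairs: post -> wrong profile, and profile -> wrong post.
    cost_p = F.relu(margin - pos.unsqueeze(1) + sim)[off_diag].mean()
    cost_q = F.relu(margin - pos.unsqueeze(0) + sim)[off_diag].mean()
    return cost_p + cost_q
```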
{"title":"Learning Semantic-preserving Space Using User Profile and Multimodal Media Content from Political Social Network","authors":"Wei-Hao Chang, Jeng-Lin Li, Chi-Chun Lee","doi":"10.1109/ICASSP.2019.8682596","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8682596","url":null,"abstract":"The use of social media in politics has dramatically changed the way campaigns are run and how elected officials interact with their constituents. An advanced algorithm is required to analyze and understand this large amount of heterogeneous social media data to investigate several key issues, such as stance and strategy, in political science. Most of previous works concentrate their studies using text-as-data approach, where the rich yet heterogeneous information in the user profile, social relationship, and multimodal media content is largely ignored. In this work, we propose a two-branch network that jointly maps the post contents and politician profile into the same latent space, which is trained using a large-margin objective that combines a cross-instance distance constraint with a within-instance semantic-preserving constraint. Our proposed political embedding space can be utilized not only in reliably identifying political spectrum and message type but also in providing a political representation space for interpretable ease-of-visualization.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"42 1","pages":"3990-3994"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85860122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683240 | pp. 6700-6704
Oxana Verkholyak, D. Fedotov, Heysem Kaya, Yang Zhang, Alexey Karpov
Emotions occur in complex social interactions, and thus processing isolated utterances may not be sufficient to grasp the nature of the underlying emotional states. Dialog speech provides useful contextual information that explains nuances of emotions and their transitions. Context can be defined on different levels; this paper proposes a hierarchical context modelling approach based on an RNN-LSTM architecture, which models acoustic context at the frame level and the partner's emotional context at the dialog level. The method proves effective, together with a cross-corpus training setup and a domain adaptation technique, in a set of speaker-independent cross-validation experiments on the IEMOCAP corpus for three-level activation and valence classification. As a result, the state of the art on this corpus is advanced for both dimensions using only the acoustic modality.
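A minimal two-level sketch in this spirit is shown below: a frame-level LSTM summarizes each utterance, and a dialog-level LSTM consumes that summary together with the partner's previous emotion label. Layer sizes and the exact way partner context is injected are assumptions, not the paper's configuration.

```python
# Hedged two-level sketch: frame-level acoustic context -> utterance embedding,
# dialog-level LSTM over utterance embeddings plus the partner's emotion labels.
import torch
import torch.nn as nn

class HierarchicalEmotionModel(nn.Module):
    def __init__(self, n_feats=40, n_classes=3, hid=64):
        super().__init__()
        self.frame_lstm = nn.LSTM(n_feats, hid, batch_first=True)
        self.dialog_lstm = nn.LSTM(hid + n_classes, hid, batch_first=True)
        self.out = nn.Linear(hid, n_classes)

    def forward(self, frames, partner_labels):
        # frames: (batch, n_utts, n_frames, n_feats); partner_labels: (batch, n_utts, n_classes)
        b, u, t, f = frames.shape
        _, (h, _) = self.frame_lstm(frames.reshape(b * u, t, f))
        utt_emb = h[-1].reshape(b, u, -1)                 # per-utterance acoustic summary
        dialog_in = torch.cat([utt_emb, partner_labels], dim=-1)
        dialog_out, _ = self.dialog_lstm(dialog_in)       # dialog-level (partner) context
        return self.out(dialog_out)                       # per-utterance class logits
```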
{"title":"Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems","authors":"Oxana Verkholyak, D. Fedotov, Heysem Kaya, Yang Zhang, Alexey Karpov","doi":"10.1109/ICASSP.2019.8683240","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683240","url":null,"abstract":"Emotions occur in complex social interactions, and thus processing of isolated utterances may not be sufficient to grasp the nature of underlying emotional states. Dialog speech provides useful information about context that explains nuances of emotions and their transitions. Context can be defined on different levels; this paper proposes a hierarchical context modelling approach based on RNN-LSTM architecture, which models acoustical context on the frame level and partner’s emotional context on the dialog level. The method is proved effective together with cross-corpus training setup and domain adaptation technique in a set of speaker independent cross-validation experiments on IEMOCAP corpus for three levels of activation and valence classification. As a result, the state-of-the-art on this corpus is advanced for both dimensions using only acoustic modality.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"310 1","pages":"6700-6704"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76454143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Metrical Alignment of MIDI Performance Based on a Repetition-aware Online-adapted Grammar
Pub Date: 2019-05-12 | DOI: 10.1109/ICASSP.2019.8683808 | pp. 186-190
Andrew Mcleod, Eita Nakamura, Kazuyoshi Yoshii
This paper presents an improvement on an existing grammar-based method for metrical structure detection and alignment, a task which involves aligning a repeated tree structure with an input stream of musical notes. The previous method achieves state-of-the-art results, but performs poorly when it lacks training data. Annotated data of the kind it requires is not widely available, which makes this drawback of the method significant. We present a novel online learning technique to improve the grammar’s performance on unseen rhythmic patterns using a dynamically learned piece-specific grammar. The piece-specific grammar can measure the musical well-formedness of the underlying alignment without requiring any training data. It instead relies on musical repetition and self-similarity, enabling the model to recognize repeated rhythmic patterns even when a similar pattern was never seen in the training data. Using it, we see improved performance on a corpus containing only Bach compositions, as well as on a second corpus containing works from a variety of composers, indicating that the online-learned grammar helps the model generalize to unseen rhythms and styles.
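The repetition-aware component can be loosely sketched as a piece-specific count model that is updated online as measures are aligned, so previously seen rhythmic patterns become more probable; the add-alpha smoothing and the scoring form below are assumptions rather than the paper's grammar formalism.

```python
# Loose sketch of a piece-specific repetition model: patterns already observed in
# this piece score higher, even if the trained grammar never saw them.
import math
from collections import Counter

class PieceSpecificModel:
    def __init__(self, alpha=1.0):
        self.counts = Counter()
        self.total = 0
        self.alpha = alpha                    # add-alpha smoothing constant (assumed)

    def score(self, pattern, vocab_size):
        """Log-probability of a (hashable) rhythmic pattern given the counts so far."""
        p = (self.counts[pattern] + self.alpha) / (self.total + self.alpha * vocab_size)
        return math.log(p)

    def update(self, pattern):
        # Called online after each measure is aligned, so later repetitions score higher.
        self.counts[pattern] += 1
        self.total += 1
```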
{"title":"Improved Metrical Alignment of Midi Performance Based on a Repetition-aware Online-adapted Grammar","authors":"Andrew Mcleod, Eita Nakamura, Kazuyoshi Yoshii","doi":"10.1109/ICASSP.2019.8683808","DOIUrl":"https://doi.org/10.1109/ICASSP.2019.8683808","url":null,"abstract":"This paper presents an improvement on an existing grammar-based method for metrical structure detection and alignment, a task which involves aligning a repeated tree structure with an input stream of musical notes. The previous method achieves state-of-the-art results, but performs poorly when it lacks training data. Data annotated as it requires is not widely available, making this drawback of the method significant. We present a novel online learning technique to improve the grammar’s performance on unseen rhythmic patterns using a dynamically learned piece-specific grammar. The piece-specific grammar can measure the musical well-formedness of the underlying alignment without requiring any training data. It instead relies on musical repetition and self-similarity, enabling the model to recognize repeated rhythmic patterns, even when a similar pattern was never seen in the training data. Using it, we see improved performance on a corpus containing only Bach compositions, as well as a second corpus containing works from a variety of composers, indicating that the online-learned grammar helps the model generalize to unseen rhythms and styles.","PeriodicalId":13203,"journal":{"name":"ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"46 1","pages":"186-190"},"PeriodicalIF":0.0,"publicationDate":"2019-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81064727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}