Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980228
Kenta Yamada, Yoshiki Masuyama, Yukoh Wakabayashi, Nobutaka Ono
In this paper, we present a short-time frequency estimation method that can handle multiple sinusoids simultaneously. Frequency estimation is a fundamental problem in audio analysis. To achieve high temporal resolution, an approach based on a differential equation of a sinusoid, referred to as the sinusoidal constraint differential equation (SCDE), has been proposed. The SCDE-based method can efficiently and accurately estimate frequency even from a short-term signal. However, in terms of simultaneous estimation, only up to two sinusoids have been considered so far. In this paper, we extend this approach to three or more sinusoids. Our experimental results show that our method outperformed existing methods based on the discrete Fourier transform.
{"title":"Simultaneous Frequency Estimation for Three or More Sinusoids Based on Sinusoidal Constraint Differential Equation","authors":"Kenta Yamada, Yoshiki Masuyama, Yukoh Wakabayashi, Nobutaka Ono","doi":"10.23919/APSIPAASC55919.2022.9980228","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980228","url":null,"abstract":"In this paper, we present a short-time frequency estimation method that can handle multiple sinusoids simultaneously. Frequency estimation is a fundamental problem in audio analysis. For realizing high-temporal resolution, an approach based on a differential equation of a sinusoid, which is referred to as the sinusoidal constraint differential equation (SCDE), has been proposed. The SCDE-based method can efficiently and accurately estimate frequency even from a short-term signal. However, in terms of simultaneous estimation, up to two sinusoids have been considered so far. In this paper, we extend this approach to three or more sinusoids. Our experimental results show that our method outperformed existing methods based on the discrete Fourier transform.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129084214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music Information Retrieval is a crucial task with ample opportunities in the music industry. Currently, audio engineers have to create custom karaoke tracks for songs manually, and the technique of producing a high-quality karaoke track is not accessible to the public: specialised software such as Audacity is needed to generate karaoke. In this work, we review methods and approaches that produce a high-quality karaoke track by simply and quickly separating the vocals from a song containing both vocal and instrumental components, without requiring any specific audio processing software. We review techniques and approaches for generating karaoke such as Spleeter, Hybrid Demucs, D3Net, Open-Unmix, and Sams-Net, which are based on current state-of-the-art machine learning and deep learning techniques. We believe that this review will serve as a good resource for researchers working in this field.
{"title":"Karaoke Generation from songs: recent trends and opportunities","authors":"Preet Patel, Ansh Ray, Khushboo Thakkar, Kahan Sheth, Sapan H. Mankad","doi":"10.23919/APSIPAASC55919.2022.9980133","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980133","url":null,"abstract":"Music Information Retrieval is a crucial task which has ample opportunities in Music Industries. Currently, audio engineers have to create custom karaoke tracks manually for songs. The technique of producing a high-quality karaoke track for a song is not accessible to the public. Audacity and other specialised software must be needed to generate karaoke. In this work, we review different methods and approaches, which give a high-quality karaoke track by presenting a simple and quick separation of vocals from a given song with both vocal and instrumental components. It does not need the use of any specific audio processing software. We review techniques and approaches for generating karaoke such as Spleeter, Hybrid Demucs, D3Net, Open-Unmix, Sams-Net etc. These approaches are based on current state-of-the-art machine learning and deep learning techniques. We believe that this review will serve the purpose as a good resource for researchers working in this field.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134211165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979911
Atikhun Thongpool, D. Hormdee, Raksit Chutipakdeevong, Wasan Tansakul
Nowadays, technology is evolving and transforming rapidly. Many innovative technologies have emerged, including artificial intelligence, biomedical engineering, automation systems, quantum computing, big data, and blockchain. These emerging technologies have also transformed our lifestyles, and this transformation inevitably requires a new set of skills: computational and logical thinking, which can be consolidated into coding skills. Several novel educational media and teaching materials have been promoted. Current educational kits on the market can be classified into three main categories by structure (physical, virtual, or hybrid kits), while coding styles are either block-based or text-based. This paper presents an educational multi-purpose kit for coding and robotic design that combines a hybrid kit structure with a block-based coding style. Its connection scheme is designed as wired/wireless plug-and-play via magnetic connection. The implemented prototype can support various learning activities, including emulating three of the five basic human senses (touch, hearing, and sight) via sensors and actuators. A use case on shape recognition using computer vision is illustrated to show how the implemented system works.
{"title":"Educational Multi-Purpose Kit for Coding and Robotic Design","authors":"Atikhun Thongpool, D. Hormdee, Raksit Chutipakdeevong, Wasan Tansakul","doi":"10.23919/APSIPAASC55919.2022.9979911","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979911","url":null,"abstract":"Nowadays, there has been a rapid evolution and transformation of technology. Many innovative technologies have emerged including artificial intelligence, biomedical engineering, automation systems, quantum computing, big data, blockchain, etc. These emerging technologies have also transformed our lifestyles. This transformation has then inevitably required a new set of skills; Computational Thinking/Logical Thinking which can be compiled into Coding skills. Several novel educational media and teaching materials have been promoted. Current educational kits in the market can be classified into 3 main categories. These structures vary from physical kits vs virtual kits vs hybrid kits, while coding styles vary from block-based vs text-based. This paper presents an educational multi-purpose kit for coding and robotic design which has a hybrid kits structure with block-based coding style. Its connection scheme has been designed as wired/wireless plug-and-play via magnetic. The implemented prototype could be resilient for various learning activities, including emulating three (touch, hearing and sight) out of five basic human senses via sensors and actuators. A use case on shape recognition, using computer vision, has been illustrated to show how the implemented system works.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"492 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132196612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980293
Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu
Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activities have been proven effective to improve the performance of SVS tasks. However, embeddings related to singers have never been studied before. In this paper, we propose VocEmb4SVS, an SVS framework to utilize vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art performance on the MUSDB18 dataset with an SDR of 9.56 dB on vocals.
{"title":"VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings","authors":"Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu","doi":"10.23919/APSIPAASC55919.2022.9980293","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980293","url":null,"abstract":"Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activities have been proven effective to improve the performance of SVS tasks. However, embeddings related to singers have never been studied before. In this paper, we propose VocEmb4SVS, an SVS framework to utilize vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art performance on the MUSDB18 dataset with an SDR of 9.56 dB on vocals.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132818689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979893
Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima
This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitches (e.g., MIDI note numbers and F0s), they can be applied only to pitched sounds. Global timbres unique to instruments and local variations (e.g., expressions and playstyles) are also treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and F0s (if existing) are considered to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations from randomly pitch-shifted input sounds, and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of timbral and variation features from pitch features, input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. The experiments showed that the proposed method can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.
{"title":"Unsupervised Disentanglement of Timbral, Pitch, and Variation Features From Musical Instrument Sounds With Random Perturbation","authors":"Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima","doi":"10.23919/APSIPAASC55919.2022.9979893","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979893","url":null,"abstract":"This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitches (e.g., MIDI note numbers and FOs), they can be applied to only pitched sounds. Global timbres unique to instruments and local variations (e.g., expressions and playstyles) are also treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and FOs (if existing) are considered to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations from randomly pitch-shifted input sounds and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of timbral and variation features from pitch features, input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. The experiments showed that the proposed method can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122507691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980030
Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul
A 4G-signal RSSI recommendation system is one kind of monitoring method: the usage rate of local users is used to improve the quality of the signal in cells that receive increased traffic. This paper proposes a method for predicting the data-rate traffic used within the area at each location. Comparing the performance of the models, the RMSE of Gradient Boost Tree, Decision Tree, and Random Forest was 0.291, 0.316, and 0.346, respectively; the corresponding correlations were 0.976, 0.971, and 0.966; and the accuracies were 97.8%, 97.4%, and 97%. For the ensemble learning method, the RMSE, correlation, and accuracy were 0.312, 0.972, and 97.5%.
{"title":"4G Signal RSSI Recommendation System for ISP Quality of Service Improvement","authors":"Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul","doi":"10.23919/APSIPAASC55919.2022.9980030","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980030","url":null,"abstract":"4G Signal RSSI Recommendation System is one of the monitoring methods. The usage rate of local users improves the quality of traffic signals to cycle to receive increased traffic. This paper proposed a method for Prediction and the traffic of data rates used within the area at each location. The result of the proposed approach comparing the performance of models was: the RMSE Gradient Boost Tree, Decision Tree, and Random Forest were 0.291, 0.316 and 0.346, respectively. The correlation will be 0.976, 0.971, and 0.966 for Gradient Boost Tree, Decision Tree, and Random Forest, respectively, and the accuracy of Gradient Boost Tree, Decision Tree, and Random Forest were 97.8%, 97.4%, and 97%, respectively. The results of ensemble learning methods, the RMSE, correlation, and accuracy were: 0.312, 0.972, and 97.5%.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979964
Youngjin Oh, G. Park, N. Cho
Under-Display Camera (UDC) systems have been developed to remove noticeable camera holes or notches and cover the entire front side with the screen. As the name implies, UDCs are placed under the display, which these days is generally an organic light-emitting diode (OLED) panel. Since the OLED panel is not transparent and consists of circuits and display devices, the light reaching the camera suffers a loss of photons and a complicated point spread function (PSF). As a result, images obtained through a UDC system usually exhibit a color shift, decreased intensity, complex artifacts due to the PSF, and loss or distortion of high-frequency details. To overcome these degradations, we exploit a multi-stage image restoration network and a frequency loss function. The network utilizes deformable convolutions to handle the spatially-variant degradations in UDC images, based on the fact that the kernel of a deformable convolution is dynamic and adaptive to the input. We also apply a frequency reconstruction loss when training our models to better restore the high-frequency components lost due to the complicated PSF. We show that our method effectively removes the degradation caused by the UDC system and achieves state-of-the-art performance on a benchmark dataset.
{"title":"Restoration of High-Frequency Components in Under Display Camera Images","authors":"Youngjin Oh, G. Park, N. Cho","doi":"10.23919/APSIPAASC55919.2022.9979964","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979964","url":null,"abstract":"Under-Display Camera (UDC) systems have been developed to remove noticeable camera holes or notches and entirely cover the front side with the screen. As the name implies, UDCs are placed under the display, generally organic light-emitting diode (OLED) these days. Since the OLED panel is not transparent and consists of circuits and display devices, the light reaching the camera experiences a loss of photons and a complicated point spread function (PSF). As a result, the obtained images through the UDC system usually experi-ence a color shift, decreased intensity, complex artifacts due to the PSF, and loss/distortion in high-frequency details. To overcome these degradations, we exploit the multi-stage image restoration network and frequency loss function. The network utilizes deformable convolutions to solve the spatially-variant degradations in UDC images based on the fact that the kernel of deformable convolutions is dynamic and adaptive to input. We also apply frequency reconstruction loss when training our models to better restore the lost high-frequency components due to the complicated PSF. We show that our method effectively removes the degradation caused by the UDC system and achieves state-of-the-art performance on a benchmark dataset.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"285 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116854253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980032
Taiyang Guo, Sixia Li, M. Unoki, S. Okada
Speech-emotion recognition (SER) in noisy reverberant environments is a fundamental technique for real-world applications, including call center services and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find features for SER that are robust against noise and reverberation. Meanwhile, it has been shown that a daily noisy reverberant environment (signal-to-noise ratio greater than 10 dB and reverberation time less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the human auditory system, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans, so using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains the emotional components of speech signals in noisy reverberant environments, as the speech data, and a support vector machine as the classifier for emotion recognition. The experimental results indicate that, compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments, with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.
{"title":"Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition","authors":"Taiyang Guo, Sixia Li, M. Unoki, S. Okada","doi":"10.23919/APSIPAASC55919.2022.9980032","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980032","url":null,"abstract":"Speech-emotion recognition (SER) in noisy reverber-ant environments is a fundamental technique for real-world ap-plications, including call center service and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find robust features against noise and reverberation for SER. However, it has been proved that a daily noisy reverberant environment (signal-to-noise ratio is greater than 10 dB and reverberation time is less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the auditory system of human perception, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains emotional components of speech signals in noisy reverberant environments as speech data. We also used a support vector machine as the classifier to carry out emotion recognition. The experimental results indicate that compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115474304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979811
Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor
Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.
{"title":"Studying Human-Based Speaker Diarization and Comparing to State-of-the-Art Systems","authors":"Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor","doi":"10.23919/APSIPAASC55919.2022.9979811","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979811","url":null,"abstract":"Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116040071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980120
Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang
How to effectively classify short audio clips into acoustic scenes is a new challenge posed by Task 1 of the DCASE2022 challenge. This paper details our exploration of this problem and the architecture we used. Our architecture is based on SegNet, with an instance normalization layer added to normalize the activations of the previous layer at conv_block 1 of the encoder and deconv_block 2 of the decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. Six data augmentation methods were applied: mixup, time- and frequency-domain masking, image augmentation, auto level, pix2pix, and random crop. We applied three model compression schemes (pruning, quantization, and knowledge distillation) to reduce model complexity. The proposed system achieved higher classification accuracy than the baseline system. Our model achieves an average accuracy of 60.58% when tested on the test split of the TAU Urban Acoustic Scenes 2022 Mobile development dataset. After model compression, our model achieved an average accuracy of 54.11% with a model size of 127.2 K parameters, 8-bit quantization, and fewer than 30 MMACs.
{"title":"Classification of Short Audio Acoustic Scenes Based on Data Augmentation Methods","authors":"Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang","doi":"10.23919/APSIPAASC55919.2022.9980120","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980120","url":null,"abstract":"How to effectively classify short audio data into acoustic scenes is a new challenge proposed by task 1 of the DCASE2022 challenge. This paper details the exploration we made for this problem and the architecture we used. Our architecture is based on Segnet, adding an instance normalization layer to normalize the activations of the previous layer at conv_block 1 of encoder and deconv_block 2 of decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. A total of 6 data augmentation methods were applied as follows: mixup, time and frequency domain masking, image augmentation, auto level, pix2pix, and random crop. We applied three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieved higher classification accuracy than the baseline system. Our model can achieve an average accuracy of 60.58% when tested on the test set of TAU Urban Acoustic Scenes 2022 Mobile, development dataset. After model compression, our model achieved an average accuracy of 54.11% within the 127.2 K parameters size, 8-bit quantization, and MMACs less than 30 M.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115064845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}