Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980228
Kenta Yamada, Yoshiki Masuyama, Yukoh Wakabayashi, Nobutaka Ono
In this paper, we present a short-time frequency estimation method that can handle multiple sinusoids simultaneously. Frequency estimation is a fundamental problem in audio analysis. To achieve high temporal resolution, an approach based on a differential equation of a sinusoid, referred to as the sinusoidal constraint differential equation (SCDE), has been proposed. The SCDE-based method can efficiently and accurately estimate frequency even from a short-term signal. However, in terms of simultaneous estimation, only up to two sinusoids have been considered so far. In this paper, we extend this approach to three or more sinusoids. Our experimental results show that our method outperformed existing methods based on the discrete Fourier transform.
{"title":"Simultaneous Frequency Estimation for Three or More Sinusoids Based on Sinusoidal Constraint Differential Equation","authors":"Kenta Yamada, Yoshiki Masuyama, Yukoh Wakabayashi, Nobutaka Ono","doi":"10.23919/APSIPAASC55919.2022.9980228","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980228","url":null,"abstract":"In this paper, we present a short-time frequency estimation method that can handle multiple sinusoids simultaneously. Frequency estimation is a fundamental problem in audio analysis. For realizing high-temporal resolution, an approach based on a differential equation of a sinusoid, which is referred to as the sinusoidal constraint differential equation (SCDE), has been proposed. The SCDE-based method can efficiently and accurately estimate frequency even from a short-term signal. However, in terms of simultaneous estimation, up to two sinusoids have been considered so far. In this paper, we extend this approach to three or more sinusoids. Our experimental results show that our method outperformed existing methods based on the discrete Fourier transform.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129084214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Music Information Retrieval is a crucial task with ample opportunities in the music industry. Currently, audio engineers have to create custom karaoke tracks for songs manually, and the technique of producing a high-quality karaoke track is not accessible to the public: specialised software such as Audacity is needed to generate karaoke. In this work, we review methods and approaches that produce a high-quality karaoke track by simply and quickly separating the vocals from a song containing both vocal and instrumental components, without requiring any specific audio processing software. We review techniques and approaches for generating karaoke such as Spleeter, Hybrid Demucs, D3Net, Open-Unmix, and Sams-Net, which are based on current state-of-the-art machine learning and deep learning techniques. We believe that this review will serve as a good resource for researchers working in this field.
{"title":"Karaoke Generation from songs: recent trends and opportunities","authors":"Preet Patel, Ansh Ray, Khushboo Thakkar, Kahan Sheth, Sapan H. Mankad","doi":"10.23919/APSIPAASC55919.2022.9980133","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980133","url":null,"abstract":"Music Information Retrieval is a crucial task which has ample opportunities in Music Industries. Currently, audio engineers have to create custom karaoke tracks manually for songs. The technique of producing a high-quality karaoke track for a song is not accessible to the public. Audacity and other specialised software must be needed to generate karaoke. In this work, we review different methods and approaches, which give a high-quality karaoke track by presenting a simple and quick separation of vocals from a given song with both vocal and instrumental components. It does not need the use of any specific audio processing software. We review techniques and approaches for generating karaoke such as Spleeter, Hybrid Demucs, D3Net, Open-Unmix, Sams-Net etc. These approaches are based on current state-of-the-art machine learning and deep learning techniques. We believe that this review will serve the purpose as a good resource for researchers working in this field.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134211165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979911
Atikhun Thongpool, D. Hormdee, Raksit Chutipakdeevong, Wasan Tansakul
Nowadays, technology is evolving and transforming rapidly. Many innovative technologies have emerged, including artificial intelligence, biomedical engineering, automation systems, quantum computing, big data, and blockchain. These emerging technologies have also transformed our lifestyles, and this transformation inevitably requires a new set of skills: computational and logical thinking, which can be consolidated into coding skills. Several novel educational media and teaching materials have been promoted. Current educational kits on the market can be classified into three main categories by structure (physical, virtual, or hybrid kits), while coding styles are either block-based or text-based. This paper presents an educational multi-purpose kit for coding and robotic design that combines a hybrid kit structure with a block-based coding style. Its connection scheme is designed as wired/wireless plug-and-play via magnetic connection. The implemented prototype can support various learning activities, including emulating three of the five basic human senses (touch, hearing, and sight) via sensors and actuators. A use case on shape recognition using computer vision is illustrated to show how the implemented system works.
{"title":"Educational Multi-Purpose Kit for Coding and Robotic Design","authors":"Atikhun Thongpool, D. Hormdee, Raksit Chutipakdeevong, Wasan Tansakul","doi":"10.23919/APSIPAASC55919.2022.9979911","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979911","url":null,"abstract":"Nowadays, there has been a rapid evolution and transformation of technology. Many innovative technologies have emerged including artificial intelligence, biomedical engineering, automation systems, quantum computing, big data, blockchain, etc. These emerging technologies have also transformed our lifestyles. This transformation has then inevitably required a new set of skills; Computational Thinking/Logical Thinking which can be compiled into Coding skills. Several novel educational media and teaching materials have been promoted. Current educational kits in the market can be classified into 3 main categories. These structures vary from physical kits vs virtual kits vs hybrid kits, while coding styles vary from block-based vs text-based. This paper presents an educational multi-purpose kit for coding and robotic design which has a hybrid kits structure with block-based coding style. Its connection scheme has been designed as wired/wireless plug-and-play via magnetic. The implemented prototype could be resilient for various learning activities, including emulating three (touch, hearing and sight) out of five basic human senses via sensors and actuators. A use case on shape recognition, using computer vision, has been illustrated to show how the implemented system works.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"492 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132196612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980293
Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu
Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activities have been proven effective to improve the performance of SVS tasks. However, embeddings related to singers have never been studied before. In this paper, we propose VocEmb4SVS, an SVS framework to utilize vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art performance on the MUSDB18 dataset with an SDR of 9.56 dB on vocals.
{"title":"VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings","authors":"Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, Zhiyong Wu","doi":"10.23919/APSIPAASC55919.2022.9980293","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980293","url":null,"abstract":"Deep learning-based methods have shown promising performance on singing voice separation (SVS). Recently, embeddings related to lyrics and voice activities have been proven effective to improve the performance of SVS tasks. However, embeddings related to singers have never been studied before. In this paper, we propose VocEmb4SVS, an SVS framework to utilize vocal embeddings of the singer as auxiliary knowledge for SVS conditioning. First, a pre-trained separation network is employed to obtain pre-separated vocals from the mixed music signals. Second, a vocal encoder is trained to extract vocal embeddings from the pre-separated vocals. Finally, the vocal embeddings are integrated into the separation network to improve SVS performance. Experimental results show that our proposed method achieves state-of-the-art performance on the MUSDB18 dataset with an SDR of 9.56 dB on vocals.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132818689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979893
Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima
This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitches (e.g., MIDI note numbers and F0s), they can be applied only to pitched sounds. Global timbres unique to instruments and local variations (e.g., expressions and playstyles) are also treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and F0s (if existing) are considered to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations from randomly pitch-shifted input sounds, and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of timbral and variation features from pitch features, input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. The experiments showed that the proposed method can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.
{"title":"Unsupervised Disentanglement of Timbral, Pitch, and Variation Features From Musical Instrument Sounds With Random Perturbation","authors":"Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima","doi":"10.23919/APSIPAASC55919.2022.9979893","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979893","url":null,"abstract":"This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitches (e.g., MIDI note numbers and FOs), they can be applied to only pitched sounds. Global timbres unique to instruments and local variations (e.g., expressions and playstyles) are also treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and FOs (if existing) are considered to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations from randomly pitch-shifted input sounds and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of timbral and variation features from pitch features, input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. The experiments showed that the proposed method can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122507691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980030
Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul
A 4G-signal RSSI recommendation system is one kind of monitoring method: the usage rate of local users is used to improve the quality of the signal in cells that receive increased traffic. This paper proposes a method for predicting the data-rate traffic used within the area at each location. Comparing the performance of the models, the RMSE of Gradient Boost Tree, Decision Tree, and Random Forest was 0.291, 0.316, and 0.346, respectively; the corresponding correlations were 0.976, 0.971, and 0.966; and the accuracies were 97.8%, 97.4%, and 97%. For the ensemble learning method, the RMSE, correlation, and accuracy were 0.312, 0.972, and 97.5%.
{"title":"4G Signal RSSI Recommendation System for ISP Quality of Service Improvement","authors":"Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul","doi":"10.23919/APSIPAASC55919.2022.9980030","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980030","url":null,"abstract":"4G Signal RSSI Recommendation System is one of the monitoring methods. The usage rate of local users improves the quality of traffic signals to cycle to receive increased traffic. This paper proposed a method for Prediction and the traffic of data rates used within the area at each location. The result of the proposed approach comparing the performance of models was: the RMSE Gradient Boost Tree, Decision Tree, and Random Forest were 0.291, 0.316 and 0.346, respectively. The correlation will be 0.976, 0.971, and 0.966 for Gradient Boost Tree, Decision Tree, and Random Forest, respectively, and the accuracy of Gradient Boost Tree, Decision Tree, and Random Forest were 97.8%, 97.4%, and 97%, respectively. The results of ensemble learning methods, the RMSE, correlation, and accuracy were: 0.312, 0.972, and 97.5%.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979964
Youngjin Oh, G. Park, N. Cho
Under-Display Camera (UDC) systems have been developed to remove noticeable camera holes or notches and cover the entire front side with the screen. As the name implies, UDCs are placed under the display, which these days is generally an organic light-emitting diode (OLED) panel. Since the OLED panel is not transparent and consists of circuits and display devices, the light reaching the camera suffers a loss of photons and a complicated point spread function (PSF). As a result, images obtained through a UDC system usually exhibit a color shift, decreased intensity, complex artifacts due to the PSF, and loss or distortion of high-frequency details. To overcome these degradations, we exploit a multi-stage image restoration network and a frequency loss function. The network utilizes deformable convolutions to handle the spatially-variant degradations in UDC images, based on the fact that the kernel of a deformable convolution is dynamic and adaptive to the input. We also apply a frequency reconstruction loss when training our models to better restore the high-frequency components lost due to the complicated PSF. We show that our method effectively removes the degradation caused by the UDC system and achieves state-of-the-art performance on a benchmark dataset.
{"title":"Restoration of High-Frequency Components in Under Display Camera Images","authors":"Youngjin Oh, G. Park, N. Cho","doi":"10.23919/APSIPAASC55919.2022.9979964","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979964","url":null,"abstract":"Under-Display Camera (UDC) systems have been developed to remove noticeable camera holes or notches and entirely cover the front side with the screen. As the name implies, UDCs are placed under the display, generally organic light-emitting diode (OLED) these days. Since the OLED panel is not transparent and consists of circuits and display devices, the light reaching the camera experiences a loss of photons and a complicated point spread function (PSF). As a result, the obtained images through the UDC system usually experi-ence a color shift, decreased intensity, complex artifacts due to the PSF, and loss/distortion in high-frequency details. To overcome these degradations, we exploit the multi-stage image restoration network and frequency loss function. The network utilizes deformable convolutions to solve the spatially-variant degradations in UDC images based on the fact that the kernel of deformable convolutions is dynamic and adaptive to input. We also apply frequency reconstruction loss when training our models to better restore the lost high-frequency components due to the complicated PSF. We show that our method effectively removes the degradation caused by the UDC system and achieves state-of-the-art performance on a benchmark dataset.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"285 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116854253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980032
Taiyang Guo, Sixia Li, M. Unoki, S. Okada
Speech-emotion recognition (SER) in noisy reverberant environments is a fundamental technique for real-world applications, including call center services and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find features for SER that are robust against noise and reverberation. Meanwhile, it has been shown that a daily noisy reverberant environment (signal-to-noise ratio greater than 10 dB and reverberation time less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the human auditory system, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans, so using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains the emotional components of speech signals in noisy reverberant environments, as the speech data, and a support vector machine as the classifier for emotion recognition. The experimental results indicate that, compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments, with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.
{"title":"Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition","authors":"Taiyang Guo, Sixia Li, M. Unoki, S. Okada","doi":"10.23919/APSIPAASC55919.2022.9980032","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980032","url":null,"abstract":"Speech-emotion recognition (SER) in noisy reverber-ant environments is a fundamental technique for real-world ap-plications, including call center service and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find robust features against noise and reverberation for SER. However, it has been proved that a daily noisy reverberant environment (signal-to-noise ratio is greater than 10 dB and reverberation time is less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the auditory system of human perception, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains emotional components of speech signals in noisy reverberant environments as speech data. We also used a support vector machine as the classifier to carry out emotion recognition. The experimental results indicate that compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115474304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9979811
Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor
Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.
{"title":"Studying Human-Based Speaker Diarization and Comparing to State-of-the-Art Systems","authors":"Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor","doi":"10.23919/APSIPAASC55919.2022.9979811","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979811","url":null,"abstract":"Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116040071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-11-07  DOI: 10.23919/APSIPAASC55919.2022.9980120
Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang
How to effectively classify short audio clips into acoustic scenes is a new challenge posed by Task 1 of the DCASE2022 challenge. This paper details our exploration of this problem and the architecture we used. Our architecture is based on SegNet, with an instance normalization layer added to normalize the activations of the previous layer at conv_block 1 of the encoder and deconv_block 2 of the decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. Six data augmentation methods were applied: mixup, time- and frequency-domain masking, image augmentation, auto level, pix2pix, and random crop. We applied three model compression schemes (pruning, quantization, and knowledge distillation) to reduce model complexity. The proposed system achieved higher classification accuracy than the baseline system. Our model achieves an average accuracy of 60.58% when tested on the test split of the TAU Urban Acoustic Scenes 2022 Mobile development dataset. After model compression, our model achieved an average accuracy of 54.11% with a model size of 127.2 K parameters, 8-bit quantization, and fewer than 30 MMACs.
{"title":"Classification of Short Audio Acoustic Scenes Based on Data Augmentation Methods","authors":"Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang","doi":"10.23919/APSIPAASC55919.2022.9980120","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980120","url":null,"abstract":"How to effectively classify short audio data into acoustic scenes is a new challenge proposed by task 1 of the DCASE2022 challenge. This paper details the exploration we made for this problem and the architecture we used. Our architecture is based on Segnet, adding an instance normalization layer to normalize the activations of the previous layer at conv_block 1 of encoder and deconv_block 2 of decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. A total of 6 data augmentation methods were applied as follows: mixup, time and frequency domain masking, image augmentation, auto level, pix2pix, and random crop. We applied three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieved higher classification accuracy than the baseline system. Our model can achieve an average accuracy of 60.58% when tested on the test set of TAU Urban Acoustic Scenes 2022 Mobile, development dataset. After model compression, our model achieved an average accuracy of 54.11% within the 127.2 K parameters size, 8-bit quantization, and MMACs less than 30 M.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115064845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}