
Latest publications in APSIPA Transactions on Signal and Information Processing

Two-stage pyramidal convolutional neural networks for image colorization
IF 3.2 Q1 Computer Science Pub Date : 2021-10-08 DOI: 10.1017/ATSIP.2021.13
Yu-Jen Wei, Tsu-Tsai Wei, Tien-Ying Kuo, Po-Chyi Su
The development of colorization algorithms through deep learning has become the current research trend. These algorithms colorize grayscale images automatically and quickly, but the colors produced are usually subdued and have low saturation. This research addresses this issue of existing algorithms by presenting a two-stage convolutional neural network (CNN) structure with the first and second stages being a chroma map generation network and a refinement network, respectively. To begin, we convert the color space of an image from RGB to HSV to predict its low-resolution chroma components and therefore reduce the computational complexity. Following that, the first-stage output is zoomed in and its detail is enhanced with a pyramidal CNN, resulting in a colorized image. Experiments show that, while using fewer parameters, our methodology produces results with more realistic color and higher saturation than existing methods.
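The abstract only outlines the pipeline, so the following PyTorch sketch is a rough illustration of the two-stage idea under stated assumptions: a first-stage network predicts low-resolution H and S channels from the grayscale input (treated as the V channel), and a second-stage network upsamples and refines them before the HSV result is assembled. The layer counts, channel widths, and the plain (non-pyramidal) refinement block are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChromaMapNet(nn.Module):
    """Stage 1: predict low-resolution H and S channels from the grayscale (V) input."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 2, 3, padding=1)  # 2 chroma channels (H, S)

    def forward(self, gray):
        return torch.sigmoid(self.head(self.encoder(gray)))  # low-res chroma in [0, 1]

class RefinementNet(nn.Module):
    """Stage 2: upsample the chroma map and refine it, guided by the full-resolution gray image."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, chroma_lr, gray):
        chroma_up = F.interpolate(chroma_lr, size=gray.shape[-2:], mode='bilinear', align_corners=False)
        return torch.sigmoid(self.refine(torch.cat([chroma_up, gray], dim=1)))

gray = torch.rand(1, 1, 128, 128)        # grayscale input acts as the V channel
chroma = RefinementNet()(ChromaMapNet()(gray), gray)
hsv = torch.cat([chroma, gray], dim=1)   # stack H, S, V; convert to RGB for display
print(hsv.shape)                         # torch.Size([1, 3, 128, 128])
```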
Citations: 1
3D skeletal movement-enhanced emotion recognition networks
IF 3.2 Q1 Computer Science Pub Date : 2021-08-05 DOI: 10.1017/ATSIP.2021.11
Jiaqi Shi, Chaoran Liu, C. Ishi, H. Ishiguro
Automatic emotion recognition has become an important trend in the fields of human–computer natural interaction and artificial intelligence. Although gesture is one of the most important components of nonverbal communication and has a considerable impact on emotion recognition, it is rarely considered in emotion recognition research. An important reason is the lack of large open-source emotional databases containing skeletal movement data. In this paper, we extract three-dimensional skeleton information from videos and apply the method to the IEMOCAP database to add a new modality. We propose an attention-based convolutional neural network which takes the extracted data as input to predict the speakers’ emotional state. We also propose a graph attention-based fusion method that combines our model with models using other modalities, to provide complementary information in the emotion classification task and effectively fuse multimodal cues. The combined model utilizes audio signals, text information, and skeletal data. It significantly outperforms the bimodal model and other fusion strategies, proving the effectiveness of the method.
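The graph attention-based fusion itself is not detailed in the abstract; as a minimal sketch, the code below shows the simpler idea of attention-weighted pooling over per-modality embeddings (audio, text, and skeleton) before emotion classification. The 128-dimensional embeddings and the four-class output are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Weight per-modality embeddings with learned attention scores, then classify emotion."""
    def __init__(self, dim=128, n_classes=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)           # one scalar score per modality embedding
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, audio_emb, text_emb, skel_emb):
        mods = torch.stack([audio_emb, text_emb, skel_emb], dim=1)  # (B, 3, dim)
        weights = torch.softmax(self.score(mods), dim=1)            # (B, 3, 1)
        fused = (weights * mods).sum(dim=1)                         # attention-weighted sum
        return self.classifier(fused)

fusion = AttentionFusion()
logits = fusion(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4])
```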
Citations: 1
Compression efficiency analysis of AV1, VVC, and HEVC for random access applications
IF 3.2 Q1 Computer Science Pub Date : 2021-07-13 DOI: 10.1017/ATSIP.2021.10
Tung Nguyen, D. Marpe
AOM Video 1 (AV1) and Versatile Video Coding (VVC) are the outcome of two recent independent video coding technology developments. Although VVC is the successor of High Efficiency Video Coding (HEVC) in the lineage of international video coding standards jointly developed by ITU-T and ISO/IEC within an open and public standardization process, AV1 is a video coding scheme that was developed by the industry consortium Alliance for Open Media (AOM) and that has its technological roots in Google's proprietary VP9 codec. This paper presents a compression efficiency evaluation for the AV1, VVC, and HEVC video coding schemes in a typical video compression application requiring random access. The latter is an important property, without which essential functionalities in digital video broadcasting or streaming could not be provided. For the evaluation, we employed a controlled experimental environment that basically follows the guidelines specified in the Common Test Conditions of the Joint Video Experts Team. As representatives of the corresponding video coding schemes, we selected their freely available reference software implementations. Depending on the application-specific frequency of random access points, the experimental results show averaged bit-rate savings of about 10–15% for AV1 and 36–37% for the VVC reference encoder implementation (VTM), both relative to the HEVC reference encoder implementation (HM) and by using a test set of video sequences with different characteristics regarding content and resolution. A direct comparison between VTM and AV1 reveals averaged bit-rate savings of about 25–29% for VTM, while the averaged encoding and decoding run times of VTM relative to those of AV1 are around 300% and 270%, respectively.
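The abstract reports averaged bit-rate savings; under the JVET Common Test Conditions such numbers are conventionally computed as Bjontegaard-delta (BD) rates, so the sketch below implements a standard BD-rate calculation on made-up rate-distortion points. It is an illustration of the metric, not a reproduction of the paper's measurements.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta bit-rate: average bit-rate difference (%) of the test codec
    relative to the anchor over the overlapping quality range."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # cubic fit of log-rate as a function of PSNR
    pa = np.polyfit(psnr_anchor, lr_a, 3)
    pt = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)
    avg_diff = (np.polyval(it, hi) - np.polyval(it, lo) -
                (np.polyval(ia, hi) - np.polyval(ia, lo))) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# toy rate-distortion points (kbps, dB); a negative result means the test codec saves bit-rate
anchor = ([1000, 1800, 3200, 6000], [34.0, 36.0, 38.0, 40.0])
test   = ([ 700, 1250, 2300, 4300], [34.1, 36.1, 38.0, 40.1])
print(f"BD-rate: {bd_rate(anchor[0], anchor[1], test[0], test[1]):.1f} %")
```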
Citations: 2
TGHop: an explainable, efficient, and lightweight method for texture generation
IF 3.2 Q1 Computer Science Pub Date : 2021-07-08 DOI: 10.1017/ATSIP.2021.15
Xuejing Lei, Ganning Zhao, Kaitai Zhang, C. J. Kuo
An explainable, efficient, and lightweight method for texture generation, called TGHop (an acronym of Texture Generation PixelHop), is proposed in this work. Although synthesis of visually pleasant texture can be achieved by deep neural networks, the associated models are large in size, difficult to explain in theory, and computationally expensive in training. In contrast, TGHop is small in its model size, mathematically transparent, efficient in training and inference, and able to generate high-quality texture. Given an exemplary texture, TGHop first crops many sample patches out of it to form a collection of sample patches called the source. Then, it analyzes pixel statistics of samples from the source and obtains a sequence of fine-to-coarse subspaces for these patches by using the PixelHop++ framework. To generate texture patches with TGHop, we begin with the coarsest subspace, which is called the core, and attempt to generate samples in each subspace by following the distribution of real samples. Finally, texture patches are stitched to form texture images of a large size. It is demonstrated by experimental results that TGHop can generate texture images of superior quality with a small model size and at a fast speed.
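The PixelHop++ subspace analysis and the core sampling are the substance of TGHop and are not reproduced here; the numpy sketch below only illustrates the surrounding bookkeeping, collecting the source patches from an exemplar and stitching generated patches into a large image. The patch size, stride, and lack of seam blending are simplifying assumptions.

```python
import numpy as np

def crop_patches(texture, patch=32, stride=16):
    """Collect overlapping sample patches from an exemplar texture (the 'source')."""
    h, w = texture.shape[:2]
    return np.array([texture[y:y + patch, x:x + patch]
                     for y in range(0, h - patch + 1, stride)
                     for x in range(0, w - patch + 1, stride)])

def stitch_patches(patches, grid=(8, 8)):
    """Tile generated patches into one large texture image (no blending of seams)."""
    rows, cols = grid
    patch = patches.shape[1]
    out = np.zeros((rows * patch, cols * patch) + patches.shape[3:], dtype=patches.dtype)
    for i in range(rows * cols):
        r, c = divmod(i, cols)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = patches[i % len(patches)]
    return out

exemplar = np.random.rand(128, 128, 3)   # stand-in for an exemplar texture image
source = crop_patches(exemplar)          # source collection of sample patches
big = stitch_patches(source[:64])        # 8 x 8 mosaic of patches
print(source.shape, big.shape)           # (49, 32, 32, 3) (256, 256, 3)
```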
Citations: 11
A protection method of trained CNN model with a secret key from unauthorized access
IF 3.2 Q1 Computer Science Pub Date : 2021-05-31 DOI: 10.1017/ATSIP.2021.9
AprilPyone Maungmaung, H. Kiya
In this paper, we propose a novel method for protecting convolutional neural network models with a secret key set so that unauthorized users without the correct key set cannot access trained models. The method protects a model not only from copyright infringement but also from unauthorized use of its functionality, without any noticeable overhead. We introduce three block-wise transformations with a secret key set to generate learnable transformed images: pixel shuffling, negative/positive transformation, and format-preserving Feistel-based encryption. Protected models are trained by using transformed images. The results of experiments with the CIFAR and ImageNet datasets show that the performance of a protected model was close to that of non-protected models when the key set was correct, while the accuracy dropped severely when an incorrect key set was given. The protected model was also demonstrated to be robust against various attacks. Compared with the state-of-the-art model protection with passports, the proposed method does not add any layers to the network, and therefore there is no overhead during the training and inference processes.
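Of the three key-dependent transformations, pixel shuffling is the simplest to illustrate. The numpy sketch below derives a single block-wise permutation from a secret key and applies it to a batch of images; using numpy's seeded generator as the key schedule and a block size of 4 are assumptions, and the negative/positive transformation and format-preserving Feistel encryption are omitted.

```python
import numpy as np

def block_shuffle(images, key, block=4):
    """Shuffle pixels inside each non-overlapping block with one key-derived permutation,
    applied identically to every block and every image (one of the three transformations)."""
    rng = np.random.default_rng(key)                 # the secret key seeds the permutation
    perm = rng.permutation(block * block)
    n, h, w, c = images.shape
    x = images.reshape(n, h // block, block, w // block, block, c)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(n, h // block, w // block, block * block, c)
    x = x[:, :, :, perm, :]                          # apply the secret permutation
    x = x.reshape(n, h // block, w // block, block, block, c).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(n, h, w, c)

imgs = np.random.rand(2, 32, 32, 3).astype(np.float32)   # CIFAR-sized toy batch
protected_inputs = block_shuffle(imgs, key=12345)
# Training on transformed images and transforming test inputs with the same key mimics the
# "correct key" setting; a different key yields a different permutation of the pixels.
wrong_key_inputs = block_shuffle(imgs, key=99999)
print(protected_inputs.shape, np.allclose(protected_inputs, wrong_key_inputs))  # (2, 32, 32, 3) False
```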
Citations: 18
The future of biometrics technology: from face recognition to related applications
IF 3.2 Q1 Computer Science Pub Date : 2021-05-28 DOI: 10.1017/ATSIP.2021.8
Hitoshi Imaoka, H. Hashimoto, Koichi Takahashi, Akinori F. Ebihara, Jianquan Liu, Akihiro Hayasaka, Yusuke Morishita, K. Sakurai
Biometric recognition technologies have become more important in modern society owing to their convenience amid recent informatization and the spread of network services. Among such technologies, face recognition is one of the most convenient and practical because it enables authentication from a distance without requiring any manual authentication operations. However, face recognition is susceptible to changes in facial appearance caused by aging, surrounding lighting, and posture, and a number of technical challenges still need to be resolved. Recently, remarkable progress has been made thanks to the advent of deep learning methods. In this position paper, we provide an overview of face recognition technology and introduce its related applications, including face presentation attack detection, gaze estimation, person re-identification, and image data mining. We also discuss the research challenges that still need to be addressed.
Citations: 9
Audio-to-score singing transcription based on a CRNN-HSMM hybrid model
IF 3.2 Q1 Computer Science Pub Date : 2021-04-20 DOI: 10.1017/ATSIP.2021.4
Ryo Nishikimi, Eita Nakamura, Masataka Goto, Kazuyoshi Yoshii
This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of a singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most-likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and the integration of the musical language model with the acoustic model has a positive effect on the AST performance.
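The actual decoder couples a CRNN acoustic model with a hidden semi-Markov language model that has explicit note durations and key conditioning; the sketch below shows only plain first-order Viterbi decoding over frame-wise note posteriors, with random stand-ins for both models. The 13-state note alphabet and 50-frame length are arbitrary choices.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely state (note) sequence given frame-wise emission log-probs from an
    acoustic model and transition log-probs from a note-level language model."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # indexed as (previous state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
S, T = 13, 50                                        # e.g. 12 pitch classes + rest, 50 tatum frames
log_emit = np.log(rng.dirichlet(np.ones(S), size=T)) # stand-in for CRNN note posteriors
log_trans = np.log(rng.dirichlet(np.ones(S), size=S))
log_init = np.log(np.full(S, 1.0 / S))
print(viterbi(log_emit, log_trans, log_init)[:10])
```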
Citations: 10
Speech emotion recognition based on listener-dependent emotion perception models
IF 3.2 Q1 Computer Science Pub Date : 2021-04-20 DOI: 10.1017/ATSIP.2021.7
Atsushi Ando, Takeshi Mori, Satoshi Kobashikawa, T. Toda
This paper presents a novel speech emotion recognition scheme that leverages the individuality of emotion perception. Most conventional methods simply poll multiple listeners and directly model the majority decision as the perceived emotion. However, emotion perception varies with the listener, which forces the conventional methods with their single models to create complex mixtures of emotion perception criteria. In order to mitigate this problem, we propose a majority-voted emotion recognition framework that constructs listener-dependent (LD) emotion recognition models. The LD model can estimate not only listener-wise perceived emotion, but also majority decision by averaging the outputs of the multiple LD models. Three LD models, fine-tuning, auxiliary input, and sub-layer weighting, are introduced, all of which are inspired by successful domain-adaptation frameworks in various speech processing tasks. Experiments on two emotional speech datasets demonstrate that the proposed approach outperforms the conventional emotion recognition frameworks in not only majority-voted but also listener-wise perceived emotion recognition.
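In the paper the listener-dependent models are obtained by fine-tuning, auxiliary inputs, or sub-layer weighting of a shared recognizer; in the sketch below, independent toy MLPs stand in for them simply to show how listener-wise predictions and the averaged majority-vote decision are read off. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

n_listeners, n_classes, feat_dim = 5, 4, 64

# one listener-dependent (LD) recognizer per annotator; small stand-in MLPs here
ld_models = [nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
             for _ in range(n_listeners)]

def majority_vote_predict(x):
    """Average the per-listener posteriors to approximate the majority-voted emotion."""
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in ld_models])  # (L, B, C)
    listener_wise = probs.argmax(dim=-1)         # each listener's own perceived emotion
    majority = probs.mean(dim=0).argmax(dim=-1)  # majority decision from averaged outputs
    return listener_wise, majority

x = torch.randn(8, feat_dim)                     # stand-in utterance-level speech features
per_listener, majority = majority_vote_predict(x)
print(per_listener.shape, majority.shape)        # torch.Size([5, 8]) torch.Size([8])
```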
Citations: 3
Automatic Deception Detection using Multiple Speech and Language Communicative Descriptors in Dialogs
IF 3.2 Q1 Computer Science Pub Date : 2021-04-16 DOI: 10.1017/ATSIP.2021.6
Huang-Cheng Chou, Yi-Wen Liu, Chi-Chun Lee
While deceptive behaviors are a natural part of human life, it is well known that humans are generally bad at detecting deception. In this study, we present an automatic deception detection framework that comprehensively integrates prior domain knowledge about deceptive behavior. Specifically, we compute acoustics, textual information, implicatures with non-verbal behaviors, and conversational temporal dynamics to improve automatic deception detection in dialogs. The proposed model reaches state-of-the-art performance on the Daily Deceptive Dialogues corpus of Mandarin (DDDM) database, with 80.61% unweighted accuracy recall in deception recognition. In further analyses, we reveal that (i) the deceivers’ deception behaviors can be observed from the interrogators’ behaviors in the conversational temporal dynamics features and (ii) some of the acoustic features (e.g. loudness and MFCC) and textual features are significant and effective indicators for detecting deception behaviors.
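The abstract does not enumerate the conversational temporal dynamics descriptors, so the sketch below computes a few plausible ones (response latency, turn duration, and a speaking-time ratio) from a toy turn segmentation. These particular features are assumptions chosen for illustration; in the paper they would sit alongside the acoustic and textual descriptors.

```python
import numpy as np

def temporal_dynamics(turns):
    """Simple dialog-level temporal descriptors: turn durations, response latencies,
    and the deceiver/interrogator speaking-time ratio."""
    durs = {"interrogator": [], "deceiver": []}
    latencies = []
    for prev, cur in zip(turns, turns[1:]):
        latencies.append(max(0.0, cur["start"] - prev["end"]))   # gap before the reply
    for t in turns:
        durs[t["speaker"]].append(t["end"] - t["start"])
    return {
        "mean_latency": float(np.mean(latencies)),
        "mean_deceiver_turn": float(np.mean(durs["deceiver"])),
        "speaking_ratio": float(np.sum(durs["deceiver"]) / np.sum(durs["interrogator"])),
    }

dialog = [  # toy turn segmentation with start/end times in seconds
    {"speaker": "interrogator", "start": 0.0, "end": 2.1},
    {"speaker": "deceiver",     "start": 2.8, "end": 6.0},
    {"speaker": "interrogator", "start": 6.2, "end": 7.0},
    {"speaker": "deceiver",     "start": 7.9, "end": 9.5},
]
print(temporal_dynamics(dialog))
```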
Citations: 5
Analyzing public opinion on COVID-19 through different perspectives and stages.
IF 3.2 Q1 Computer Science Pub Date : 2021-03-17 eCollection Date: 2021-01-01 DOI: 10.1017/ATSIP.2021.5
Yuqi Gao, Hang Hua, Jiebo Luo

In recent months, COVID-19 has become a global pandemic and has had a huge impact on the world. People under different conditions have very different attitudes toward the epidemic. Due to the real-time and large-scale nature of social media, we can continuously obtain a massive amount of public opinion information related to the epidemic from social media. In particular, researchers may ask questions such as "how is the public reacting to COVID-19 in China during different stages of the pandemic?", "what factors affect the public opinion orientation in China?", and so on. To answer such questions, we analyze the pandemic-related public opinion information on Weibo, China's largest social media platform. Specifically, we first collected a large number of COVID-19-related public opinion microblogs. We then use a sentiment classifier to recognize and analyze different groups of users' opinions. In the collected sentiment-oriented microblogs, we track public opinion through different stages of the COVID-19 pandemic. Furthermore, we analyze key factors that might have an impact on public opinion about COVID-19 (e.g. users in different provinces or users with different education levels). Empirical results show that public opinion varies along with these key factors. In addition, we analyze public attitudes toward different topics of public concern, such as staying at home and quarantine. In summary, we uncover interesting patterns of users and events that offer an insight into the world through the lens of a major crisis.
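The full pipeline (Weibo collection plus a trained sentiment classifier) cannot be reconstructed from the abstract; the toy pandas snippet below only illustrates the final aggregation step, bucketing per-post sentiment scores by month (a stand-in for pandemic stage) and by province. All records and scores are invented for illustration.

```python
import pandas as pd

# toy microblog records with a predicted sentiment score in [0, 1] from a classifier
posts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-20", "2020-02-05", "2020-02-20", "2020-03-10", "2020-04-01"]),
    "province": ["Hubei", "Hubei", "Beijing", "Guangdong", "Beijing"],
    "sentiment": [0.2, 0.3, 0.5, 0.6, 0.7],
})

# track average sentiment per month (a proxy for pandemic stage) and per province
by_stage = posts.set_index("date")["sentiment"].resample("MS").mean()
by_province = posts.groupby("province")["sentiment"].mean()
print(by_stage)
print(by_province)
```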

Citations: 0