2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP): Latest Publications

Open-Source RTP Library for High-Speed 4K HEVC Video Streaming
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287162
Aaro Altonen, Joni Räsänen, Jaakko Laitinen, Marko Viitanen, Jarno Vanne
Efficient transport technologies for High Efficiency Video Coding (HEVC) are key enablers for economical 4K video transmission in current telecommunication networks. This paper introduces a novel open-source Real-time Transport Protocol (RTP) library called uvgRTP for high-speed 4K HEVC video streaming. Our library supports the latest RFC 3550 specification for RTP and the associated RFC 7798 RTP payload format for HEVC. It is written in C++ under a permissive 2-clause BSD license, runs on both Linux and Windows, and offers a user-friendly interface. Our experiments on an Intel Core i7-4770 CPU show that uvgRTP is able to stream HEVC video at 5.0 Gb/s over a local 10 Gb/s network. It attains 4.4 times the peak goodput of the state-of-the-art FFmpeg multimedia framework with 92.1% lower latency, and it also outperforms LIVE555 with over double the goodput and 82.3% lower latency. These results indicate that uvgRTP is currently the fastest open-source RTP library for 4K HEVC video streaming.
Citations: 7
DEMI: Deep Video Quality Estimation Model using Perceptual Video Quality Dimensions
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287080
Saman Zadtootaghaj, Nabajeet Barman, Rakesh Rao Ramachandra Rao, Steve Göring, M. Martini, A. Raake, S. Möller
Existing work in the field of quality assessment focuses separately on gaming and non-gaming content. Alongside traditional modeling approaches, deep learning based approaches have been used to develop quality models due to their high prediction accuracy. In this paper, we present a deep learning based quality estimation model covering both gaming and non-gaming videos. The model is developed in three phases. First, a convolutional neural network (CNN) is trained on an objective metric, which allows the CNN to learn video artifacts such as blurriness and blockiness. Next, the model is fine-tuned on a small image quality dataset using blockiness and blurriness ratings. Finally, a Random Forest is used to pool frame-level predictions and temporal information of videos in order to predict the overall video quality. The lightweight, low-complexity nature of the model makes it suitable for real-time applications covering both gaming and non-gaming content, while achieving performance similar to the existing state-of-the-art model NDNetGaming. The model implementation for testing is available on GitHub.
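The third phase, pooling frame-level predictions over time, can be sketched in a few lines; the statistics below (mean, std, worst-5% mean, linear trend) are an assumed, illustrative feature set for the Random Forest stage, not the paper's exact one:

```python
import numpy as np

def temporal_pool(frame_scores) -> np.ndarray:
    """Aggregate frame-level quality predictions into a clip-level feature
    vector. In a DEMI-like pipeline such pooled statistics would feed a
    Random Forest regressor that outputs the overall video quality."""
    s = np.asarray(frame_scores, dtype=float)
    k = max(1, int(0.05 * len(s)))      # worst 5% of frames
    worst = np.sort(s)[:k].mean()       # quality dips dominate perception
    trend = np.polyfit(np.arange(len(s)), s, 1)[0]  # is quality rising or falling?
    return np.array([s.mean(), s.std(), worst, trend])

# a clip whose per-frame quality improves linearly from 3.0 to 4.0
features = temporal_pool(np.linspace(3.0, 4.0, 30))
```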
Citations: 13
Multi-Plane Image Video Compression
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287083
Scott Janus, J. Boyce, S. Bhatia, J. Tanner, Atul Divekar, Penne Lee
Multiplane Images (MPI) are a new approach for storing volumetric content. An MPI represents a 3D scene within a view frustum, typically with 32 planes of texture and transparency information per camera. MPI literature to date has focused on still images, but applying MPI to video will require substantial compression in order to be viable for real-world productions. In this paper, we describe several techniques for compressing MPI video sequences by reducing pixel rate while maintaining acceptable visual quality. We focus on using traditional video compression codecs such as HEVC. While a new codec algorithm specifically tailored to MPI would likely achieve very good results, no devices exist today that support such a hypothetical MPI codec. By comparison, hundreds of millions of real-time HEVC decoders are present in laptops and TVs today.
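For background, an MPI view is rendered by alpha-compositing its texture/transparency planes back to front. A minimal sketch of that standard "over" operation (a generic MPI rendering step, not code from the paper):

```python
import numpy as np

def composite_mpi(planes_rgb, planes_alpha):
    """Back-to-front 'over' compositing of MPI planes.
    planes_rgb: (N, H, W, 3), planes_alpha: (N, H, W), plane 0 = farthest.
    Shows how N texture+transparency planes collapse into one view."""
    out = np.zeros(planes_rgb.shape[1:])
    for rgb, a in zip(planes_rgb, planes_alpha):
        out = rgb * a[..., None] + out * (1.0 - a[..., None])
    return out

# two planes: an opaque red far plane behind a half-transparent blue near plane
far = np.zeros((1, 2, 2, 3)); far[..., 0] = 1.0
near = np.zeros((1, 2, 2, 3)); near[..., 2] = 1.0
img = composite_mpi(np.concatenate([far, near]),
                    np.stack([np.ones((2, 2)), np.full((2, 2), 0.5)]))
```

Each pixel blends half red and half blue, as expected from a 0.5-alpha near plane.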
Citations: 0
Deep Learning for Individual Listening Zone
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287161
Giovanni Pepe, L. Gabrielli, S. Squartini, L. Cattani, Carlo Tripodi
A recent trend in car audio systems is the generation of Individual Listening Zones (ILZ), which improve phone-call privacy and reduce disturbance to other passengers without requiring headphones or earpieces. This is generally achieved by using loudspeaker arrays. In this paper, we describe an approach that achieves ILZ using general-purpose car loudspeakers, processing the signal through carefully designed Finite Impulse Response (FIR) filters. We propose a deep neural network approach for the design of filter coefficients in order to obtain a so-called bright zone, where the signal is clearly heard, and a dark zone, where the signal is attenuated. Additionally, the frequency response in the bright zone is constrained to be as flat as possible. Numerical experiments were performed on impulse responses measured with either one binaural pair or three binaural pairs per passenger. The results in terms of attenuation and flatness prove the viability of the approach.
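The bright/dark-zone objective can be illustrated with the standard acoustic-contrast metric for a single frequency bin. The 2-speaker transfer functions below are toy assumptions; the paper designs time-domain FIR filters with a DNN rather than per-bin weights:

```python
import numpy as np

def acoustic_contrast_db(H_bright, H_dark, w):
    """Acoustic contrast (bright/dark energy ratio, in dB) at one frequency.
    H_*: (mics, speakers) complex transfer functions, w: (speakers,) filter
    weights for that bin."""
    e_bright = np.linalg.norm(H_bright @ w) ** 2
    e_dark = np.linalg.norm(H_dark @ w) ** 2
    return 10.0 * np.log10(e_bright / e_dark)

# toy setup: two speakers add in phase at the bright mic and nearly cancel
# at the dark mic (weights slightly unbalanced to avoid a perfect null)
H_b = np.array([[1.0, 1.0]])
H_d = np.array([[1.0, -1.0]])
w = np.array([1.0, 0.9])
contrast = acoustic_contrast_db(H_b, H_d, w)
```

Even this crude weighting yields roughly 25 dB of contrast between the zones.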
Citations: 4
MultiANet: a Multi-Attention Network for Defocus Blur Detection
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287072
Zeyu Jiang, Xun Xu, Chao Zhang, Ce Zhu
Defocus blur detection is a challenging task because of obscure homogeneous regions and interference from background clutter. Most existing deep learning-based methods mainly focus on building wider or deeper networks to capture multi-level features, neglecting the feature relationships of intermediate layers and thus limiting the discriminative ability of the network. Moreover, fusing features at different levels has been demonstrated to be effective. However, direct integration without distinction is not optimal, because low-level features focus on fine details only and can be distracted by background clutter. To address these issues, we propose the Multi-Attention Network for stronger discriminative learning and spatially guided low-level feature learning. Specifically, a channel-wise attention module is applied to both high-level and low-level feature maps to capture channel-wise global dependencies. In addition, a spatial attention module is applied to low-level feature maps to emphasize effective detailed information. Experimental results show that the performance of our network is superior to state-of-the-art algorithms.
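A channel-wise attention module of the kind described is, in its generic squeeze-and-excitation form, a global pool followed by a small gating network. The sketch below is illustrative, not MultiANet's exact module:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Channel-wise attention over a (C, H, W) feature map: global average
    pool -> two small dense layers -> sigmoid gate -> rescale channels.
    w1: (r, C) reduction weights, w2: (C, r) expansion weights."""
    z = feat.mean(axis=(1, 2))                 # squeeze: one value per channel
    h = np.maximum(w1 @ z, 0.0)                # excitation with ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # per-channel sigmoid gate
    return feat * gate[:, None, None]          # reweight channels

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))
w2 = rng.standard_normal((8, 2))
out = channel_attention(feat, w1, w2)
```

Because the gate lies in (0, 1), attention can only attenuate channels here; a trained module learns which channels to keep near 1.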
Citations: 2
Skeleton-based motion estimation for Point Cloud Compression
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287165
Chao Cao, C. Tulvan, M. Preda, T. Zaharia
With the rapid development of point cloud acquisition technologies, high-quality human-shape point clouds are increasingly used in VR/AR applications and in 3D graphics in general. To achieve near-realistic quality, such content usually contains an extremely high number of points (over 0.5 million points per 3D object per frame) and associated attributes (such as color). For this reason, efficient, dedicated 3D Point Cloud Compression (3DPCC) methods become mandatory. This requirement is even stronger for dynamic content, where the coordinates and attributes of the 3D points evolve over time. In this paper, we propose a novel skeleton-based 3DPCC approach dedicated to the specific case of dynamic point clouds representing humanoid avatars. The method relies on multi-view 2D human pose estimation of 3D dynamic point clouds. Using the DensePose neural network, we first extract the body parts from projected 2D images. The obtained 2D segmentation information is back-projected and aggregated into 3D space. This procedure makes it possible to partition the 3D point cloud into a set of 3D body parts. For each part, a 3D affine transform is estimated between every two consecutive frames and used for 3D motion compensation. The proposed approach has been integrated into the Video-based Point Cloud Compression (V-PCC) test model of MPEG. Experimental results show that, in the particular case of body motion with small amplitudes, the proposed method outperforms the V-PCC test model in the lossy inter-coding condition by up to 83% in terms of bitrate reduction at low bit rates. Meanwhile, the proposed framework holds the potential of supporting various features such as regions of interest and levels of detail.
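The per-part motion model is a 3D affine transform fitted between consecutive frames. Assuming point correspondences are already available (the paper obtains them via the body-part segmentation), the fit reduces to linear least squares:

```python
import numpy as np

def fit_affine_3d(src, dst):
    """Least-squares 3x4 affine transform [R|t] mapping src -> dst,
    both (N, 3) point sets. A sketch of the per-body-part motion
    estimated between two consecutive frames."""
    A = np.hstack([src, np.ones((len(src), 1))])   # homogeneous (N, 4)
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (4, 3) solution
    return M.T                                     # (3, 4)

# synthetic check: recover a known shear + translation from 10 points
rng = np.random.default_rng(1)
src = rng.standard_normal((10, 3))
true = np.array([[1.0, 0.1, 0.0,  0.5],
                 [0.0, 1.0, 0.0, -0.2],
                 [0.0, 0.0, 1.0,  0.3]])
dst = src @ true[:, :3].T + true[:, 3]
M = fit_affine_3d(src, dst)
```

With noise-free correspondences the transform is recovered exactly; in practice the residual after motion compensation is what gets coded.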
Citations: 4
Controlled Feature Adjustment for Image Processing and Synthesis
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287164
Eduardo Martínez-Enríquez, J. Portilla
Feature adjustment, understood as the process of modifying at will the global features of given signals, is of cardinal importance for several signal processing applications, such as enhancement, restoration, style transfer, and synthesis. Despite this, it has not yet been approached from a general, theory-grounded perspective. This work proposes a new conceptual and practical methodology that we term Controlled Feature Adjustment (CFA). Given a set of parametric global features (scalar functions of discrete signals), CFA provides methods for (1) constructing a related set of deterministically decoupled features and (2) adjusting these new features in a controlled way, i.e., each one independently of the others. We illustrate the application of CFA by devising a spectrally-based, hierarchically decoupled feature set and applying it to obtain types of image synthesis that are not achievable using traditional (coupled) feature sets.
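The decoupling idea can be shown on two of the simplest global features, mean and standard deviation: normalizing first makes the two independently adjustable. This is a toy instance, far simpler than CFA's hierarchically decoupled spectral features:

```python
import numpy as np

def adjust_mean_std(x, target_mean, target_std):
    """Set two global features of a signal independently. Normalizing
    neutralizes both features first, so each target is then imposed
    without disturbing the other - a minimal decoupling example."""
    z = (x - x.mean()) / x.std()        # mean -> 0, std -> 1 (decoupled)
    return z * target_std + target_mean # impose each target separately

x = np.array([1.0, 2.0, 3.0, 4.0])
y = adjust_mean_std(x, target_mean=10.0, target_std=2.0)
```

Changing `target_mean` alone leaves the resulting std untouched, and vice versa, which is exactly the controlled-adjustment property CFA generalizes.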
Citations: 3
Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287127
N. Hailu, Ingo Siegert, A. Nürnberger
Training end-to-end automatic speech recognition models requires a large amount of labeled speech data, which is a challenging goal for languages with fewer resources. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set by using different audio codecs at the data level. The augmentation method consists of applying different audio codecs with varied bit rate, sampling rate, and bit depth. These changes introduce variation into the input data without drastically affecting the audio quality. In addition, the audio remains perceptible to humans, and any feature extraction is still possible afterwards. To demonstrate the general applicability of the proposed augmentation technique, we evaluated it in an end-to-end automatic speech recognition architecture in four languages. Applying the method to the Amharic, Dutch, Slovenian, and Turkish datasets, we achieved an average character error rate (CER) improvement of 1.57 without integrating language models, showing CER improvements of 2.78, 1.25, 1.21, and 1.05 for the respective languages relative to the baseline. On the Amharic dataset, we reached a syllable error rate reduction of 6.12 compared to the baseline result.
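Two of the codec-parameter changes, bit depth and sampling rate, can be mimicked directly on raw samples. A real pipeline would round-trip through actual audio codecs, so the sketch below is only a stand-in:

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantize a [-1, 1] float signal to the given bit depth and back,
    mimicking one codec-parameter change used for augmentation."""
    levels = 2 ** (bits - 1)
    return np.round(x * levels) / levels

def naive_resample(x, factor):
    """Crude linear-interpolation downsample by `factor`; a real
    augmentation would use a proper resampler or codec."""
    n_out = int(len(x) / factor)
    t = np.linspace(0, len(x) - 1, n_out)
    return np.interp(t, np.arange(len(x)), x)

# one second of a 440 Hz tone at 16 kHz, plus three augmented variants
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
augmented = [reduce_bit_depth(x, 8), reduce_bit_depth(x, 12), naive_resample(x, 2.0)]
```

Each variant is a slightly different rendering of the same utterance, which is the source of variation the training set gains.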
Citations: 5
A Hybrid Layered Image Compressor with Deep-Learning Technique
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287130
Wei‐Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, H. Hang
This paper presents a detailed description of NCTU’s proposal for learning-based image compression, in response to the JPEG AI Call for Evidence Challenge. The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter aims to refine the quality of the base layer by sending a latent residual signal. In particular, a base-layer-guided attention module is employed to focus the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows rate-distortion performance comparable to single-layer VVC intra in terms of common objective metrics, but presents better subjective quality in some cases, particularly at high compression ratios. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system comprises 18M network parameters in 16-bit floating-point format. On average, encoding an image on an Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. In contrast, decoding is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
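The base-plus-residual layering can be sketched with plain quantizers standing in for the VVC intra codec and the learned residual codec. The paper's synthesizer combines the layers non-linearly, whereas this sketch simply adds them:

```python
import numpy as np

def layered_codec(img, base_step=16.0, resid_step=4.0):
    """Two-layer coding sketch: a coarsely quantized base layer plus a
    more finely quantized residual enhancement layer, summed at the
    decoder. Step sizes are illustrative, not from the paper."""
    base = np.round(img / base_step) * base_step            # base-layer recon
    residual = np.round((img - base) / resid_step) * resid_step
    return base, base + residual                            # base, enhanced

img = np.arange(0, 64, dtype=float).reshape(8, 8)
base, enhanced = layered_codec(img)
```

The enhancement layer shrinks the worst-case reconstruction error from half the base step to half the residual step, which is the quality refinement the latent residual provides in the real system.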
Citations: 9
Mesh Coding Extensions to MPEG-I V-PCC
Pub Date : 2020-09-21 DOI: 10.1109/MMSP48831.2020.9287057
Esmaeil Faramarzi, R. Joshi, M. Budagavi
Dynamic point clouds and meshes are used in a wide variety of applications such as gaming, visualization, medicine, and more recently AR/VR/MR. This paper presents two extensions of the MPEG-I Video-based Point Cloud Compression (V-PCC) standard to support mesh coding. The extensions are based on the Edgebreaker and TFAN mesh connectivity coding algorithms, as implemented in the Google Draco software and the MPEG SC3DMC mesh coding software, respectively. Lossless results for the proposed frameworks on top of version 8.0 of the MPEG-I V-PCC test model (TMC2) are presented and compared with Draco for dense meshes.
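As a minimal illustration of the connectivity that Edgebreaker/TFAN-style coders traverse (the real algorithms live in Draco and SC3DMC; everything below is a toy sketch, not their API): an indexed triangle list stores connectivity as vertex-index triplets, and a connectivity coder walks the mesh from triangle to triangle across shared edges, so the first step is an edge-to-incident-triangle map.

```python
from collections import defaultdict

# A tiny shared-vertex triangle mesh: connectivity as index triplets.
# Two triangles forming a quad: 0-1-2 and 0-2-3.
triangles = [(0, 1, 2), (0, 2, 3)]

# Map each undirected edge to the triangles incident on it.
edge_to_tris = defaultdict(list)
for t, (a, b, c) in enumerate(triangles):
    for u, v in ((a, b), (b, c), (c, a)):
        edge_to_tris[frozenset((u, v))].append(t)

# Interior (shared) edges are the ones a traversal can cross.
shared = [e for e, ts in edge_to_tris.items() if len(ts) == 2]
print(sorted(sorted(e) for e in shared))  # [[0, 2]] -- the diagonal edge
```

Crossing shared edges lets the coder emit roughly one symbol per triangle instead of three raw vertex indices, which is the source of the compression gain these schemes bring to V-PCC.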
{"title":"Mesh Coding Extensions to MPEG-I V-PCC","authors":"Esmaeil Faramarzi, R. Joshi, M. Budagavi","doi":"10.1109/MMSP48831.2020.9287057","DOIUrl":"https://doi.org/10.1109/MMSP48831.2020.9287057","url":null,"abstract":"Dynamic point clouds and meshes are used in a wide variety of applications such as gaming, visualization, medicine, and more recently AR/VR/MR. This paper presents two extensions of MPEG-I Video-based Point Cloud Compression (V-PCC) standard to support mesh coding. The extensions are based on Edgebreaker and TFAN mesh connectivity coding algorithms implemented in the Google Draco software and the MPEG SC3DMC software for mesh coding, respectively. Lossless results for the proposed frameworks on top of version 8.0 of the MPEG-I V-PCC test model (TMC2) are presented and compared with Draco for dense meshes.","PeriodicalId":188283,"journal":{"name":"2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130184671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7