Open-Source RTP Library for High-Speed 4K HEVC Video Streaming
Aaro Altonen, Joni Räsänen, Jaakko Laitinen, Marko Viitanen, Jarno Vanne
2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287162
Efficient transport technologies for High Efficiency Video Coding (HEVC) are key enablers for economical 4K video transmission in current telecommunication networks. This paper introduces a novel open-source Real-time Transport Protocol (RTP) library called uvgRTP for high-speed 4K HEVC video streaming. Our library supports the latest RFC 3550 specification for RTP and the associated RFC 7798 RTP payload format for HEVC. It is written in C++ under a permissive 2-clause BSD license, offers a user-friendly interface, and runs on both Linux and Windows. Our experiments on an Intel Core i7-4770 CPU show that uvgRTP is able to stream HEVC video at 5.0 Gb/s over a local 10 Gb/s network. It attains 4.4 times the peak goodput and 92.1% lower latency than the state-of-the-art FFmpeg multimedia framework. It also outperforms LIVE555 with over double the goodput and 82.3% lower latency. These results indicate that uvgRTP is currently the fastest open-source RTP library for 4K HEVC video streaming.
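The RFC 3550 fixed header that any compliant RTP implementation, uvgRTP included, must emit is only 12 bytes. A minimal sketch of packing it (illustrative Python, not uvgRTP's C++ API; field layout follows RFC 3550, Section 5.1):

```python
import struct

def make_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    """Pack the 12-byte RTP fixed header of RFC 3550, Section 5.1."""
    version = 2
    first_byte = version << 6                 # padding, extension, CSRC count all 0
    second_byte = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", first_byte, second_byte,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# A dynamic payload type (96) is typical for HEVC per RFC 7798.
header = make_rtp_header(payload_type=96, seq=1, timestamp=90000, ssrc=0x4D2)
```

RFC 7798 then defines how HEVC NAL units are placed (and, for large frames, fragmented) into the payload that follows this header.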
DEMI: Deep Video Quality Estimation Model using Perceptual Video Quality Dimensions
Saman Zadtootaghaj, Nabajeet Barman, Rakesh Rao Ramachandra Rao, Steve Göring, M. Martini, A. Raake, S. Möller
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287080
Existing works in the field of quality assessment focus separately on gaming and non-gaming content. Alongside traditional modeling approaches, deep learning-based approaches have been used to develop quality models because of their high prediction accuracy. In this paper, we present a deep learning-based quality estimation model that considers both gaming and non-gaming videos. The model is developed in three phases. First, a convolutional neural network (CNN) is trained on an objective metric, which allows the CNN to learn video artifacts such as blurriness and blockiness. Next, the model is fine-tuned on a small image quality dataset using blockiness and blurriness ratings. Finally, a Random Forest pools the frame-level predictions and temporal information of the videos to predict the overall video quality. The lightweight, low-complexity nature of the model makes it suitable for real-time applications covering both gaming and non-gaming content, while achieving performance similar to the existing state-of-the-art model NDNetGaming. The model implementation for testing is available on GitHub.
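The third phase, pooling frame-level scores for a Random Forest, can be illustrated with a simple temporal-statistics vector. This is a hedged sketch: the statistics chosen here are assumptions, not the paper's actual pooling features.

```python
import numpy as np

def pool_frame_scores(frame_scores):
    """Aggregate per-frame quality predictions into a fixed-length feature
    vector (temporal pooling) that a Random Forest regressor could consume."""
    s = np.asarray(frame_scores, dtype=float)
    return np.array([s.mean(), s.std(), s.min(), s.max(),
                     np.percentile(s, 10), np.percentile(s, 90)])

feats = pool_frame_scores([3.9, 4.1, 4.0, 3.2, 4.3])
```

One such vector per video, paired with a subjective score, is what a tree ensemble would be trained on.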
Multi-Plane Image Video Compression
Scott Janus, J. Boyce, S. Bhatia, J. Tanner, Atul Divekar, Penne Lee
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287083
The Multiplane Image (MPI) is a new approach for storing volumetric content. An MPI represents a 3D scene within a view frustum with typically 32 planes of texture and transparency information per camera. MPI literature to date has focused on still images, but applying MPI to video will require substantial compression to be viable for real-world productions. In this paper, we describe several techniques for compressing MPI video sequences by reducing the pixel rate while maintaining acceptable visual quality. We focus on traditional video codecs such as HEVC. While a new codec algorithm specifically tailored to MPI would likely achieve very good results, no devices exist today that support such a hypothetical MPI codec. By comparison, hundreds of millions of real-time HEVC decoders are present in laptops and TVs today.
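An MPI is turned back into a 2D view by compositing its planes back to front with the standard "over" operator; a small numpy sketch of that rendering step (the paper's compression operates on the per-plane texture and alpha data this consumes):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Render an MPI to a single view: back-to-front 'over' compositing.
    colors: (P, H, W, 3) plane textures, alphas: (P, H, W) transparencies;
    plane 0 is the farthest from the camera."""
    out = np.zeros(colors.shape[1:], dtype=float)
    for color, alpha in zip(colors, alphas):
        out = color * alpha[..., None] + out * (1.0 - alpha[..., None])
    return out
```

With 32 such planes per camera per frame, the pixel rate this loop implies is exactly the quantity the paper's techniques try to reduce before handing the data to HEVC.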
Deep Learning for Individual Listening Zone
Giovanni Pepe, L. Gabrielli, S. Squartini, L. Cattani, Carlo Tripodi
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287161
A recent trend in car audio systems is the generation of Individual Listening Zones (ILZ), which improve phone-call privacy and reduce disturbance to other passengers without requiring headphones or earpieces. This is generally achieved with loudspeaker arrays. In this paper, we describe an approach that achieves ILZ using general-purpose car loudspeakers and carefully designed Finite Impulse Response (FIR) filters. We propose a deep neural network approach for designing the filter coefficients so as to obtain a so-called bright zone, where the signal is clearly heard, and a dark zone, where the signal is attenuated. Additionally, the frequency response in the bright zone is constrained to be as flat as possible. Numerical experiments were performed on impulse responses measured with either one binaural pair or three binaural pairs per passenger. The results in terms of attenuation and flatness prove the viability of the approach.
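Bright/dark-zone separation of this kind is commonly summarized as acoustic contrast: the ratio of mean squared sound pressure in the bright zone to that in the dark zone, in dB. A sketch of that evaluation metric (the paper reports attenuation and flatness; this exact definition is an assumption):

```python
import numpy as np

def acoustic_contrast_db(p_bright, p_dark):
    """Acoustic contrast: ratio of mean squared sound pressure measured
    in the bright zone to that in the dark zone, expressed in dB."""
    energy_bright = np.mean(np.abs(p_bright) ** 2)
    energy_dark = np.mean(np.abs(p_dark) ** 2)
    return 10.0 * np.log10(energy_bright / energy_dark)

# A dark zone with ten times lower pressure amplitude yields 20 dB contrast.
contrast = acoustic_contrast_db([1.0, 1.0], [0.1, 0.1])
```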
MultiANet: a Multi-Attention Network for Defocus Blur Detection
Zeyu Jiang, Xun Xu, Chao Zhang, Ce Zhu
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287072
Defocus blur detection is a challenging task because of obscure homogeneous regions and interference from background clutter. Most existing deep learning-based methods focus on building wider or deeper networks to capture multi-level features, neglecting the feature relationships of intermediate layers and thus hindering the discriminative ability of the network. Moreover, fusing features at different levels has been demonstrated to be effective. However, integrating them directly without distinction is not optimal, because low-level features capture fine details only and can be distracted by background clutter. To address these issues, we propose the Multi-Attention Network for stronger discriminative learning and spatially guided low-level feature learning. Specifically, a channel-wise attention module is applied to both high-level and low-level feature maps to capture channel-wise global dependencies, and a spatial attention module is applied to the low-level feature maps to emphasize effective detailed information. Experimental results show that the performance of our network is superior to state-of-the-art algorithms.
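A channel-wise attention module, in its simplest squeeze-and-excitation form, pools each channel to a scalar, squashes it into a (0, 1) gate, and rescales the map. This numpy sketch omits the learned layers a network like MultiANet would actually contain, so it shows the mechanism, not the paper's module:

```python
import numpy as np

def channel_attention(feat):
    """Minimal channel-wise attention on a (C, H, W) feature map:
    global average pool per channel, sigmoid gate, rescale."""
    pooled = feat.mean(axis=(1, 2))           # squeeze: one scalar per channel
    gates = 1.0 / (1.0 + np.exp(-pooled))     # excitation (no learned FC here)
    return feat * gates[:, None, None]        # reweight channels
```

A spatial attention module is the transposed idea: pool across channels to an (H, W) gate map that reweights locations instead.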
Skeleton-based motion estimation for Point Cloud Compression
Chao Cao, C. Tulvan, M. Preda, T. Zaharia
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287165
With the rapid development of point cloud acquisition technologies, high-quality human-shape point clouds are increasingly used in VR/AR applications and in 3D graphics in general. To achieve near-realistic quality, such content usually contains an extremely high number of points (over 0.5 million points per 3D object per frame) and associated attributes (such as color). Efficient, dedicated 3D Point Cloud Compression (3DPCC) methods are therefore mandatory, all the more so for dynamic content, where the coordinates and attributes of the 3D points evolve over time. In this paper, we propose a novel skeleton-based 3DPCC approach dedicated to the specific case of dynamic point clouds representing humanoid avatars. The method relies on multi-view 2D human pose estimation of 3D dynamic point clouds. Using the DensePose neural network, we first extract the body parts from projected 2D images. The obtained 2D segmentation information is back-projected and aggregated into 3D space, which makes it possible to partition the 3D point cloud into a set of 3D body parts. For each part, a 3D affine transform is estimated between every two consecutive frames and used for 3D motion compensation. The proposed approach has been integrated into the Video-based Point Cloud Compression (V-PCC) test model of MPEG. Experimental results show that, for body motion with small amplitudes, the proposed method outperforms the V-PCC test model under the lossy inter-coding condition by up to 83% in terms of bitrate reduction at low bit rates. The proposed framework also holds the potential to support features such as regions of interest and levels of detail.
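Estimating a per-part 3D affine transform between consecutive frames reduces, once point correspondences are available, to a linear least-squares fit. A numpy sketch (correspondences are assumed given; the paper obtains the per-part grouping via its body-part segmentation):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 3D affine transform mapping src to dst:
    dst ~ src @ A.T + t, with src and dst of shape (N, 3)."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])   # homogeneous coordinates
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)        # solve X @ M ~ dst
    return M[:3].T, M[3]                               # A: (3, 3), t: (3,)

# Recover a known motion from synthetic correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
A_true = np.array([[1.0, 0.1, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 1.1]])
t_true = np.array([0.5, -1.0, 2.0])
A_est, t_est = fit_affine(src, src @ A_true.T + t_true)
```

Motion compensation then warps the previous frame's part with (A, t) and encodes only the prediction residual.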
Controlled Feature Adjustment for Image Processing and Synthesis
Eduardo Martínez-Enríquez, J. Portilla
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287164
Feature adjustment, understood as the process of modifying at will global features of given signals, is of cardinal importance for several signal processing applications, such as enhancement, restoration, style transfer, and synthesis. Despite this, it has not yet been approached from a general, theory-grounded perspective. This work proposes a new conceptual and practical methodology that we term Controlled Feature Adjustment (CFA). Given a set of parametric global features (scalar functions of discrete signals), CFA provides methods for (1) constructing a related set of deterministically decoupled features, and (2) adjusting these new features in a controlled way, i.e., each one independently of the others. We illustrate the application of CFA by devising a spectrally-based, hierarchically decoupled feature set and applying it to obtain types of image synthesis that are not achievable with traditional (coupled) feature sets.
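What "deterministically decoupled" buys can be seen on the simplest feature pair, mean and standard deviation: each can be set to a target without disturbing the other. This toy is far simpler than the paper's spectral, hierarchically decoupled feature sets and is only meant to illustrate the notion:

```python
import numpy as np

def adjust_mean_std(x, new_mean, new_std):
    """Toy decoupled adjustment: standardize, then impose the targets.
    Setting the mean leaves the std untouched and vice versa, unlike
    adjusting raw moments, which are mutually coupled."""
    z = (x - x.mean()) / x.std()
    return z * new_std + new_mean
```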
Improving Automatic Speech Recognition Utilizing Audio-codecs for Data Augmentation
N. Hailu, Ingo Siegert, A. Nürnberger
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287127
Training end-to-end automatic speech recognition models requires a large amount of labeled speech data, which is challenging to obtain for low-resource languages. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set at the data level by applying different audio codecs with varied bit rates, sampling rates, and bit depths. This ensures variation in the input data without drastically affecting the audio quality: the audio remains perceptible to humans, and any feature extraction is still possible afterwards. To demonstrate the general applicability of the proposed augmentation technique, we evaluated it in an end-to-end automatic speech recognition architecture in four languages. Applying the method to the Amharic, Dutch, Slovenian, and Turkish datasets, we achieved an average character error rate (CER) improvement of 1.57 without integrating language models, with per-language CER improvements over the baseline of 2.78, 1.25, 1.21, and 1.05, respectively. On the Amharic dataset, we reached a syllable error rate reduction of 6.12 compared to the baseline.
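The augmentation operates on the waveform itself rather than on extracted features. A crude numpy stand-in for one codec configuration (the paper pipes audio through real codecs; the plain decimation here also skips the anti-alias filtering a real resampler would apply):

```python
import numpy as np

def augment_audio(x, factor=2, bits=8):
    """Simulate codec-style degradation of a waveform x in [-1, 1]:
    downsample by an integer factor, then requantize to a lower bit depth."""
    y = x[::factor]                          # crude decimation (no filtering)
    levels = 2 ** (bits - 1)
    return np.round(y * levels) / levels     # uniform midtread quantization
```

Each (factor, bits) setting yields a differently degraded copy of the same labeled utterance, which is what enlarges the training set.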
A Hybrid Layered Image Compressor with Deep-Learning Technique
Wei‐Cheng Lee, Chih-Peng Chang, Wen-Hsiao Peng, H. Hang
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287130
This paper presents a detailed description of NCTU's proposal for learning-based image compression, in response to the JPEG AI Call for Evidence Challenge. The proposed compression system features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter refines the quality of the base layer by sending a latent residual signal. In particular, a base-layer-guided attention module focuses the residual extraction on critical high-frequency areas. To reconstruct the image, this latent residual signal is combined with the base-layer output in a non-linear fashion by a neural-network-based synthesizer. The proposed method shows rate-distortion performance comparable to single-layer VVC intra in terms of common objective metrics, but in some cases offers better subjective quality, particularly at high compression ratios. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed system uses 18M network parameters in 16-bit floating-point format. On average, encoding an image on an Intel Xeon Gold 6154 takes about 13.5 minutes, with the VVC base layer dominating the encoding runtime. Decoding, in contrast, is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image.
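The layered idea in miniature: a coarse base layer plus a separately quantized residual that refines it. In this sketch both layers are plain uniform quantizers standing in for VVC intra and the learned residual codec, and the combination is a simple sum rather than the paper's non-linear synthesizer:

```python
import numpy as np

def quantize(x, step):
    """Uniform quantization with the given step size."""
    return np.round(x / step) * step

def layered_codec(image, base_step=16.0, resid_step=4.0):
    """Two-layer coding in miniature: a coarse base reconstruction plus a
    finer-quantized residual that refines it."""
    base = quantize(image, base_step)                # stand-in for VVC intra
    residual = quantize(image - base, resid_step)    # enhancement layer
    return base, base + residual
```

The enhancement layer costs extra bits but tightens the reconstruction error bound from half the base step to half the residual step.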
Mesh Coding Extensions to MPEG-I V-PCC
Esmaeil Faramarzi, R. Joshi, M. Budagavi
Pub Date: 2020-09-21. DOI: 10.1109/MMSP48831.2020.9287057
Dynamic point clouds and meshes are used in a wide variety of applications such as gaming, visualization, medicine, and, more recently, AR/VR/MR. This paper presents two extensions of the MPEG-I Video-based Point Cloud Compression (V-PCC) standard to support mesh coding. The extensions are based on the Edgebreaker and TFAN mesh connectivity coding algorithms, as implemented in the Google Draco software and the MPEG SC3DMC software, respectively. Lossless results for the proposed frameworks on top of version 8.0 of the MPEG-I V-PCC test model (TMC2) are presented and compared with Draco for dense meshes.
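What connectivity coders such as Edgebreaker and TFAN compress is the face index list of a triangle mesh; the redundancy they exploit is visible even on a tetrahedron, where every edge is shared by exactly two faces. A hedged illustration of the data structure only, unrelated to either coder's internals:

```python
from collections import Counter

import numpy as np

# A triangle mesh as mesh compressors see it: vertex positions plus a
# connectivity list of three vertex indices per face (here, a tetrahedron).
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

# Connectivity coders exploit edge sharing: in a closed manifold mesh,
# every undirected edge belongs to exactly two faces, so naming each
# edge once per face wastes half the indices.
edges = Counter(tuple(sorted((f[i], f[(i + 1) % 3])))
                for f in faces for i in range(3))
```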