Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287092
V-PCC Component Synchronization for Point Cloud Reconstruction
D. Graziosi, A. Tabatabai, Vladyslav Zakharchenko, A. Zaghetto
For a V-PCC system to reconstruct a single instance of the point cloud, one V-PCC unit must be transferred to the 3D point cloud reconstruction module. It is, however, required that all V-PCC components, i.e. the occupancy map, geometry, atlas and attribute, be temporally aligned. This could, in principle, pose a challenge, since the temporal structures of the decoded sub-bitstreams are not coherent across V-PCC sub-bitstreams. In this paper we propose an output delay adjustment mechanism for the decoded V-PCC sub-bitstreams that provides synchronized V-PCC component input to the point cloud reconstruction module.
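The alignment problem the abstract describes can be sketched as follows. This is an illustrative model, not the paper's normative mechanism: it assumes one FIFO queue per decoded component and integer output times, and reconstruction may only consume a composition unit once every component has delivered its frame for that time.

```python
from collections import deque

# The four V-PCC components that must be temporally aligned before
# reconstruction (per the abstract above).
COMPONENTS = ("occupancy", "geometry", "atlas", "attribute")

class VpccSynchronizer:
    """Illustrative buffer that delays output until all components align."""

    def __init__(self):
        # one FIFO of (output_time, frame) per component
        self.queues = {c: deque() for c in COMPONENTS}

    def push(self, component, output_time, frame):
        self.queues[component].append((output_time, frame))

    def pop_aligned(self):
        """Return {component: frame} for the earliest output time present in
        *all* queues, or None while some component is still delayed."""
        if any(not q for q in self.queues.values()):
            return None
        # all heads must agree on the output time; discard stale frames
        t = max(q[0][0] for q in self.queues.values())
        for q in self.queues.values():
            while q and q[0][0] < t:
                q.popleft()
        if any(not q or q[0][0] != t for q in self.queues.values()):
            return None
        return {c: self.queues[c].popleft()[1] for c in COMPONENTS}
```

A reconstruction loop would call `pop_aligned()` after each decoded frame arrives and invoke point cloud reconstruction only on a non-None result.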
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287093
Towards Fast and Efficient VVC Encoding
J. Brandenburg, A. Wieckowski, Tobias Hinz, Anastasia Henkel, Valeri George, Ivan Zupancic, C. Stoffers, B. Bross, H. Schwarz, D. Marpe
Versatile Video Coding (VVC) is a new international video coding standard to be finalized in July 2020. It is designed to provide around 50% bit-rate savings at the same subjective visual quality as its predecessor, High Efficiency Video Coding (H.265/HEVC). During standard development, objective bit-rate savings of around 40% were reported for the VVC reference software (VTM) compared to the HEVC reference software (HM). The unoptimized VTM encoder is around 9x, and the decoder around 2x, slower than HM. This paper discusses VVC encoder complexity in terms of software runtime. The modular design of the standard allows a VVC encoder to trade off bit-rate savings against encoder runtime. Based on a detailed tradeoff analysis, results for different operating points are reported. Additionally, initial work on software and algorithm optimization is presented. With the optimized software algorithms, an operating point with an over 22x faster single-threaded encoder runtime than VTM can be achieved, i.e. around 2.5x faster than HM, while still providing more than 30% bit-rate savings over HM. Finally, our experiments demonstrate the flexibility of VVC and its potential for optimized software encoder implementations.
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287167
Joint Cross-Component Linear Model For Chroma Intra Prediction
R. G. Youvalari, J. Lainema
The Cross-Component Linear Model (CCLM) is an intra prediction technique adopted into the upcoming Versatile Video Coding (VVC) standard. CCLM reduces inter-channel redundancy by means of a linear model whose parameters are calculated from the reconstructed samples in the luma channel as well as the neighboring samples of the chroma coding block. In this paper, we propose a new method, called Joint Cross-Component Linear Model (J-CCLM), to improve the prediction efficiency of the tool. The proposed J-CCLM technique predicts the samples of the coding block with a multi-hypothesis approach that combines two intra prediction modes: the final prediction of the block is obtained by combining the conventional CCLM mode with an angular mode derived from the co-located luma block. Experiments conducted in the VTM-8.0 test model of VVC show that the proposed method provides on average more than 1.0% BD-rate gain in the chroma channels. Furthermore, weighted YCbCr bitrate savings of 0.24% and 0.54% are achieved in the 4:2:0 and 4:4:4 color formats, respectively.
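The CCLM idea, and the multi-hypothesis combination J-CCLM adds on top of it, can be sketched as below. This is a hedged simplification: VVC derives the linear parameters from min/max luma neighbors with integer arithmetic, whereas this sketch uses a least-squares fit, and the 50/50 blend weight is an illustrative assumption, not the paper's exact rule.

```python
def cclm_params(neigh_luma, neigh_chroma):
    """Least-squares fit chroma ~= alpha * luma + beta over neighbor samples
    (a simplification of VVC's min/max-based derivation)."""
    n = len(neigh_luma)
    mean_l = sum(neigh_luma) / n
    mean_c = sum(neigh_chroma) / n
    cov = sum((l - mean_l) * (c - mean_c)
              for l, c in zip(neigh_luma, neigh_chroma))
    var = sum((l - mean_l) ** 2 for l in neigh_luma)
    alpha = cov / var if var else 0.0
    return alpha, mean_c - alpha * mean_l

def jcclm_predict(rec_luma_block, angular_pred_block, neigh_luma, neigh_chroma):
    """Multi-hypothesis chroma prediction: blend the CCLM prediction with an
    angular prediction derived from the co-located luma block. Equal weights
    are an illustrative choice here."""
    a, b = cclm_params(neigh_luma, neigh_chroma)
    return [(a * l + b + p) / 2
            for l, p in zip(rec_luma_block, angular_pred_block)]
```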
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287066
The Suitability of Texture Vibrations Based on Visually Perceived Virtual Textures in Bimodal and Trimodal Conditions
U. A. Alma, M. Altinsoy
In this study, the suitability of recorded and simplified texture vibrations is evaluated with respect to visual textures displayed on a screen. The tested vibrations are 1) recorded vibrations, 2) single sinusoids, and 3) band-limited white noise, as used in previous work. In the earlier study, the suitability of texture vibrations was evaluated against real textures by touch. Nevertheless, texture vibrations should also be tested against texture images, considering that users interact only with virtual (visual) objects on touch devices. Thus, the aim of this study is to assess the congruence between vibrotactile feedback and texture images in both the absence and the presence of auditory feedback. Two types of auditory feedback were used for the trimodal test, presented at different loudness levels, so that the most plausible combination of vibrotactile and audio stimuli when exploring visual textures can be determined. Based on the psychophysical tests, the similarity ratings of the texture vibrations were not significantly different from each other in the bimodal condition, in contrast to the earlier study. In the trimodal judgments, synthesized sound influenced the similarity ratings significantly, while touch sound did not affect perceived similarity.
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287145
Scalable Mesh Representation for Depth from Breakpoint-Adaptive Wavelet Coding
Yue Li, R. Mathew, D. Taubman
A highly scalable and compact representation of depth data is required in many applications, and it is especially critical for plenoptic multiview image compression frameworks that use depth information for novel view synthesis and inter-view prediction. Efficiently coding depth data can be difficult, as it contains sharp discontinuities. Breakpoint-adaptive discrete wavelet transforms (BPA-DWT), currently being standardized as part of the JPEG 2000 Part 17 extensions, have been found suitable for coding spatial media with hard discontinuities. In this paper, we explore a modification of the original BPA-DWT that replaces the traditional constant extrapolation strategy with a newly proposed affine extrapolation for reconstructing depth data in the vicinity of discontinuities. We also present a depth reconstruction scheme that can directly decode the BPA-DWT coefficients and breakpoints onto a compact and scalable mesh-based representation, which has many potential benefits over a sample-based description. For performing depth-compensated view prediction, our proposed triangular mesh representation of the depth data is a natural fit for modern graphics architectures.
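The difference between the two extrapolation strategies named above can be shown with a toy 1-D example: when reconstructing depth samples up to a breakpoint (discontinuity), constant extrapolation repeats the boundary sample, while affine extrapolation continues the linear trend of the last samples. The BPA-DWT machinery itself is out of scope here; this only illustrates the extrapolation step.

```python
def extrapolate_constant(samples, n):
    """Repeat the last known sample n times (traditional strategy)."""
    return [samples[-1]] * n

def extrapolate_affine(samples, n):
    """Continue the slope of the last two known samples for n steps
    (the affine strategy proposed in the paper, in toy 1-D form)."""
    slope = samples[-1] - samples[-2]
    return [samples[-1] + slope * (k + 1) for k in range(n)]
```

On a sloped depth ramp such as `[10, 12, 14]`, the affine variant keeps following the ramp, which is why it can reconstruct smoothly varying depth near a discontinuity more faithfully.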
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287136
Video Coding for Machines with Feature-Based Rate-Distortion Optimization
Kristian Fischer, Fabian Brand, Christian Herglotz, A. Kaup
Common state-of-the-art video codecs are optimized to deliver a low bitrate at a certain quality for the final human observer, which is achieved by rate-distortion optimization (RDO). However, with the steady improvement of neural networks solving computer vision tasks, more and more multimedia data is no longer observed by humans but directly analyzed by neural networks. In this paper, we propose a standard-compliant feature-based RDO (FRDO) designed to increase coding performance when the decoded frame is analyzed by a neural network in a video coding for machines scenario. To that end, we replace the pixel-based distortion metrics in the conventional RDO of VTM-8.0 with distortion metrics calculated in the feature space created by the first layers of a neural network. In several tests with the segmentation network Mask R-CNN and single images from the Cityscapes dataset, we compare the proposed FRDO and its hybrid version HFRDO, with different distortion measures in the feature space, against conventional RDO. With HFRDO, up to 5.49% bitrate can be saved compared to the VTM-8.0 implementation in terms of Bjøntegaard Delta rate, using weighted average precision as the quality metric. Additionally, allowing the encoder to vary the quantization parameter results in coding gains for the proposed HFRDO of up to 9.95% compared to conventional VTM.
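The core substitution the abstract describes, replacing the pixel-based distortion in the RDO cost J = D + λ·R with a feature-space distortion, can be sketched as follows. Here `feature_extractor` is an assumption standing in for the first layers of a neural network: any callable mapping a block to a feature vector works for the illustration.

```python
def feature_distortion(orig_block, rec_block, feature_extractor):
    """SSE between feature vectors of the original and reconstructed block,
    in place of conventional pixel-wise SSE."""
    f_o = feature_extractor(orig_block)
    f_r = feature_extractor(rec_block)
    return sum((a - b) ** 2 for a, b in zip(f_o, f_r))

def frdo_select(orig_block, candidates, lam, feature_extractor):
    """candidates: list of (reconstruction, rate_bits) per coding option.
    Returns the index of the candidate minimizing J = D_feature + lambda*R."""
    costs = [feature_distortion(orig_block, rec, feature_extractor) + lam * rate
             for rec, rate in candidates]
    return min(range(len(costs)), key=costs.__getitem__)
```

Note how the choice flips with λ: a small λ favors the expensive but feature-faithful reconstruction, a large λ favors the cheap one, which is exactly the tradeoff RDO navigates.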
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287147
Motion JPEG Decoding via Iterative Thresholding and Motion-Compensated Deflickering
E. Belyaev, Linlin Bie, J. Korhonen
This paper studies the problem of decoding video sequences compressed by Motion JPEG (M-JPEG) at the best possible perceived video quality. We treat decoding of M-JPEG video as signal recovery from incomplete measurements, as known from compressive sensing: all quantized nonzero Discrete Cosine Transform (DCT) coefficients serve as measurements, and the remaining zero coefficients are the data to be recovered. The output video is reconstructed via an iterative thresholding algorithm, where Video Block Matching and 4-D filtering (VBM4D) is used as the thresholding operator. To reduce nonlinearities in the measurements caused by the quantization in JPEG, we propose applying spatio-temporal pre-filtering before measurement calculation and recovery. Since temporal inconsistencies of the residual coding artifacts lead to strong flickering in the recovered video, we also propose applying a motion-compensated deflickering filter as a post-filter. Experimental results show that the proposed approach provides a 0.44–0.51 dB average improvement in Peak Signal to Noise Ratio (PSNR), as well as a lower flickering level, compared to the state-of-the-art method based on Coefficient Graph Laplacians (COGL). We have also conducted a subjective comparison study, indicating that the proposed approach outperforms state-of-the-art methods in terms of subjective video quality.
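The recovery loop has the classic iterative-thresholding shape: denoise the current estimate, then re-impose the known measurements. The sketch below works directly on a coefficient vector and uses a simple soft-threshold as a stand-in for VBM4D (which operates on whole video volumes); it only illustrates the alternation between the thresholding operator and the measurement-projection step.

```python
def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero; stand-in for the VBM4D denoiser."""
    return [0.0 if abs(c) < t else c - t * (1 if c > 0 else -1)
            for c in coeffs]

def recover(measured, iterations=10, t=0.5):
    """measured: list of DCT coefficients, with None where the coefficient
    was quantized to zero and must be recovered."""
    x = [0.0 if c is None else c for c in measured]
    for _ in range(iterations):
        x = soft_threshold(x, t)                 # thresholding operator
        x = [x[i] if c is None else c            # keep known measurements
             for i, c in enumerate(measured)]
    return x
```

In the paper's setting the projection step additionally respects the JPEG quantization intervals rather than exact values, which is where the proposed pre-filtering helps.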
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287053
Few-Shot Object Detection in Real Life: Case Study on Auto-Harvest
Kévin Riou, Jingwen Zhu, Suiyi Ling, Mathis Piquet, V. Truffault, P. Callet
Confinement during COVID-19 has had serious effects on agriculture all over the world. As one efficient solution, mechanical harvesting (auto-harvest) based on object detection and robotic harvesters has become an urgent need. Within an auto-harvest system, a robust few-shot object detection model is one of the bottlenecks, since the system must handle new vegetable/fruit categories, and collecting large-scale annotated datasets for all novel categories is expensive. Many few-shot object detection models have been developed by the community, yet whether they can be employed directly in real-life agricultural applications is still questionable, as there is a context gap between the commonly used training datasets and images collected in real-life agricultural scenarios. To this end, in this study we present a novel cucumber dataset and propose two data augmentation strategies that help bridge the context gap. Experimental results show that 1) the state-of-the-art few-shot object detection model performs poorly on the novel ‘cucumber’ category; and 2) the proposed augmentation strategies outperform commonly used ones.
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287117
Evaluating the Performance of Apple’s Low-Latency HLS
Kerem Durak, Mehmet N. Akcay, Yigit K. Erinc, Boran Pekel, A. Begen
At its annual developers conference in June 2019, Apple announced a backwards-compatible extension to its popular HTTP Live Streaming (HLS) protocol to enable low-latency live streaming. This extension offers new features such as the ability to generate partial segments, use playlist delta updates, block playlist reloads and provide rendition reports. Compared to traditional HLS, these features require new capabilities on origin servers and on the caches inside a content delivery network. While HLS is known to perform well at scale, its low-latency extension is likely to consume considerable server and network resources, which may raise concerns about its scalability. In this paper, we make a first attempt to understand how this new extension works and performs. We also provide a 1:1 comparison against the low-latency DASH approach, the competing low-latency solution developed as an open standard.
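The features listed above all surface as tags in the media playlist. The fragment below is an illustrative Low-Latency HLS playlist excerpt (segment names and values are made up); `EXT-X-PART` carries the partial segments, `EXT-X-SERVER-CONTROL` advertises blocking reload and delta-update support, and `EXT-X-RENDITION-REPORT` is the rendition report:

```
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:4
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0,CAN-SKIP-UNTIL=24.0
#EXT-X-PART-INF:PART-TARGET=0.333
#EXT-X-MEDIA-SEQUENCE:266
#EXTINF:4.0,
segment266.mp4
#EXT-X-PART:DURATION=0.333,URI="segment267.part1.mp4",INDEPENDENT=YES
#EXT-X-PART:DURATION=0.333,URI="segment267.part2.mp4"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="segment267.part3.mp4"
#EXT-X-RENDITION-REPORT:URI="../1M/playlist.m3u8",LAST-MSN=266,LAST-PART=1
```

It is the short, frequent requests for such parts and blocked playlist reloads that create the extra server and CDN load the paper measures.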
Pub Date: 2020-09-21  DOI: 10.1109/MMSP48831.2020.9287123
ABR prediction using supervised learning algorithms
Hiba Yousef, J. L. Feuvre, Alexandre Storelli
With the massive increase of video traffic over the internet, HTTP adaptive streaming has become the main technique for infotainment content delivery. In this context, many bandwidth adaptation algorithms have emerged, each aiming to improve the user QoE using different session information, e.g. TCP throughput, buffer occupancy or download time. Notwithstanding differences in their implementation, they mostly use the same inputs to adapt to the varying conditions of the media session. In this paper, we show that it is possible to predict the bitrate decision of any ABR algorithm using machine learning techniques, and supervised classification in particular. This approach has the benefit of being generic: it does not require any knowledge of the player’s ABR algorithm itself, but assumes that, whatever the logic behind it, it uses a common set of input features. Then, using machine-learning feature selection, it is possible to identify the relevant features and train the model on real observations. We test our approach in simulations of well-known ABR algorithms, then verify the results on commercial closed-source players using different realistic VoD and live data sets. The results show that both Random Forest and Gradient Boosting achieve very high prediction accuracy compared to other ML classifiers.
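The prediction task itself, learning to imitate a player's bitrate decision from observed session features, can be sketched as below. A 1-nearest-neighbour rule stands in for the Random Forest and Gradient Boosting classifiers used in the paper, and the feature set (throughput in Mbps, buffer level in seconds, last chosen bitrate) is an illustrative assumption, not the paper's selected feature list.

```python
def predict_bitrate(train_features, train_labels, query):
    """1-NN stand-in classifier: train_features is a list of feature tuples
    observed during a session, train_labels the bitrate the player actually
    picked for each; returns the predicted bitrate for a new observation."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(train_features)),
               key=lambda i: dist(train_features[i], query))
    return train_labels[best]
```

Because the model only sees inputs and decisions, the same harness applies to a closed-source player: log (features, chosen bitrate) pairs during playback, train, then compare predictions against the player's next decisions.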