Pub Date : 2024-03-26 DOI: 10.1109/TBC.2024.3374078
Zhijian Hao;Heming Sun;Guohao Xu;Jiaming Liu;Xiankui Xiong;Xuanpeng Zhu;Xiaoyang Zeng;Yibo Fan
As a fundamental component of video coding, transform coding concentrates the energy scattered in the spatial domain onto the upper-left region of the frequency domain. This concentration contributes significantly to Rate-Distortion performance improvement when combined with quantization and entropy coding. To better adapt to the dynamic characteristics of image content, Alliance for Open Media Video 1 (AV1) introduces multiple transform kernels, which brings substantial coding performance benefits, albeit at the cost of considerable computational complexity. In this paper, we propose a fast transform kernel selection algorithm for AV1 based on frequency matching and a probability model to effectively accelerate the coding process with an acceptable level of performance loss. Firstly, the concept of the Frequency Matching Factor (FMF), based on cosine similarity, is defined for the first time to describe the similarity between the residual block and the primary frequency basis image of a transform kernel. Statistical results demonstrate a clear distribution relationship between FMFs and normalized Rate-Distortion optimization costs (nRDOC). Then, leveraging these distribution characteristics, we establish a Gaussian probability model of nRDOC for each FMF by characterizing the parameters of the model as functions of the FMF, enhancing the model's accuracy and coding performance. Finally, based on the derived models, we design a scalable and hardware-friendly fast selection algorithm to skip non-promising transform kernels. Experimental results show that the performance loss of the proposed fast algorithm is 1.15% when 57.66% of the transform kernels are skipped, saving 20.09% of encoding time, which is superior to other fast algorithms in the literature and competitive with the neural-network-based pruning algorithm in the AV1 reference software.
Title: "Fast Transform Kernel Selection Based on Frequency Matching and Probability Model for AV1," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 693-707.
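To make the frequency-matching idea concrete, here is a minimal numpy sketch that computes a cosine-similarity score between a residual block and the primary (lowest-frequency) basis image of each candidate separable transform, and evaluates a Gaussian model of normalized RD cost whose parameters are treated as functions of that score. The 1-D DCT/ADST basis constructions follow the standard definitions; the parameter functions `mu_fn`/`sigma_fn` and the skip margin are hypothetical placeholders, not the paper's fitted values.

```python
import numpy as np

def dct_basis(n, k):
    """k-th 1-D DCT-II basis vector of length n (normalized)."""
    x = np.arange(n)
    v = np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    return v / np.linalg.norm(v)

def adst_basis(n, k):
    """k-th 1-D ADST (sine) basis vector of length n (normalized)."""
    x = np.arange(n)
    v = np.sin(np.pi * (2 * x + 1) * (k + 1) / (2 * n + 1))
    return v / np.linalg.norm(v)

def primary_basis_image(row_fn, col_fn, n):
    """Lowest-frequency 2-D basis image of a separable (row, column) transform pair."""
    return np.outer(col_fn(n, 0), row_fn(n, 0))

def fmf(residual, basis_image):
    """Frequency Matching Factor: cosine similarity between residual and basis image."""
    r = residual.flatten().astype(np.float64)
    b = basis_image.flatten()
    return float(np.dot(r, b) / (np.linalg.norm(r) * np.linalg.norm(b) + 1e-12))

def mu_fn(f):
    """Hypothetical mean of the Gaussian nRDOC model as a function of the FMF."""
    return 1.0 - 0.4 * abs(f)

def sigma_fn(f):
    """Hypothetical standard deviation of the Gaussian nRDOC model as a function of the FMF."""
    return 0.05 + 0.1 * (1.0 - abs(f))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residual = rng.standard_normal((8, 8))
    kernels = {
        "DCT_DCT": (dct_basis, dct_basis),
        "ADST_DCT": (adst_basis, dct_basis),
        "DCT_ADST": (dct_basis, adst_basis),
        "ADST_ADST": (adst_basis, adst_basis),
    }
    scores = {name: fmf(residual, primary_basis_image(r, c, 8))
              for name, (r, c) in kernels.items()}
    # Skip kernels whose predicted mean cost is clearly worse than the best one (hypothetical margin).
    best = min(mu_fn(f) for f in scores.values())
    keep = [k for k, f in scores.items() if mu_fn(f) <= best + 0.05]
    print(scores, keep)
```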
Pub Date : 2024-03-26 DOI: 10.1109/TBC.2024.3374061
Sungjun Ahn;Bo-Mi Lim;Sunhyoung Kwon;Sungho Jeon;Xianbin Wang;Sung-Ik Park
This paper demonstrates the feasibility of multi-antenna reception to facilitate mobile broadcasting for vehicular receivers. Starting from a dimension analysis estimating the spatial capacity of automobiles, we confirm multi-antenna embedding as a viable solution for vehicular broadcast receivers. Accordingly, a rolling prototype of an ATSC 3.0 multi-antenna diversity receiver (DivRx) is implemented and repeatedly tested on public roads. The field verification tests in this paper aim to evaluate the performance of DivRx in real broadcast environments, represented by an urban single-frequency network (SFN) with high-power transmissions using ultra-high frequencies. To this end, extensive field trials are conducted in an operating ATSC 3.0 network located in the Seoul Metropolitan Area, South Korea. Public on-air services of 1080p and 4K videos are tested, targeting inter-city journeys and trips in urban centers, respectively. The mobile reliability gain of DivRx is empirically evaluated in terms of coverage probability and the field strength required for 95% receivability. The results show that leveraging four antennas can achieve 99% coverage of the intra-city 4K service in the current network, a 65% gain over single-antenna systems. It is also shown that the signal-strength requirement can be reduced by 13 dB or more. In addition to the empirical evaluation, we provide theoretical proofs that align with the observations.
Title: "Diversity Receiver for ATSC 3.0-in-Vehicle: Design and Field Evaluation in Metropolitan SFN," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 367-381.
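To illustrate why additional receive antennas raise coverage probability, the following numpy Monte Carlo sketch compares single-antenna reception against multi-branch selection and maximal-ratio combining over i.i.d. Rayleigh fading. The 15 dB decoding threshold, the mean SNR, and the fading model are illustrative assumptions, not the field-trial configuration reported in the paper.

```python
import numpy as np

def coverage_probability(n_ant, mean_snr_db, threshold_db,
                         trials=200_000, combine="mrc", seed=1):
    """Fraction of fading realizations whose combined SNR clears the decoding threshold."""
    rng = np.random.default_rng(seed)
    mean_snr = 10 ** (mean_snr_db / 10)
    # i.i.d. Rayleigh fading: per-branch SNR is exponentially distributed with the given mean.
    branch_snr = rng.exponential(mean_snr, size=(trials, n_ant))
    if combine == "mrc":          # maximal-ratio combining adds branch SNRs
        combined = branch_snr.sum(axis=1)
    else:                         # selection diversity keeps the best branch
        combined = branch_snr.max(axis=1)
    return float(np.mean(combined >= 10 ** (threshold_db / 10)))

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "antennas:",
              round(coverage_probability(n, mean_snr_db=18, threshold_db=15), 3))
```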
Pub Date : 2024-03-26 DOI: 10.1109/TBC.2024.3374042
Ngai-Wing Kwong;Yui-Lam Chan;Sik-Ho Tsang;Ziyin Huang;Kin-Man Lam
Screen content video (SCV) has drawn far more attention than ever during the COVID-19 period and has evolved from a niche into the mainstream due to the recent proliferation of remote offices, online meetings, shared-screen collaboration, and live game streaming. Therefore, quality assessment for screen content media is in high demand to maintain service quality. Although many practical natural-scene video quality assessment methods have been proposed and have achieved promising results, these methods cannot be applied directly to the screen content video quality assessment (SCVQA) task, since the content characteristics of SCV are substantially different from those of natural-scene video. Besides, only one no-reference SCVQA (NR-SCVQA) method, which requires handcrafted features, has been proposed in the literature. Therefore, we propose the first deep learning approach explicitly designed for NR-SCVQA. First, a multi-channel convolutional neural network (CNN) model is used to extract spatial quality features of pictorial and textual regions separately. Since there is no human-annotated quality score for each screen content frame (SCF), the CNN model is pre-trained in a multi-task self-supervised fashion to extract the spatial quality feature representation of each SCF. Second, we propose a time-distributed CNN-transformer model (TCNNT) to further process all SCF spatial quality feature representations of an SCV and learn spatial and temporal features simultaneously, so that high-level spatiotemporal features of the SCV can be extracted and used to assess the quality of the whole SCV. Experimental results demonstrate the robustness and validity of our model, whose predictions correlate closely with human perception.
Title: "Deep Learning Approach for No-Reference Screen Content Video Quality Assessment," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 555-569.
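The two-stage design described above can be sketched in PyTorch as a time-distributed frame encoder followed by a transformer over the per-frame features. The channel widths, layer counts, pooling choices, and temporal averaging here are illustrative assumptions rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Toy spatial feature extractor applied to every screen content frame (SCF)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                        # x: (B*T, 3, H, W)
        return self.proj(self.backbone(x).flatten(1))

class TimeDistributedCNNTransformer(nn.Module):
    """Per-frame features are extracted independently, then fused temporally by a transformer."""
    def __init__(self, feat_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)        # regress a single quality score

    def forward(self, video):                     # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.encoder(video.flatten(0, 1)).view(b, t, -1)
        fused = self.temporal(feats).mean(dim=1)  # average over the temporal dimension
        return self.head(fused).squeeze(-1)

if __name__ == "__main__":
    model = TimeDistributedCNNTransformer()
    clip = torch.randn(2, 8, 3, 64, 64)           # 2 clips of 8 frames each
    print(model(clip).shape)                      # torch.Size([2])
```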
Pub Date : 2024-03-22 DOI: 10.1109/TBC.2024.3394289
Desheng Chen;Jiabao Wen;Huiao Dai;Meng Xi;Shuai Xiao;Jiachen Yang
The Maritime Internet of Things (MIoT) consists of offshore equipment such as ships, consoles, and base stations, which are used for maritime information sharing to assist driving decision-making. However, with the increase in the number of MIoT access devices, the risks to information security and data reliability have also significantly increased. In this paper, we describe a maritime Dynamic Ship Federated Information Security Sharing Model (DSF-ISS) for the Maritime Internet of Vessels (MIoV) based on maritime 5G broadcasting technology. The main objective of this study is to solve the problem of isolated maritime information islands under conditions of limited communication between ship nodes. In this model, cooperation among ship nodes is based on the Contract Network Protocol (CNP), which considers the task types and the spatial and temporal distribution of different vessels. We then propose an improved federated learning approach for local dynamic nodes based on maritime 5G broadcasting technology. Moreover, this study designs a proof of membership (PoM) to share local task model information in a global blockchain. The results show that DSF-ISS has a positive effect on maritime transportation operations: it effectively realizes the secure sharing of information and protects the privacy of node data.
Title: "Enhancing Transportation Management in Marine Internet of Vessels: A 5G Broadcasting-Centric Framework Leveraging Federated Learning," IEEE Transactions on Broadcasting, vol. 70, no. 3, pp. 1091-1103.
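A minimal sketch of the federated-averaging step such a scheme relies on: each ship node trains locally and only model parameters, never raw data, are aggregated. The linear model, the weighting by local sample count, and the simulated nodes are illustrative assumptions, not the DSF-ISS protocol itself.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One ship node refines the shared linear model on its private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w

def federated_average(updates, sizes):
    """Aggregate node models, weighted by how much data each node holds."""
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0, 0.5])
    global_w = np.zeros(3)
    # Three ship nodes with private, differently sized datasets.
    features = [rng.standard_normal((n, 3)) for n in (40, 80, 120)]
    nodes = [(X, X @ true_w + 0.05 * rng.standard_normal(len(X))) for X in features]
    for _ in range(20):                      # communication rounds
        updates = [local_update(global_w, X, y) for X, y in nodes]
        global_w = federated_average(updates, [len(y) for _, y in nodes])
    print(np.round(global_w, 2))             # approaches [2.0, -1.0, 0.5]
```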
Pub Date : 2024-03-21 DOI: 10.1109/TBC.2024.3374066
Zhijun Li;Yumei Wang;Yu Liu;Junjie Li;Ping Zhu
360-degree videos, as a type of media that offers highly immersive experiences, often result in significant bandwidth waste because users view only part of each frame. This places a heavy demand on streaming systems to support high-bandwidth requirements. Recently, tile-based streaming systems combined with viewport prediction have become popular as a way to improve bandwidth efficiency. However, since viewport prediction is only reliable in the short term, maintaining a long buffer to avoid rebuffering is challenging. We propose JUST360, a joint utility based two-tier 360-degree video streaming system, in this paper. To improve the accuracy of utility evaluation, a utility model incorporating image quality and prediction accuracy is proposed to evaluate the contribution of each tile, so that a longer buffer and high bandwidth efficiency can coexist. The optimal bitrate allocation strategy is determined by using model predictive control (MPC) to dynamically select tiles according to their characteristics. Experiments show that our method achieves higher PSNR and less rebuffering, outperforming other state-of-the-art methods by 3%-20% in terms of QoE.
Title: "JUST360: Optimizing 360-Degree Video Streaming Systems With Joint Utility," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 468-481.
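A simplified numpy sketch of utility-driven tile bitrate selection: each tile's utility weights its predicted viewing probability by a logarithmic quality term, and bandwidth is allocated greedily by marginal utility per extra bit. The utility form, the greedy allocation standing in for the paper's MPC formulation, and all constants are assumptions for illustration.

```python
import numpy as np

def tile_utility(bitrate, view_prob):
    """Joint utility of one tile: viewing probability times a log quality proxy."""
    return view_prob * np.log1p(bitrate)

def allocate_bitrates(view_probs, ladder, budget):
    """Greedy marginal-utility allocation of per-tile bitrates under a bandwidth budget."""
    n = len(view_probs)
    choice = np.zeros(n, dtype=int)           # index into the bitrate ladder, start at lowest
    spent = ladder[0] * n
    while True:
        best_gain, best_tile = 0.0, -1
        for t in range(n):
            if choice[t] + 1 >= len(ladder):
                continue
            extra = ladder[choice[t] + 1] - ladder[choice[t]]
            if spent + extra > budget:
                continue
            gain = (tile_utility(ladder[choice[t] + 1], view_probs[t])
                    - tile_utility(ladder[choice[t]], view_probs[t])) / extra
            if gain > best_gain:
                best_gain, best_tile = gain, t
        if best_tile < 0:                     # no affordable upgrade improves utility
            break
        spent += ladder[choice[best_tile] + 1] - ladder[choice[best_tile]]
        choice[best_tile] += 1
    return ladder[choice], spent

if __name__ == "__main__":
    ladder = np.array([0.5, 1.0, 2.0, 4.0])   # Mbps per tile
    probs = np.array([0.9, 0.7, 0.3, 0.1, 0.05, 0.05])
    rates, used = allocate_bitrates(probs, ladder, budget=10.0)
    print(rates, round(used, 2))
```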
Pub Date : 2024-03-21 DOI: 10.1109/TBC.2024.3374119
Yumei Wang;Junjie Li;Zhijun Li;Simou Shang;Yu Liu
360-degree videos usually require extremely high bandwidth and low latency for wireless transmission, which hinders their popularity. Researchers have proposed tile-based viewport-adaptive streaming schemes, which rely on accurate viewport prediction and optimal bitrate adaptation to maintain user Quality of Experience (QoE) under a bandwidth-constrained network. However, viewport prediction is error-prone over long horizons, and bitrate adaptation schemes may waste bandwidth resources by failing to consider various aspects of QoE. In this paper, we propose a synergistic temporal-spatial user-aware viewport prediction scheme for optimal adaptive 360-degree video streaming (SPA360) to tackle these challenges. We use a user-aware viewport prediction mode, which offers a white-box solution for Field of View (FoV) prediction. Specifically, we employ temporal-spatial fusion for enhanced viewport prediction to minimize prediction errors. Our proposed utility prediction model jointly considers the viewport probability distribution and metrics that directly affect QoE to enable more precise bitrate adaptation. To optimize bitrate adaptation for tile-based 360-degree video streaming, the problem is formulated as a packet knapsack problem and solved efficiently with a dynamic programming-based algorithm to maximize utility. The SPA360 scheme demonstrates improved performance in terms of both viewport prediction accuracy and bandwidth utilization, and our approach enhances the overall quality and efficiency of adaptive 360-degree video streaming.
Title: "Synergistic Temporal-Spatial User-Aware Viewport Prediction for Optimal Adaptive 360-Degree Video Streaming," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 453-467.
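The bitrate-adaptation step is cast as a knapsack problem solved by dynamic programming; below is a minimal sketch of the grouped variant, where exactly one bitrate level is chosen per tile, each level has a utility and an integer bit cost, and the total cost must stay within the budget. The utilities, bitrate ladder, and budget discretization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def dp_tile_allocation(utilities, costs, budget):
    """
    Grouped knapsack by dynamic programming.
    utilities[t][l], costs[t][l]: utility and (integer) bit cost of tile t at level l.
    Exactly one level is chosen per tile; total cost must not exceed budget.
    Returns (best total utility, chosen level per tile).
    """
    n_tiles = len(utilities)
    NEG = -1e18
    dp = np.full(budget + 1, NEG)
    dp[0] = 0.0
    choice = np.zeros((n_tiles, budget + 1), dtype=int)
    for t in range(n_tiles):
        new_dp = np.full(budget + 1, NEG)
        for b in range(budget + 1):
            if dp[b] == NEG:
                continue
            for l, (u, c) in enumerate(zip(utilities[t], costs[t])):
                nb = b + c
                if nb <= budget and dp[b] + u > new_dp[nb]:
                    new_dp[nb] = dp[b] + u
                    choice[t, nb] = l
        dp = new_dp
    best_b = int(np.argmax(dp))               # best reachable budget after all tiles
    levels, b = [], best_b
    for t in range(n_tiles - 1, -1, -1):      # backtrack the chosen level per tile
        l = choice[t, b]
        levels.append(l)
        b -= costs[t][l]
    return float(dp[best_b]), levels[::-1]

if __name__ == "__main__":
    # 3 tiles, 3 levels each; costs in 100-kbps units, utilities weighted by viewport probability.
    costs = [[5, 10, 20], [5, 10, 20], [5, 10, 20]]
    utilities = [[p * np.log1p(c) for c in (5, 10, 20)] for p in (0.8, 0.5, 0.1)]
    print(dp_tile_allocation(utilities, costs, budget=35))
```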
Pub Date : 2024-03-21 DOI: 10.1109/TBC.2024.3396696
Guangcheng Wang;Baojin Huang;Ke Gu;Yuchen Liu;Hongyan Liu;Quan Shi;Guangtao Zhai;Wenjun Zhang
The visual quality of 3D-synthesized videos is closely related to the development and broadcasting of immersive media such as free-viewpoint video and six-degrees-of-freedom navigation. Therefore, studying 3D-synthesized video quality assessment helps promote the adoption of immersive media applications. Motivated by the fact that texture compression, depth compression and virtual view synthesis degrade the visual quality of 3D-synthesized videos at the pixel, structure and content levels, this paper proposes a Multi-Level 3D-Synthesized Video Quality Assessment algorithm, namely ML-SVQA, which consists of a quality feature perception module and a quality feature regression module. Specifically, the quality feature perception module first extracts motion vector fields of the 3D-synthesized video at the pixel, structure and content levels by incorporating the perception mechanism of the human visual system. It then measures the temporal flicker distortion intensity in the no-reference setting by computing the self-similarity of adjacent motion vector fields. Finally, the quality feature regression module uses a machine learning algorithm to learn the mapping from the developed quality features to the quality score. Experiments on the public IRCCyN/IVC and SIAT synthesized video datasets show that our ML-SVQA is more effective than state-of-the-art image/video quality assessment methods in evaluating the quality of 3D-synthesized videos.
Title: "No-Reference Multi-Level Video Quality Assessment Metric for 3D-Synthesized Videos," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 584-596.
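A minimal numpy sketch of the temporal-flicker measurement idea: the cosine self-similarity between motion vector fields of adjacent frames is computed, and low similarity is read as stronger flicker. The sketch assumes the motion vector fields are already available from an external motion estimator; the field dimensions and the synthetic smooth/flickering examples are illustrative.

```python
import numpy as np

def field_self_similarity(mv_prev, mv_curr):
    """Cosine similarity between two motion vector fields of shape (H_blocks, W_blocks, 2)."""
    a = mv_prev.reshape(-1).astype(np.float64)
    b = mv_curr.reshape(-1).astype(np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(np.dot(a, b) / denom)

def temporal_flicker_score(mv_fields):
    """Average dissimilarity of adjacent motion vector fields: higher means more flicker."""
    sims = [field_self_similarity(mv_fields[i - 1], mv_fields[i])
            for i in range(1, len(mv_fields))]
    return 1.0 - float(np.mean(sims))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Smooth motion: nearly identical fields; flickering motion: uncorrelated fields.
    smooth = [np.ones((9, 16, 2)) + 0.05 * rng.standard_normal((9, 16, 2)) for _ in range(10)]
    flicker = [rng.standard_normal((9, 16, 2)) for _ in range(10)]
    print(round(temporal_flicker_score(smooth), 3), round(temporal_flicker_score(flicker), 3))
```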
Pub Date : 2024-03-21 DOI: 10.1109/TBC.2024.3394291
Qiang Zhu;Feiyu Chen;Yu Liu;Shuyuan Zhu;Bing Zeng
Compressed video super-resolution (VSR) is employed to generate high-resolution (HR) videos from low-resolution (LR) compressed videos. Recently, some compressed VSR methods have adopted coding priors, such as partition maps, compressed residual frames, predictive pictures and motion vectors, to generate HR videos. However, these methods do not design their modules according to the specific characteristics of the coding information, which limits how effectively the coding priors can be exploited. In this paper, we propose a deep compressed VSR network that effectively introduces coding priors to construct high-quality HR videos. Specifically, we design a partition-guided feature extraction module to extract features from the LR video under the guidance of the partition average image. Moreover, we separate the video features into sparse features and dense features according to the energy distribution of the compressed residual frame to achieve feature enhancement. Additionally, we construct a temporal attention-based feature fusion module that uses motion vectors and predictive pictures to eliminate motion errors between frames and to fuse features temporally. Based on these modules, the coding priors are effectively employed in our model for constructing high-quality HR videos. The experimental results demonstrate that our method achieves better performance and lower complexity than state-of-the-art methods.
Title: "Deep Compressed Video Super-Resolution With Guidance of Coding Priors," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 505-515.
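The sparse/dense separation driven by residual energy can be illustrated with a small numpy sketch: a per-block energy map of the compressed residual frame is thresholded, and feature locations above the threshold are routed to the "sparse" (high-activity) set while the rest go to the "dense" set. The block size, the percentile threshold, and the toy residual are assumptions for illustration, not the paper's module design.

```python
import numpy as np

def block_energy_map(residual, block=8):
    """Mean squared residual energy per non-overlapping block."""
    h, w = residual.shape
    r = residual[: h // block * block, : w // block * block]
    blocks = r.reshape(h // block, block, w // block, block)
    return (blocks.astype(np.float64) ** 2).mean(axis=(1, 3))

def split_sparse_dense(features, residual, block=8, pct=75):
    """
    Route per-block features by residual energy:
    high-energy blocks form the sparse set, the rest the dense set.
    features: (H_blocks, W_blocks, C) feature map aligned with the block grid.
    """
    energy = block_energy_map(residual, block)
    mask = energy > np.percentile(energy, pct)
    return features[mask], features[~mask], mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Compressed residuals are mostly zero with a few high-energy regions.
    residual = rng.standard_normal((64, 64)) * (rng.random((64, 64)) > 0.9)
    features = rng.standard_normal((8, 8, 16))
    sparse_f, dense_f, mask = split_sparse_dense(features, residual)
    print(sparse_f.shape, dense_f.shape, int(mask.sum()))
```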
Pub Date : 2024-03-21 DOI: 10.1109/TBC.2024.3374122
Axi Niu;Trung X. Pham;Kang Zhang;Jinqiu Sun;Yu Zhu;Qingsen Yan;In So Kweon;Yanning Zhang
Diffusion models have gained significant popularity for image-to-image translation tasks. Previous efforts applying diffusion models to image super-resolution have demonstrated that iteratively refining pure Gaussian noise with a U-Net trained on denoising at various noise levels can yield satisfactory high-resolution images from low-resolution inputs. However, this iterative refinement process suffers from low inference speed, which strongly limits its applications. To speed up inference and further enhance performance, our research revisits diffusion models for image super-resolution and proposes a straightforward yet effective diffusion-based super-resolution method called ACDMSR (accelerated conditional diffusion model for image super-resolution). Specifically, we adopt existing image super-resolution methods and fine-tune them to provide conditioning images from the given low-resolution images, which helps achieve better high-resolution results than simply using the low-resolution images as conditions. We then adapt the diffusion model to perform super-resolution through a deterministic iterative denoising process, which substantially reduces inference time. We demonstrate that our method surpasses previous attempts in both qualitative and quantitative results through extensive experiments on benchmark datasets such as Set5, Set14, Urban100, BSD100, and Manga109. Moreover, our approach generates more visually realistic counterparts for low-resolution images, emphasizing its effectiveness in practical scenarios.
Title: "ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution," IEEE Transactions on Broadcasting, vol. 70, no. 2, pp. 492-504.
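A schematic numpy sketch of a deterministic (DDIM-style) refinement loop that conditions on a pre-upscaled image: at each step the current estimate and the conditioning image are passed to a noise predictor and the estimate is updated without injecting fresh noise. The `predict_noise` function is a hypothetical placeholder for the trained conditional U-Net, and the 50-step linear beta schedule is an illustrative assumption.

```python
import numpy as np

def make_alpha_bar(steps=50):
    """Cumulative alpha schedule from linear betas, as in standard diffusion setups."""
    betas = np.linspace(1e-4, 0.02, steps)
    return np.cumprod(1.0 - betas)

def predict_noise(x_t, condition, t):
    """Placeholder for the trained conditional noise-prediction network."""
    return 0.1 * (x_t - condition)            # toy rule: pull the estimate toward the condition

def ddim_super_resolve(condition, steps=50, seed=0):
    """Deterministic DDIM-style refinement starting from pure Gaussian noise."""
    alpha_bar = make_alpha_bar(steps)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(condition.shape)
    for t in range(steps - 1, -1, -1):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = predict_noise(x, condition, t)
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)   # predicted clean image
        # Deterministic update (eta = 0): no random noise is re-injected between steps.
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps
    return x

if __name__ == "__main__":
    condition = np.ones((32, 32))              # stands in for the pre-upscaled conditioning image
    out = ddim_super_resolve(condition)
    print(out.shape, round(float(out.mean()), 3))
```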
Pub Date : 2024-03-17 DOI: 10.1109/TBC.2024.3396698
Jiaoyang Yin;Hao Chen;Yiling Xu;Zhan Ma;Xiaozhong Xu
The adaptive bitrate (ABR) algorithm plays a crucial role in ensuring satisfactory quality of experience (QoE) in video streaming applications. Most existing approaches, whether rule-based or learning-driven, tend to make ABR decisions based on limited network statistics, e.g., the mean or standard deviation of recent throughput measurements. However, they all lack a good understanding of network dynamics given that network conditions vary over time, leading to compromised performance, especially when the network condition changes significantly. In this paper, we propose a framework named ANT that aims to enhance adaptive video streaming by accurately learning network dynamics. ANT represents and detects specific network conditions by characterizing the entire spectrum of network fluctuations. It further trains multiple dedicated ABR models, one for each condition, using deep reinforcement learning. During inference, a dynamic switching mechanism activates the appropriate ABR model based on real-time sensing of the network condition, enabling ANT to automatically adjust its control policies to different network conditions. Extensive experimental results demonstrate that ANT improves user QoE by 20.8%-41.2% in the video-on-demand scenario and by 67.4%-134.5% in the live-streaming scenario compared to state-of-the-art methods, across a wide range of network conditions.
Title: "Learning Accurate Network Dynamics for Enhanced Adaptive Video Streaming," IEEE Transactions on Broadcasting, vol. 70, no. 3, pp. 808-821.
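A minimal sketch of the sense-then-switch control idea: recent throughput samples are summarized into a small feature vector, matched to the nearest pre-defined network-condition centroid, and the policy dedicated to that condition is activated. The centroids, the simple rate-based policies, and the feature set are illustrative assumptions; ANT itself uses learned condition representations and deep reinforcement learning models.

```python
import numpy as np

def condition_features(throughput):
    """Summarize a recent throughput window (Mbps) into fluctuation statistics."""
    t = np.asarray(throughput, dtype=float)
    return np.array([t.mean(), t.std(), np.abs(np.diff(t)).mean()])

# Hypothetical condition centroids: (mean, std, mean absolute change), in Mbps.
CENTROIDS = {
    "stable":   np.array([8.0, 0.5, 0.3]),
    "volatile": np.array([8.0, 3.0, 2.5]),
    "low":      np.array([2.0, 0.8, 0.6]),
}

# One dedicated policy per condition; here trivial rate-based rules stand in for the RL models.
POLICIES = {
    "stable":   lambda feats, ladder: ladder[ladder <= feats[0]].max(initial=ladder[0]),
    "volatile": lambda feats, ladder: ladder[ladder <= feats[0] - feats[1]].max(initial=ladder[0]),
    "low":      lambda feats, ladder: ladder[0],
}

def select_bitrate(throughput_window, ladder):
    """Detect the network condition, then let its dedicated policy pick the next bitrate."""
    feats = condition_features(throughput_window)
    condition = min(CENTROIDS, key=lambda c: np.linalg.norm(feats - CENTROIDS[c]))
    return condition, POLICIES[condition](feats, np.asarray(ladder))

if __name__ == "__main__":
    ladder = [1.0, 2.5, 5.0, 8.0, 16.0]
    print(select_bitrate([8.1, 7.9, 8.3, 8.0, 7.8], ladder))     # steady throughput
    print(select_bitrate([12.0, 3.0, 10.0, 4.0, 11.0], ladder))  # highly fluctuating throughput
```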