A Reconfigurable Framework for Neural Network-based Video In-loop Filtering
Authors: Yichi Zhang, Dandan Ding, Zhan Ma, Zhu Li
Published: 2024-01-16, ACM Transactions on Multimedia Computing, Communications, and Applications
DOI: 10.1145/3640467 (https://doi.org/10.1145/3640467)
This paper proposes a reconfigurable framework for neural network-based video in-loop filtering to guide large-scale models for content-aware processing. Specifically, the backbone neural model is decomposed into several convolutional groups, and the encoder systematically traverses all candidate configurations formed by combining these groups to find the best one. The selected configuration index is then encapsulated as side information and passed to the decoder, enabling dynamic model reconfiguration during the decoding stage. This reconfiguration process is deployed only in the inference stage, on top of a pre-trained backbone model. Furthermore, we devise a Wavelet Multi-scale Poolformer (WMSPFormer) as the backbone network structure. WMSPFormer uses a wavelet-based multi-scale structure to losslessly decompose the input into multiple scales for spatial-spectral feature aggregation. Moreover, it substitutes Multi-scale Pooling operations (MSPoolformer) for the complicated matrix calculations of the attention process. We also extend MSPoolformer to a large-scale version using more parameters, referred to as MSPoolformerExt. Extensive experiments demonstrate that the proposed WMSPFormer+Reconfig. and WMSPFormerExt+Reconfig. achieve BD-Rate reductions of 7.13% and 7.92%, respectively, over the H.266/VVC anchor, outperforming most existing methods evaluated under the same training and testing conditions. In addition, the low complexity of the WMSPFormer series makes it attractive for practical applications.
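The encoder-side search and decoder-side reconfiguration described in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the groups are plain Python callables on 1-D signals (standing in for convolutional groups of a pre-trained model), a configuration is an on/off choice per group, and the encoder selects by mean squared error against the original. The names `apply_config`, `encoder_search`, and `decoder_filter` are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Signal = List[float]
Group = Callable[[Signal], Signal]
Config = Tuple[int, ...]


def all_configs(num_groups: int) -> List[Config]:
    """Enumerate every on/off combination of the groups (assumed config space)."""
    return list(product((0, 1), repeat=num_groups))


def apply_config(groups: Sequence[Group], config: Config, recon: Signal) -> Signal:
    """Run the reconstructed signal through the enabled groups, in order."""
    out = list(recon)
    for group, enabled in zip(groups, config):
        if enabled:
            out = group(out)
    return out


def mse(a: Signal, b: Signal) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encoder_search(groups: Sequence[Group], recon: Signal, original: Signal) -> int:
    """Encoder: try every configuration and return the index of the best one.

    The index is what would be sent to the decoder as side information.
    MSE is an assumed selection criterion.
    """
    configs = all_configs(len(groups))
    return min(
        range(len(configs)),
        key=lambda i: mse(apply_config(groups, configs[i], recon), original),
    )


def decoder_filter(groups: Sequence[Group], index: int, recon: Signal) -> Signal:
    """Decoder: rebuild the same configuration from the signalled index."""
    return apply_config(groups, all_configs(len(groups))[index], recon)


# Two toy "groups": a constant offset and a 3-tap smoothing filter.
def add_half(s: Signal) -> Signal:
    return [x + 0.5 for x in s]


def smooth(s: Signal) -> Signal:
    padded = [s[0]] + list(s) + [s[-1]]  # replicate edges
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3 for i in range(1, len(s) + 1)]


groups = [add_half, smooth]
original = [1.0, 2.0, 3.0, 4.0]
recon = [0.5, 1.5, 2.5, 3.5]  # reconstruction with a constant bias

best_index = encoder_search(groups, recon, original)
filtered = decoder_filter(groups, best_index, recon)
print(best_index, filtered)
```

In this toy case the search enables only the offset group, because smoothing would distort an already correct ramp. Since the unfiltered configuration is always among the candidates, the selected configuration can never be worse than no filtering under the chosen criterion.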
About the journal:
The ACM Transactions on Multimedia Computing, Communications, and Applications is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It is soliciting paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome.
TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The Transactions consists primarily of research papers. As an archival journal, it is intended to carry papers of lasting importance and value. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.