Pub Date: 2024-08-13, DOI: 10.1007/s00530-024-01448-z
Multimodal recommender system based on multi-channel counterfactual learning networks
Hong Fang, Leiyuxin Sha, Jindong Liang
Most multimodal recommender systems use the multimodal content of user-interacted items as supplemental information to capture user preferences from historical interactions, without considering user-uninteracted items. In contrast, multimodal recommender systems based on counterfactual learning from causal inference exploit the causal difference between the multimodal content of user-interacted and user-uninteracted items to purify the content related to user preferences. However, existing methods adopt a unified multimodal channel that treats each modality equally, so they cannot distinguish users' tastes for different modalities, and the differences in users' attention to and perception of different modalities' content are not reflected. To address this issue, this paper proposes a novel recommender system based on multi-channel counterfactual learning (MCCL) networks to capture fine-grained user preferences on different modalities. First, two independent channels are established for the image and text modalities, based on their corresponding features, for modality-specific feature extraction. Then, leveraging the counterfactual theory of causal inference, features in each channel that are unrelated to user preferences are eliminated using the features of user-uninteracted items, features related to user preferences are enhanced, and multimodal user preferences are modeled at the content level, which portrays users' tastes for the different modalities of items. Finally, semantic entities are extracted to model semantic-level multimodal user preferences, which are fused with historical user interaction information and content-level user preferences for recommendation. Extensive experiments on three different datasets show that our results improve NDCG by up to 4.17% over the best-performing baseline model.
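The abstract gives no implementation details, but the two-channel counterfactual purification it describes can be sketched roughly as below, assuming mean-pooled image/text item features and PyTorch; every module and parameter name here is our own illustration, not the authors' MCCL code.

```python
import torch
import torch.nn as nn

class CounterfactualChannel(nn.Module):
    """One modality channel: purify preference-related content by contrasting
    features of interacted items against uninteracted (counterfactual) items."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.gate = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, interacted, uninteracted):
        # interacted / uninteracted: (batch, feat_dim) mean-pooled item features
        pos = self.encoder(interacted)
        neg = self.encoder(uninteracted)
        diff = pos - neg                      # causal difference between the two groups
        return pos * self.gate(diff)          # enhance preference-related dimensions

class TwoChannelPreference(nn.Module):
    """Independent image and text channels fused into a content-level user preference."""
    def __init__(self, img_dim, txt_dim, hidden_dim):
        super().__init__()
        self.image_channel = CounterfactualChannel(img_dim, hidden_dim)
        self.text_channel = CounterfactualChannel(txt_dim, hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, img_pos, img_neg, txt_pos, txt_neg):
        z_img = self.image_channel(img_pos, img_neg)
        z_txt = self.text_channel(txt_pos, txt_neg)
        return self.fuse(torch.cat([z_img, z_txt], dim=-1))
```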
{"title":"Multimodal recommender system based on multi-channel counterfactual learning networks","authors":"Hong Fang, Leiyuxin Sha, Jindong Liang","doi":"10.1007/s00530-024-01448-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01448-z","url":null,"abstract":"<p>Most multimodal recommender systems utilize multimodal content of user-interacted items as supplemental information to capture user preferences based on historical interactions without considering user-uninteracted items. In contrast, multimodal recommender systems based on causal inference counterfactual learning utilize the causal difference between the multimodal content of user-interacted and user-uninteracted items to purify the content related to user preferences. However, existing methods adopt a unified multimodal channel, which treats each modality equally, resulting in the inability to distinguish users’ tastes for different modalities. Therefore, the differences in users’ attention and perception of different modalities' content cannot be reflected. To cope with the above issue, this paper proposes a novel recommender system based on multi-channel counterfactual learning (MCCL) networks to capture user fine-grained preferences on different modalities. First, two independent channels are established based on the corresponding features for the content of image and text modalities for modality-specific feature extraction. Then, leveraging the counterfactual theory of causal inference, features in each channel unrelated to user preferences are eliminated using the features of the user-uninteracted items. Features related to user preferences are enhanced and multimodal user preferences are modeled at the content level, which portrays the users' taste for the different modalities of items. Finally, semantic entities are extracted to model semantic-level multimodal user preferences, which are fused with historical user interaction information and content-level user preferences for recommendation. Extensive experiments on three different datasets show that our results improve up to 4.17% on NDCG compared to the optimal model.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"16 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-13, DOI: 10.1007/s00530-024-01451-4
Exploring multi-level transformers with feature frame padding network for 3D human pose estimation
Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo
Recently, transformer-based architectures have achieved remarkable performance in 2D-to-3D lifting pose estimation. Despite these advances, they still struggle with depth ambiguity, limited temporal information, missing edge-frame details, and short-term temporal features. Consequently, transformer architectures encounter difficulties in precisely estimating the 3D human position. To address these problems, we propose Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). We first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively compensates for the missing edge-frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, aiming to improve the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively address the depth ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.
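The exact padding scheme of the frame-padding network is not specified in the abstract; the following is a minimal replicate-padding sketch in PyTorch showing how edge frames of a 2D keypoint sequence can be given full temporal context before lifting. The clip size, joint count, and pad length are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pad_pose_sequence(seq_2d, pad):
    """Replicate-pad a 2D keypoint sequence at both ends so that edge frames
    receive a full temporal receptive field before 2D-to-3D lifting.

    seq_2d: (T, J, 2) tensor of T frames, J joints, (x, y) coordinates.
    pad:    number of frames to add on each side.
    """
    # F.pad with mode="replicate" expects (N, C, T), so fold joints/coords into channels
    t, j, c = seq_2d.shape
    x = seq_2d.reshape(t, j * c).t().unsqueeze(0)     # (1, J*2, T)
    x = F.pad(x, (pad, pad), mode="replicate")        # (1, J*2, T + 2*pad)
    return x.squeeze(0).t().reshape(t + 2 * pad, j, c)

# Example: pad a 27-frame clip of 17 joints by 13 frames on each side
clip = torch.randn(27, 17, 2)
padded = pad_pose_sequence(clip, pad=13)
print(padded.shape)                                   # torch.Size([53, 17, 2])
```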
{"title":"Exploring multi-level transformers with feature frame padding network for 3D human pose estimation","authors":"Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo","doi":"10.1007/s00530-024-01451-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01451-4","url":null,"abstract":"<p>Recently, transformer-based architecture achieved remarkable performance in 2D to 3D lifting pose estimation. Despite advancements in transformer-based architecture they still struggle to handle depth ambiguity, limited temporal information, lacking edge frame details, and short-term temporal features. Consequently, transformer architecture encounters challenges in preciously estimating the 3D human position. To address these problems, we proposed Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). To do this, we first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively address the lacking edge frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, which aims to improve the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively addresses the depth ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-13, DOI: 10.1007/s00530-024-01423-8
Propagating prior information with transformer for robust visual object tracking
Yue Wu, Chengtao Cai, Chai Kiat Yeo
In recent years, the domain of visual object tracking has witnessed considerable advancements with the advent of deep learning methodologies. Siamese-based trackers have been pivotal, establishing a new architecture with a weight-shared backbone. With the inclusion of the transformer, the attention mechanism has been exploited to enhance feature discriminability across successive frames. However, the limited adaptability of many existing trackers to different tracking scenarios leads to inaccurate target localization. To effectively solve this issue, in this paper we integrate a Siamese network with a transformer: the former uses ResNet50 as the backbone network to extract target features, while the latter consists of an encoder and a decoder, where the encoder effectively exploits global contextual information to obtain discriminative features. Simultaneously, we employ the decoder to propagate prior information related to the target, which enables the tracker to successfully locate the target in a variety of environments and enhances its stability and robustness. Extensive experiments on four major public datasets, OTB100, UAV123, GOT10k and LaSOText, demonstrate the effectiveness of the proposed method, whose performance surpasses many state-of-the-art trackers. Additionally, the proposed tracker achieves a tracking speed of 60 fps, meeting the requirements for real-time tracking.
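As a rough illustration of the described pipeline (shared ResNet50 backbone, transformer encoder over template and search features, decoder that carries a target prior across frames), here is a hedged PyTorch sketch; the layer sizes, box head, and prior handling are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseTransformerTracker(nn.Module):
    """Sketch: weight-shared ResNet50 backbone, transformer encoder over template
    and search features, decoder that propagates a target prior embedding."""
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # C5 feature map
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.head = nn.Linear(d_model, 4)      # (cx, cy, w, h) box regression

    def extract(self, img):
        f = self.proj(self.backbone(img))       # (B, d, H, W)
        return f.flatten(2).transpose(1, 2)     # (B, H*W, d) tokens

    def forward(self, template, search, prior):
        # prior: (B, 1, d) target embedding carried over from previous frames
        memory = self.encoder(torch.cat([self.extract(template),
                                         self.extract(search)], dim=1))
        target = self.decoder(prior, memory)     # propagate prior information
        return self.head(target.squeeze(1)), target   # box + updated prior
```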
{"title":"Propagating prior information with transformer for robust visual object tracking","authors":"Yue Wu, Chengtao Cai, Chai Kiat Yeo","doi":"10.1007/s00530-024-01423-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01423-8","url":null,"abstract":"<p>In recent years, the domain of visual object tracking has witnessed considerable advancements with the advent of deep learning methodologies. Siamese-based trackers have been pivotal, establishing a new architecture with a weight-shared backbone. With the inclusion of the transformer, attention mechanism has been exploited to enhance the feature discriminability across successive frames. However, the limited adaptability of many existing trackers to the different tracking scenarios has led to inaccurate target localization. To effectively solve this issue, in this paper, we have integrated a siamese network with the transformer, where the former utilizes ResNet50 as the backbone network to extract the target features, while the latter consists of an encoder and a decoder, where the encoder can effectively utilize global contextual information to obtain the discriminative features. Simultaneously, we employ the decoder to propagate prior information related to the target, which enables the tracker to successfully locate the target in a variety of environments, enhancing the stability and robustness of the tracker. Extensive experiments on four major public datasets, OTB100, UAV123, GOT10k and LaSOText demonstrate the effectiveness of the proposed method. Its performance surpasses many state-of-the-art trackers. Additionally, the proposed tracker can achieve a tracking speed of 60 fps, meeting the requirements for real-time tracking.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"8 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-12, DOI: 10.1007/s00530-024-01419-4
Multi-level pyramid fusion for efficient stereo matching
Jiaqi Zhu, Bin Li, Xinhua Zhao
Stereo matching is a key technology for many autonomous driving and robotics applications. Recently, methods based on Convolutional Neural Networks have achieved substantial progress. However, it is still difficult to find accurate matching points in inherently ill-posed regions such as areas with weak texture and reflective surfaces. In this paper, we propose a multi-level pyramid fusion volume method (MPFV-Stereo) that contains two prominent components: a multi-scale cost volume (MSCV) and a multi-level cost volume (MLCV). We also design a low-parameter Gaussian attention module to excite the cost volume. MPFV-Stereo ranks 2nd on KITTI 2012 (Reflective) among all published methods. In addition, MPFV-Stereo achieves competitive results on both the Scene Flow and KITTI datasets and requires less training to achieve strong cross-dataset generalization on the Middlebury and ETH3D benchmarks.
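The abstract does not detail how the cost volumes are built, but a standard correlation cost volume computed at several pyramid scales, which the MSCV/MLCV components presumably build upon, can be sketched as follows; the fusion step shown is a simple concatenation stand-in, not the paper's fusion module.

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Correlation cost volume: for each candidate disparity d, correlate left
    features with right features shifted by d pixels.

    left_feat, right_feat: (B, C, H, W) feature maps at one pyramid level.
    Returns: (B, max_disp, H, W) cost volume.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            volume[:, d, :, d:] = (left_feat[..., d:] * right_feat[..., :-d]).mean(dim=1)
    return volume

# Build volumes at three pyramid levels, upsample, and concatenate as a naive fusion
feats_l = [torch.randn(1, 32, 64 // s, 128 // s) for s in (1, 2, 4)]
feats_r = [torch.randn(1, 32, 64 // s, 128 // s) for s in (1, 2, 4)]
volumes = [correlation_cost_volume(l, r, max_disp=48 // s)
           for s, l, r in zip((1, 2, 4), feats_l, feats_r)]
upsampled = [F.interpolate(v, size=(64, 128), mode="bilinear", align_corners=False)
             for v in volumes]
fused = torch.cat(upsampled, dim=1)    # (1, 84, 64, 128); the paper's fusion is more elaborate
```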
{"title":"Multi-level pyramid fusion for efficient stereo matching","authors":"Jiaqi Zhu, Bin Li, Xinhua Zhao","doi":"10.1007/s00530-024-01419-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01419-4","url":null,"abstract":"<p>Stereo matching is a key technology for many autonomous driving and robotics applications. Recently, methods based on Convolutional Neural Network have achieved huge progress. However, it is still difficult to find accurate matching points in inherently ill-posed regions such as areas with weak texture and reflective surfaces. In this paper, we propose a multi-level pyramid fusion volume (MPFV-Stereo) which contains two prominent components: multi-scale cost volume (MSCV) and multi-level cost volume (MLCV). We also design a low-parameter Gaussian attention module to excite cost volume. Our MPFV-Stereo ranks 2nd on KITTI 2012 (Reflective) among all published methods. In addition, MPFV-Stereo has competitive results on both Scene Flow and KITTI datasets and requires less training to achieve strong cross-dataset generalization on Middlebury and ETH3D benchmark.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"56 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-12, DOI: 10.1007/s00530-024-01432-7
Underwater image enhancement based on weighted guided filter image fusion
Dan Xiang, Huihua Wang, Zebin Zhou, Hao Zhao, Pan Gao, Jinwen Zhang, Chun Shan
An underwater image enhancement technique based on weighted guided filter image fusion is proposed to address challenges including optical absorption and scattering, color distortion, and uneven illumination. The method consists of three stages: color correction, local contrast enhancement, and fusion. For color correction, basic correction is achieved through channel compensation and remapping, with saturation adjusted based on the histogram distribution to enhance visual richness. For local contrast enhancement, the approach combines box filtering with a variational model to improve image saturation. Finally, the method uses weighted guided filter image fusion to produce high visual quality underwater images. Our method outperforms eight state-of-the-art algorithms on no-reference metrics, demonstrating its effectiveness and innovation. We also conducted application tests and runtime comparisons to further validate the practicality of our approach.
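As a reference point for the fusion stage, below is a sketch of the classic box-filter guided filter used to smooth fusion weights between a color-corrected and a contrast-enhanced version of the image; the weight map and the weighting strategy are generic placeholders, not the paper's weighted guided filter.

```python
import cv2
import numpy as np

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Classic guided filter built from box filters; used here to smooth fusion weights."""
    mean_I = cv2.boxFilter(guide, cv2.CV_64F, (radius, radius))
    mean_p = cv2.boxFilter(src, cv2.CV_64F, (radius, radius))
    corr_Ip = cv2.boxFilter(guide * src, cv2.CV_64F, (radius, radius))
    corr_II = cv2.boxFilter(guide * guide, cv2.CV_64F, (radius, radius))
    var_I = corr_II - mean_I * mean_I
    cov_Ip = corr_Ip - mean_I * mean_p
    a = cov_Ip / (var_I + eps)
    b = mean_p - a * mean_I
    mean_a = cv2.boxFilter(a, cv2.CV_64F, (radius, radius))
    mean_b = cv2.boxFilter(b, cv2.CV_64F, (radius, radius))
    return mean_a * guide + mean_b

def fuse(color_corrected, contrast_enhanced, weight):
    """Weighted fusion of two enhanced versions of the same underwater frame.
    weight: hypothetical per-pixel weight map in [0, 1] (e.g., from contrast cues)."""
    gray = cv2.cvtColor(color_corrected, cv2.COLOR_BGR2GRAY).astype(np.float64) / 255.0
    w = guided_filter(gray, weight.astype(np.float64))    # edge-aware smoothing of weights
    w = np.clip(w, 0.0, 1.0)[..., None]
    return (w * color_corrected + (1.0 - w) * contrast_enhanced).astype(np.uint8)
```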
{"title":"Underwater image enhancement based on weighted guided filter image fusion","authors":"Dan Xiang, Huihua Wang, Zebin Zhou, Hao Zhao, Pan Gao, Jinwen Zhang, Chun Shan","doi":"10.1007/s00530-024-01432-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01432-7","url":null,"abstract":"<p>An underwater image enhancement technique based on weighted guided filter image fusion is proposed to address challenges, including optical absorption and scattering, color distortion, and uneven illumination. The method consists of three stages: color correction, local contrast enhancement, and fusion algorithm methods. In terms of color correction, basic correction is achieved through channel compensation and remapping, with saturation adjusted based on histogram distribution to enhance visual richness. For local contrast enhancement, the approach involves box filtering and a variational model to improve image saturation. Finally, the method utilizes weighted guided filter image fusion to achieve high visual quality underwater images. Additionally, our method outperforms eight state-of-the-art algorithms in no-reference metrics, demonstrating its effectiveness and innovation. We also conducted application tests and time comparisons to further validate the practicality of our approach.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"43 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-07, DOI: 10.1007/s00530-024-01439-0
Discrete codebook collaborating with transformer for thangka image inpainting
Jinxian Bai, Yao Fan, Zhiwei Zhao
Thangka, as a precious heritage of painting art, holds irreplaceable research value due to its richness in Tibetan history, religious beliefs, and folk culture. However, it is susceptible to partial damage and form distortion due to natural erosion or inadequate conservation measures. Given the complexity of textures and rich semantics in thangka images, existing image inpainting methods struggle to recover their original artistic style and intricate details. In this paper, we propose a novel approach that combines discrete codebook learning with a transformer for image inpainting, tailored specifically for thangka images. In the codebook learning stage, we design an improved network framework based on vector quantization (VQ) codebooks to discretely encode intermediate features of input images, yielding a context-rich discrete codebook. The second phase introduces a parallel transformer module based on a cross-shaped window, which efficiently predicts the index combinations for missing regions under limited computational cost. Furthermore, we devise a multi-scale feature guidance module that progressively fuses features from intact areas with textural features from the codebook, thereby enhancing the preservation of local details in non-damaged regions. We validate the efficacy of our method through qualitative and quantitative experiments on datasets including CelebA-HQ, Places2, and a custom thangka dataset. Experimental results demonstrate that, compared to previous methods, our approach reconstructs images with more complete structural information and clearer textural details.
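A minimal sketch of the vector-quantization step described for the codebook learning stage is given below, assuming nearest-neighbor lookup with a straight-through estimator in PyTorch; the codebook size and dimensions are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ codebook: map each encoder feature to its nearest codebook entry,
    yielding discrete indices a transformer can later predict for missing regions."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (B, C, H, W) encoder features
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)          # (B*H*W, C)
        dist = torch.cdist(flat, self.codebook.weight)       # distance to every code
        idx = dist.argmin(dim=1)                             # discrete token per position
        quantized = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # straight-through estimator keeps gradients flowing to the encoder
        quantized = z + (quantized - z).detach()
        return quantized, idx.view(b, h, w)
```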
{"title":"Discrete codebook collaborating with transformer for thangka image inpainting","authors":"Jinxian Bai, Yao Fan, Zhiwei Zhao","doi":"10.1007/s00530-024-01439-0","DOIUrl":"https://doi.org/10.1007/s00530-024-01439-0","url":null,"abstract":"<p>Thangka, as a precious heritage of painting art, holds irreplaceable research value due to its richness in Tibetan history, religious beliefs, and folk culture. However, it is susceptible to partial damage and form distortion due to natural erosion or inadequate conservation measures. Given the complexity of textures and rich semantics in thangka images, existing image inpainting methods struggle to recover their original artistic style and intricate details. In this paper, we propose a novel approach combining discrete codebook learning with a transformer for image inpainting, tailored specifically for thangka images. In the codebook learning stage, we design an improved network framework based on vector quantization (VQ) codebooks to discretely encode intermediate features of input images, yielding a context-rich discrete codebook. The second phase introduces a parallel transformer module based on a cross-shaped window, which efficiently predicts the index combinations for missing regions under limited computational cost. Furthermore, we devise a multi-scale feature guidance module that progressively fuses features from intact areas with textural features from the codebook, thereby enhancing the preservation of local details in non-damaged regions. We validate the efficacy of our method through qualitative and quantitative experiments on datasets including Celeba-HQ, Places2, and a custom thangka dataset. Experimental results demonstrate that compared to previous methods, our approach successfully reconstructs images with more complete structural information and clearer textural details.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"167 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s00530-024-01428-3
A deep low-rank semantic factorization method for micro-video multi-label classification
Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu
As a prominent manifestation of user-generated content (UGC), micro-video has emerged as a pivotal medium for individuals to document and disseminate their daily experiences. In particular, micro-videos generally encompass abundant content elements that are abstractly described by a group of annotated labels. However, previous methods primarily focus on the discriminability of explicit labels while neglecting corresponding implicit semantics, which are particularly relevant for diverse micro-video characteristics. To address this problem, we develop a deep low-rank semantic factorization (DLRSF) method to perform multi-label classification of micro-videos. Specifically, we introduce a semantic embedding matrix to bridge explicit labels and implicit semantics, and further present a low-rank-regularized semantic learning module to explore the intrinsic lowest-rank semantic attributes. A correlation-driven deep semantic interaction module is designed within a deep factorization framework to enhance interactions among instance features, explicit labels and semantic embeddings. Additionally, inverse covariance analysis is employed to unveil underlying correlation structures between labels and features, thereby making the semantic embeddings more discriminative and improving model generalization ability simultaneously. Extensive experimental results on three available datasets have showcased the superiority of our DLRSF compared with the state-of-the-art methods.
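To make the low-rank idea concrete, here is a hedged PyTorch sketch of a label head that routes instance features through a semantic embedding matrix and adds a nuclear-norm penalty as a low-rank regularizer; the dimensions and loss weighting are our assumptions, and the paper's correlation-driven interaction and inverse covariance analysis are not reproduced.

```python
import torch
import torch.nn as nn

class LowRankSemanticHead(nn.Module):
    """Sketch of the low-rank idea: predict labels through a semantic embedding
    matrix and penalize its nuclear norm so implicit semantics stay low-rank."""
    def __init__(self, feat_dim, num_labels, sem_dim=64, rank_weight=1e-3):
        super().__init__()
        self.to_semantic = nn.Linear(feat_dim, sem_dim)      # instance -> semantic space
        self.semantic_embed = nn.Parameter(torch.randn(num_labels, sem_dim) * 0.01)
        self.rank_weight = rank_weight

    def forward(self, feats, targets=None):
        # feats: (B, feat_dim) micro-video features; targets: (B, num_labels) multi-hot
        logits = self.to_semantic(feats) @ self.semantic_embed.t()
        if targets is None:
            return logits
        bce = nn.functional.binary_cross_entropy_with_logits(logits, targets)
        nuclear = torch.linalg.matrix_norm(self.semantic_embed, ord="nuc")
        return logits, bce + self.rank_weight * nuclear
```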
{"title":"A deep low-rank semantic factorization method for micro-video multi-label classification","authors":"Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu","doi":"10.1007/s00530-024-01428-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01428-3","url":null,"abstract":"<p>As a prominent manifestation of user-generated content (UGC), micro-video has emerged as a pivotal medium for individuals to document and disseminate their daily experiences. In particular, micro-videos generally encompass abundant content elements that are abstractly described by a group of annotated labels. However, previous methods primarily focus on the discriminability of explicit labels while neglecting corresponding implicit semantics, which are particularly relevant for diverse micro-video characteristics. To address this problem, we develop a deep low-rank semantic factorization (DLRSF) method to perform multi-label classification of micro-videos. Specifically, we introduce a semantic embedding matrix to bridge explicit labels and implicit semantics, and further present a low-rank-regularized semantic learning module to explore the intrinsic lowest-rank semantic attributes. A correlation-driven deep semantic interaction module is designed within a deep factorization framework to enhance interactions among instance features, explicit labels and semantic embeddings. Additionally, inverse covariance analysis is employed to unveil underlying correlation structures between labels and features, thereby making the semantic embeddings more discriminative and improving model generalization ability simultaneously. Extensive experimental results on three available datasets have showcased the superiority of our DLRSF compared with the state-of-the-art methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"72 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s00530-024-01436-3
Self-supervised learning for fine-grained monocular 3D face reconstruction in the wild
Dongjin Huang, Yongsheng Shi, Jinhua Liu, Wen Tang
Reconstructing a 3D face from monocular images is a challenging computer vision task, due to the limitations of traditional 3DMMs (3D Morphable Models) and the lack of high-fidelity 3D facial scanning data. To solve this issue, we propose a novel coarse-to-fine self-supervised learning framework for reconstructing fine-grained 3D faces from monocular images in the wild. In the coarse stage, face parameters extracted from a single image are used to reconstruct a coarse 3D face through a 3DMM. In the refinement stage, we design a wavelet transform perception model to extract facial details in different frequency domains from an input image. Furthermore, we propose a depth displacement module based on the wavelet transform perception model to generate a refined displacement map from the unwrapped UV textures of the input image and the rendered coarse face, which can be used to synthesize detailed 3D face geometry. Moreover, we propose a novel albedo map module based on the wavelet transform perception model to capture high-frequency texture information and generate a detailed albedo map consistent with face illumination. The detailed face geometry and albedo map are used to reconstruct a fine-grained 3D face without any labeled data. We have conducted extensive experiments that demonstrate the superiority of our method over existing state-of-the-art approaches for 3D face reconstruction on four public datasets: CelebA, LS3D, LFW, and the NoW benchmark. The experimental results indicate that our method achieves higher accuracy and robustness, particularly under challenging conditions such as occlusion, large poses, and varying illumination.
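The wavelet transform perception model is not specified beyond the abstract; the sketch below only illustrates the underlying idea of splitting a face crop into low- and high-frequency bands with a 2D DWT (using PyWavelets), where the high-frequency bands would carry the detail cues used by the displacement and albedo modules.

```python
import numpy as np
import pywt

def wavelet_bands(gray_face):
    """Split a grayscale face crop into a low-frequency approximation and a combined
    high-frequency detail map with one level of the 2D discrete wavelet transform."""
    low, (horizontal, vertical, diagonal) = pywt.dwt2(gray_face.astype(np.float64), "haar")
    detail = np.sqrt(horizontal**2 + vertical**2 + diagonal**2)   # high-frequency energy
    return low, detail

face = np.random.rand(256, 256)       # stand-in for a normalized grayscale face crop
low, detail = wavelet_bands(face)
print(low.shape, detail.shape)        # (128, 128) (128, 128)
```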
{"title":"Self-supervised learning for fine-grained monocular 3D face reconstruction in the wild","authors":"Dongjin Huang, Yongsheng Shi, Jinhua Liu, Wen Tang","doi":"10.1007/s00530-024-01436-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01436-3","url":null,"abstract":"<p>Reconstructing 3D face from monocular images is a challenging computer vision task, due to the limitations of traditional 3DMM (3D Morphable Model) and the lack of high-fidelity 3D facial scanning data. To solve this issue, we propose a novel coarse-to-fine self-supervised learning framework for reconstructing fine-grained 3D faces from monocular images in the wild. In the coarse stage, face parameters extracted from a single image are used to reconstruct a coarse 3D face through a 3DMM. In the refinement stage, we design a wavelet transform perception model to extract facial details in different frequency domains from an input image. Furthermore, we propose a depth displacement module based on the wavelet transform perception model to generate a refined displacement map from the unwrapped UV textures of the input image and rendered coarse face, which can be used to synthesize detailed 3D face geometry. Moreover, we propose a novel albedo map module based on the wavelet transform perception model to capture high-frequency texture information and generate a detailed albedo map consistent with face illumination. The detailed face geometry and albedo map are used to reconstruct a fine-grained 3D face without any labeled data. We have conducted extensive experiments that demonstrate the superiority of our method over existing state-of-the-art approaches for 3D face reconstruction on four public datasets including CelebA, LS3D, LFW, and NoW benchmark. The experimental results indicate that our method achieved higher accuracy and robustness, particularly of under the challenging conditions such as occlusion, large poses, and varying illuminations.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"23 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-05, DOI: 10.1007/s00530-024-01434-5
Modeling the non-uniform retinal perception for viewport-dependent streaming of immersive video
Peiyao Guo, Wenjing Su, Xu Zhang, Hao Chen, Zhan Ma
Viewport-dependent streaming (VDS) of immersive video typically compresses the attended viewport (or FoV, field of view) at high quality and the content outside it at low quality to reduce bandwidth. It, however, assumes uniform compression within the viewport, completely neglecting the visual redundancy caused by non-uniform perception across the central and peripheral vision areas when the content is consumed on a head-mounted display (HMD). Our work models this unequal retinal perception within the instantaneous viewport and explores using it in a VDS system for non-uniform viewport compression to further reduce the data volume. To this end, we assess the just-noticeable-distortion moment of the rendered viewport frame by carefully adapting image-quality-related compression factors, such as the quantization stepsize q and/or the spatial resolution s, zone by zone, to explicitly derive the imperceptible quality-perception threshold with respect to the eccentric angle. Independent validations show that the visual perception of immersive images with non-uniform FoV quality guided by our model is indistinguishable from that of images with the default uniform FoV quality. Our model can be flexibly integrated with the tiling strategy in popular video codecs to facilitate non-uniform viewport compression in practical VDS systems, yielding significant bandwidth reduction (about 40% in our experiments) at similar visual quality.
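A toy version of zone-wise quality adaptation driven by eccentric angle is sketched below; the knee angle, offsets, and mapping shape are illustrative assumptions, not the just-noticeable-distortion thresholds derived in the paper.

```python
import numpy as np

def zone_qp(eccentricity_deg, base_qp=27, max_offset=8, knee_deg=10.0, max_deg=50.0):
    """Toy eccentricity-to-quantization mapping: keep the base QP near the fovea,
    then let the allowed quantization grow linearly up to max_offset at max_deg.
    All constants here are illustrative, not the paper's fitted perception thresholds."""
    ecc = np.asarray(eccentricity_deg, dtype=float)
    ramp = np.clip((ecc - knee_deg) / (max_deg - knee_deg), 0.0, 1.0)
    return (base_qp + np.round(ramp * max_offset)).astype(int)

zones_deg = [0, 5, 10, 20, 30, 45]   # concentric viewport zones, degrees from gaze center
print(dict(zip(zones_deg, zone_qp(zones_deg).tolist())))
# {0: 27, 5: 27, 10: 27, 20: 29, 30: 31, 45: 34}
```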
{"title":"Modeling the non-uniform retinal perception for viewport-dependent streaming of immersive video","authors":"Peiyao Guo, Wenjing Su, Xu Zhang, Hao Chen, Zhan Ma","doi":"10.1007/s00530-024-01434-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01434-5","url":null,"abstract":"<p>Viewport-dependent streaming (VDS) of immersive video typically devises the attentive viewport (or FoV - Field of View) with high-quality compression but low-quality compressed content outside of it to reduce bandwidth. It, however, assumes uniform compression within the viewport, completely neglecting visual redundancy caused by non-uniform perception in central and peripheral vision areas when consuming the content using a head-mounted display (HMD). Our work models the unequal retinal perception within the instantaneous viewport and explores using it in the VDS system for non-uniform viewport compression to further save the data volume. To this end, we assess the just-noticeable-distortion moment of the rendered viewport frame by carefully adapting image quality-related compression factors like quantization stepsize q and/or spatial resolution s zone-by-zone to explicitly derive the imperceptible quality perception threshold with respect to the eccentric angle. Independent validations show that the visual perception of the immersive images with non-uniform FoV quality guided by our model is indistinguishable from that of images with default uniform FoV quality. Our model can be flexibly integrated with the tiling strategy in popular video codecs to facilitate non-uniform viewport compression in practical VDS systems for significant bandwidth reduction (e.g., about 40% reported in our experiments) at similar visual quality.\u0000</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"61 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-02, DOI: 10.1007/s00530-024-01371-3
Bitrate-adaptive and quality-aware HTTP video streaming with the multi-access edge computing server handoff control
Chung-Ming Huang, Zi-Yuan Hu
To deal with the coming Multi-access Edge Computing (MEC)-based 5G and future 6G wireless mobile network environments, this work proposes an MEC-based video streaming method that combines a quality-aware video bitrate adaptation mechanism with an MEC server handoff control mechanism for Dynamic Adaptive Streaming over HTTP (MPEG-DASH). Since the user is moving, the attached Base Station (BS) of the cellular network can change, i.e., a BS handoff can happen, which results in a corresponding MEC server handoff. This work therefore proposes an MEC server handoff control mechanism that keeps the playback quality smooth when an MEC server handoff occurs. To let the MEC server derive the video bitrate for each video segment of the MPEG-DASH stream and keep the streaming smooth across MEC server handoffs, the proposed method (i) derives the estimated bandwidth using an adaptive filter mechanism, (ii) generates candidate video bitrates by considering the estimated bandwidth and the buffer occupancy on the client side, and then (iii) selects a video bitrate from the candidates with video-quality stability in mind. For quality stability, the proposed method considers not only (i) both bandwidth and buffer issues but also (ii) long-term and short-term quality variation to achieve adaptive video streaming. The performance evaluation, executed in a lab-wide experimental LTE eNB network system, shows that the proposed method provides more stable video quality for MPEG-DASH video streaming over the wireless mobile network.
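A simplified rate-selection rule in the spirit of steps (i)-(iii) is sketched below; the safety margin, buffer threshold, and switch penalty are illustrative assumptions rather than the paper's mechanism, and the MEC server handoff control is not modeled.

```python
def select_bitrate(ladder_kbps, est_bw_kbps, buffer_s, last_kbps,
                   safety=0.9, low_buf_s=8.0, switch_penalty=0.15):
    """Toy quality-aware ABR rule: generate candidate bitrates the estimated bandwidth
    and buffer can sustain, then prefer the candidate closest to the previous choice
    so the playback quality stays stable. All thresholds are illustrative assumptions."""
    budget = est_bw_kbps * safety
    if buffer_s < low_buf_s:                     # thin buffer: be conservative
        budget *= buffer_s / low_buf_s
    candidates = [r for r in ladder_kbps if r <= budget] or [min(ladder_kbps)]

    # score: higher bitrate is better, but large jumps from the last choice are penalized
    def score(r):
        return r - switch_penalty * abs(r - last_kbps)

    return max(candidates, key=score)

ladder = [500, 1200, 2500, 4500, 8000]           # kbps rungs of an MPEG-DASH ladder
print(select_bitrate(ladder, est_bw_kbps=5200, buffer_s=12.0, last_kbps=2500))  # 4500
```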
{"title":"Bitrate-adaptive and quality-aware HTTP video streaming with the multi-access edge computing server handoff control","authors":"Chung-Ming Huang, Zi-Yuan Hu","doi":"10.1007/s00530-024-01371-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01371-3","url":null,"abstract":"<p>To deal with the coming Multi-access Edge Computing (MEC)-based 5G and the future 6G wireless mobile network environment, a Multi-access Edge Computing (MEC)-based video streaming method using the proposed quality-aware video bitrate adaption and MEC server handoff control mechanisms for Dynamic Adaptive Streaming over HTTP (MPEG-DASH) video streaming was proposed in this work. Since the user is moving, the attached Base Station (BS) of the cellular network can be changed, i.e., the BS handoff can happen, which results in the corresponding MEC server handoff. Thus, this work proposed the MEC server handoff control mechanism to make the playing quality be smooth when the MEC server handoff happens. To have the MEC server to be able to derive the video bit rate for each video segment of the MPEG-DASH video streaming and to have the smooth video streaming when the MEC server handoff happens, the proposed method (i) derives the estimated bandwidth using the adaptive filter mechanism, (ii) generates some candidate video bit rates by considering the estimated bandwidth and the buffer occupancy situation in the client side and then (iii) selects a video bit rate from the candidate ones considering video quality’s stability. For the video quality’s stability concern, the proposed method considered not only (i) both bandwidth and buffer issues but also (ii) the long-term quality variation and the short-term quality variation to have the adaptive video streaming. The results of the performance evaluation, which was executed in a lab-wide experimental LTE network eNB system, shown that the proposed method has the more stable video quality for the MPEG-DASH video streaming over the wireless mobile network.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"43 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141883369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}