Latest Publications in Multimedia Systems

Multimodal recommender system based on multi-channel counterfactual learning networks
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-13 · DOI: 10.1007/s00530-024-01448-z
Hong Fang, Leiyuxin Sha, Jindong Liang

Most multimodal recommender systems use the multimodal content of user-interacted items as supplementary information to capture user preferences from historical interactions, without considering user-uninteracted items. In contrast, multimodal recommender systems based on counterfactual learning from causal inference exploit the causal difference between the multimodal content of user-interacted and user-uninteracted items to purify the content related to user preferences. However, existing methods adopt a unified multimodal channel that treats every modality equally, so they cannot distinguish users' tastes for different modalities, and the differences in users' attention to and perception of each modality's content go unreflected. To cope with this issue, this paper proposes a novel recommender system based on multi-channel counterfactual learning (MCCL) networks to capture users' fine-grained preferences on different modalities. First, two independent channels are established for the image and text modalities, each extracting modality-specific features from the corresponding content. Then, leveraging the counterfactual theory of causal inference, features in each channel unrelated to user preferences are eliminated using the features of user-uninteracted items; features related to user preferences are enhanced, and multimodal user preferences are modeled at the content level, portraying users' tastes for the different modalities of items. Finally, semantic entities are extracted to model semantic-level multimodal user preferences, which are fused with historical user interaction information and content-level user preferences for recommendation. Extensive experiments on three different datasets show that our method improves NDCG by up to 4.17% over the best-performing baseline.
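
As a rough illustration of the multi-channel purification idea, the sketch below contrasts interacted and uninteracted item features in separate image and text channels. The module structure, dimensions, and the gated subtraction used for "purification" are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class ModalityChannel(nn.Module):
    """One per-modality channel: projects raw features and purifies them by
    contrasting interacted vs. uninteracted item content (assumed form)."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.gate = nn.Sequential(nn.Linear(hid_dim * 2, hid_dim), nn.Sigmoid())

    def forward(self, interacted, uninteracted):
        pos = self.proj(interacted)    # content of items the user engaged with
        neg = self.proj(uninteracted)  # counterfactual: items the user skipped
        g = self.gate(torch.cat([pos, neg], dim=-1))
        return g * pos + (1 - g) * (pos - neg)  # keep preference-related directions

class MCCLSketch(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hid_dim=128):
        super().__init__()
        self.image_channel = ModalityChannel(img_dim, hid_dim)  # independent channels,
        self.text_channel = ModalityChannel(txt_dim, hid_dim)   # one per modality

    def forward(self, img_pos, img_neg, txt_pos, txt_neg):
        u_img = self.image_channel(img_pos, img_neg)
        u_txt = self.text_channel(txt_pos, txt_neg)
        return torch.cat([u_img, u_txt], dim=-1)  # content-level user preference

# toy usage: a batch of 4 users
model = MCCLSketch()
out = model(torch.randn(4, 2048), torch.randn(4, 2048),
            torch.randn(4, 768), torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 256])
```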

Citations: 0
Exploring multi-level transformers with feature frame padding network for 3D human pose estimation
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-13 · DOI: 10.1007/s00530-024-01451-4
Sathiyamoorthi Arthanari, Jae Hoon Jeong, Young Hoon Joo

Recently, transformer-based architectures have achieved remarkable performance in 2D-to-3D lifting pose estimation. Despite these advancements, they still struggle with depth ambiguity, limited temporal information, missing edge-frame details, and short-term temporal features; consequently, transformer architectures face challenges in precisely estimating the 3D human position. To address these problems, we propose Multi-Level Transformers with a Feature Frame Padding Network (MLTFFPN). We first propose the frame-padding network, which allows the network to capture longer temporal dependencies and effectively addresses the missing edge-frame information, enabling a better understanding of the sequential nature of human motion and improving the accuracy of pose estimation. Furthermore, we employ a multi-level transformer to extract temporal information from 3D human poses, aiming to improve the short-range temporal dependencies among keypoints of the human pose skeleton. Specifically, we introduce the Refined Temporal Constriction and Proliferation Transformer (RTCPT), which incorporates spatio-temporal encoders and a Temporal Constriction and Proliferation (TCP) structure to reveal multi-scale attention information and effectively address the depth-ambiguity problem. Moreover, we incorporate the Feature Aggregation Refinement (FAR) module into the TCP block in a cross-layer manner, which facilitates semantic representation through the persistent interaction of queries, keys, and values. We extensively evaluate the efficiency of our method through experiments on two well-known benchmark datasets: Human3.6M and MPI-INF-3DHP.
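
To make the frame-padding idea concrete, here is a minimal PyTorch sketch that reflect-pads a 2D pose sequence along the time axis before a temporal transformer, so edge frames receive full temporal context. The padding mode, network sizes, and the simple encoder are assumptions for illustration, not the MLTFFPN architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pad_pose_sequence(poses: torch.Tensor, pad: int) -> torch.Tensor:
    """Reflect-pad a pose sequence along time so edge frames keep full
    temporal context. poses: (B, T, J, C) -> (B, T + 2*pad, J, C)."""
    b, t, j, c = poses.shape
    x = poses.reshape(b, t, j * c).permute(0, 2, 1)  # (B, J*C, T) for 1D padding
    x = F.pad(x, (pad, pad), mode="reflect")
    return x.permute(0, 2, 1).reshape(b, t + 2 * pad, j, c)

class TemporalLifter(nn.Module):
    """Tiny temporal transformer over the padded sequence (sizes assumed)."""

    def __init__(self, joints=17, dim=64, pad=2):
        super().__init__()
        self.pad = pad
        self.embed = nn.Linear(joints * 2, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, joints * 3)

    def forward(self, poses_2d):                   # (B, T, J, 2)
        x = pad_pose_sequence(poses_2d, self.pad)  # extend edges before attention
        x = self.encoder(self.embed(x.flatten(2)))
        x = x[:, self.pad:-self.pad]               # drop the padded frames again
        y = self.head(x)
        return y.reshape(y.shape[0], y.shape[1], -1, 3)  # (B, T, J, 3)

out = TemporalLifter()(torch.randn(2, 9, 17, 2))
print(out.shape)  # torch.Size([2, 9, 17, 3])
```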

Citations: 0
Propagating prior information with transformer for robust visual object tracking
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-13 · DOI: 10.1007/s00530-024-01423-8
Yue Wu, Chengtao Cai, Chai Kiat Yeo

In recent years, the domain of visual object tracking has witnessed considerable advancements with the advent of deep learning methodologies. Siamese-based trackers have been pivotal, establishing a new architecture with a weight-shared backbone. With the inclusion of the transformer, the attention mechanism has been exploited to enhance feature discriminability across successive frames. However, the limited adaptability of many existing trackers to different tracking scenarios leads to inaccurate target localization. To solve this issue, we integrate a Siamese network with a transformer: the former utilizes ResNet50 as the backbone network to extract target features, while the latter consists of an encoder, which effectively exploits global contextual information to obtain discriminative features, and a decoder. We employ the decoder to propagate prior information related to the target, which enables the tracker to successfully locate the target in a variety of environments, enhancing its stability and robustness. Extensive experiments on four major public datasets, OTB100, UAV123, GOT10k and LaSOText, demonstrate the effectiveness of the proposed method, whose performance surpasses many state-of-the-art trackers. Additionally, the proposed tracker achieves a tracking speed of 60 fps, meeting the requirements for real-time tracking.
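
The encoder-decoder flow described above can be sketched as follows: a weight-shared ResNet50 extracts template and search features, the transformer encoder fuses them, and a decoder query stands in for the propagated prior information. The query form, token layout, and box head are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TrackerSketch(nn.Module):
    """Siamese backbone + transformer: the encoder fuses template/search
    tokens, the decoder propagates a prior-information query (assumed form)."""

    def __init__(self, dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # shared weights
        self.reduce = nn.Conv2d(2048, dim, kernel_size=1)
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.box_head = nn.Linear(dim, 4)      # (cx, cy, w, h)

    def tokens(self, img):
        f = self.reduce(self.backbone(img))    # (B, dim, H', W')
        return f.flatten(2).permute(0, 2, 1)   # (B, H'*W', dim)

    def forward(self, template, search, prior_query):
        mem = torch.cat([self.tokens(template), self.tokens(search)], dim=1)
        out = self.transformer(src=mem, tgt=prior_query)  # decoder injects prior info
        return self.box_head(out[:, 0])                   # predicted target box

model = TrackerSketch()
box = model(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 256, 256),
            torch.randn(1, 1, 256))
print(box.shape)  # torch.Size([1, 4])
```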

Citations: 0
Multi-level pyramid fusion for efficient stereo matching
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-12 · DOI: 10.1007/s00530-024-01419-4
Jiaqi Zhu, Bin Li, Xinhua Zhao

Stereo matching is a key technology for many autonomous driving and robotics applications. Recently, methods based on Convolutional Neural Networks have achieved huge progress. However, it is still difficult to find accurate matching points in inherently ill-posed regions such as areas with weak texture and reflective surfaces. In this paper, we propose a multi-level pyramid fusion volume (MPFV-Stereo) which contains two prominent components: a multi-scale cost volume (MSCV) and a multi-level cost volume (MLCV). We also design a low-parameter Gaussian attention module to excite the cost volume. Our MPFV-Stereo ranks 2nd on KITTI 2012 (Reflective) among all published methods. In addition, MPFV-Stereo achieves competitive results on both the Scene Flow and KITTI datasets and requires less training to reach strong cross-dataset generalization on the Middlebury and ETH3D benchmarks.
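
For readers unfamiliar with cost volumes, the sketch below builds a correlation-based cost volume and a simple multi-scale variant, where coarse scales help weak-texture regions. The scale set and pooling choices are assumptions; the paper's MSCV/MLCV construction and Gaussian attention module are not reproduced here.

```python
import torch
import torch.nn.functional as F

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation cost volume: for each candidate disparity d, correlate the
    left feature map with the right one shifted d pixels."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume

def multi_scale_cost_volume(feat_l, feat_r, max_disp, scales=(1, 2)):
    """Assumed multi-scale variant: build volumes on downsampled features and
    upsample them back to the full feature resolution."""
    volumes = []
    for s in scales:
        fl = F.avg_pool2d(feat_l, s) if s > 1 else feat_l
        fr = F.avg_pool2d(feat_r, s) if s > 1 else feat_r
        v = correlation_cost_volume(fl, fr, max_disp // s)
        volumes.append(F.interpolate(v, size=feat_l.shape[-2:], mode="bilinear",
                                     align_corners=False))
    return volumes

vols = multi_scale_cost_volume(torch.randn(1, 32, 64, 128),
                               torch.randn(1, 32, 64, 128), max_disp=24)
print([v.shape for v in vols])  # [torch.Size([1, 24, 64, 128]), torch.Size([1, 12, 64, 128])]
```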

Citations: 0
Underwater image enhancement based on weighted guided filter image fusion
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-12 · DOI: 10.1007/s00530-024-01432-7
Dan Xiang, Huihua Wang, Zebin Zhou, Hao Zhao, Pan Gao, Jinwen Zhang, Chun Shan

An underwater image enhancement technique based on weighted guided filter image fusion is proposed to address challenges including optical absorption and scattering, color distortion, and uneven illumination. The method consists of three stages: color correction, local contrast enhancement, and image fusion. For color correction, basic correction is achieved through channel compensation and remapping, with saturation adjusted according to the histogram distribution to enhance visual richness. For local contrast enhancement, the approach involves box filtering and a variational model to improve image saturation. Finally, the method utilizes weighted guided filter image fusion to achieve high visual quality in underwater images. Our method outperforms eight state-of-the-art algorithms on no-reference metrics, demonstrating its effectiveness and innovation; application tests and runtime comparisons further validate its practicality.
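
A minimal sketch of the channel-compensation-and-remapping step is given below, following the common underwater heuristic of compensating the attenuated red channel from the green channel. The compensation rule and its strength are assumptions, not the paper's exact formulas.

```python
import numpy as np

def color_correct(img: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Channel compensation + remapping for an RGB image in [0, 1].
    The attenuated red channel is boosted from the green channel (a common
    underwater heuristic; the exact rule here is an assumption), then every
    channel is remapped to full range."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    r = r + alpha * (g.mean() - r.mean()) * (1.0 - r) * g  # boost red where weak
    out = np.stack([r, g, b], axis=-1)
    lo = out.min(axis=(0, 1), keepdims=True)
    hi = out.max(axis=(0, 1), keepdims=True)
    return (out - lo) / (hi - lo + 1e-8)                   # per-channel remap to [0, 1]

# toy usage on a synthetic blue-green-cast image
rng = np.random.default_rng(0)
corrected = color_correct(rng.random((120, 160, 3)) * np.array([0.3, 0.8, 0.9]))
print(corrected.min(), corrected.max())  # ~0.0 ~1.0
```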

Citations: 0
Discrete codebook collaborating with transformer for thangka image inpainting
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-07 · DOI: 10.1007/s00530-024-01439-0
Jinxian Bai, Yao Fan, Zhiwei Zhao

Thangka, as a precious heritage of painting art, holds irreplaceable research value due to its richness in Tibetan history, religious beliefs, and folk culture. However, it is susceptible to partial damage and form distortion due to natural erosion or inadequate conservation measures. Given the complexity of textures and rich semantics in thangka images, existing image inpainting methods struggle to recover their original artistic style and intricate details. In this paper, we propose a novel approach combining discrete codebook learning with a transformer for image inpainting, tailored specifically to thangka images. In the codebook learning stage, we design an improved network framework based on vector quantization (VQ) codebooks to discretely encode intermediate features of input images, yielding a context-rich discrete codebook. The second phase introduces a parallel transformer module based on a cross-shaped window, which efficiently predicts the index combinations for missing regions at a limited computational cost. Furthermore, we devise a multi-scale feature guidance module that progressively fuses features from intact areas with textural features from the codebook, thereby enhancing the preservation of local details in non-damaged regions. We validate the efficacy of our method through qualitative and quantitative experiments on datasets including CelebA-HQ, Places2, and a custom thangka dataset. Experimental results demonstrate that, compared to previous methods, our approach successfully reconstructs images with more complete structural information and clearer textural details.
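
The discrete encoding step rests on standard vector quantization: each feature vector is snapped to its nearest codebook entry, and the resulting indices are what a transformer can later predict for missing regions. Below is a generic VQ sketch with a straight-through gradient; the codebook size and dimensions are assumed, and the paper's improved VQ framework is not reproduced.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Generic VQ step: map each feature vector to its nearest codebook entry
    and return the discrete indices alongside the quantized features."""

    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                        # z: (B, H*W, dim)
        w = self.codebook.weight                 # (num_codes, dim)
        d = (z.pow(2).sum(-1, keepdim=True)      # squared distance to every code
             - 2.0 * z @ w.t()
             + w.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                   # one discrete token per position
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()             # straight-through estimator so
        return z_q, idx                          # gradients still reach the encoder

vq = VectorQuantizer()
z_q, idx = vq(torch.randn(2, 16 * 16, 64))
print(z_q.shape, idx.shape)  # torch.Size([2, 256, 64]) torch.Size([2, 256])
```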

Citations: 0
A deep low-rank semantic factorization method for micro-video multi-label classification
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-05 · DOI: 10.1007/s00530-024-01428-3
Fugui Fan, Yuting Su, Yun Liu, Peiguang Jing, Kaihua Qu

As a prominent manifestation of user-generated content (UGC), micro-video has emerged as a pivotal medium for individuals to document and disseminate their daily experiences. In particular, micro-videos generally encompass abundant content elements that are abstractly described by a group of annotated labels. However, previous methods primarily focus on the discriminability of explicit labels while neglecting the corresponding implicit semantics, which are particularly relevant to diverse micro-video characteristics. To address this problem, we develop a deep low-rank semantic factorization (DLRSF) method to perform multi-label classification of micro-videos. Specifically, we introduce a semantic embedding matrix to bridge explicit labels and implicit semantics, and further present a low-rank-regularized semantic learning module to explore the intrinsic lowest-rank semantic attributes. A correlation-driven deep semantic interaction module is designed within a deep factorization framework to enhance interactions among instance features, explicit labels and semantic embeddings. Additionally, inverse covariance analysis is employed to unveil the underlying correlation structures between labels and features, making the semantic embeddings more discriminative while improving the model's generalization ability. Extensive experimental results on three available datasets showcase the superiority of our DLRSF over state-of-the-art methods.
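
As a toy rendering of the low-rank regularization, the snippet below scores labels through a semantic embedding matrix and penalizes that matrix's nuclear norm, a convex surrogate for rank. The loss composition and weight are assumptions for illustration, not the DLRSF objective.

```python
import torch
import torch.nn.functional as F

def multilabel_lowrank_loss(scores, labels, semantic_emb, lam=0.01):
    """Toy objective in the spirit of low-rank semantic factorization:
    multi-label BCE plus a nuclear-norm penalty pushing the semantic
    embedding matrix toward low-rank structure (weight lam is assumed)."""
    bce = F.binary_cross_entropy_with_logits(scores, labels)
    nuclear = torch.linalg.svdvals(semantic_emb).sum()  # sum of singular values
    return bce + lam * nuclear

num_labels, dim = 20, 32
semantic_emb = torch.randn(num_labels, dim, requires_grad=True)
features = torch.randn(8, dim)                 # per-video instance features
scores = features @ semantic_emb.t()           # explicit labels via implicit semantics
labels = torch.randint(0, 2, (8, num_labels)).float()
loss = multilabel_lowrank_loss(scores, labels, semantic_emb)
loss.backward()                                # gradients flow into the embedding
print(float(loss))
```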

Citations: 0
Self-supervised learning for fine-grained monocular 3D face reconstruction in the wild
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-05 · DOI: 10.1007/s00530-024-01436-3
Dongjin Huang, Yongsheng Shi, Jinhua Liu, Wen Tang

Reconstructing 3D faces from monocular images is a challenging computer vision task, due to the limitations of traditional 3DMMs (3D Morphable Models) and the lack of high-fidelity 3D facial scanning data. To solve this issue, we propose a novel coarse-to-fine self-supervised learning framework for reconstructing fine-grained 3D faces from monocular images in the wild. In the coarse stage, face parameters extracted from a single image are used to reconstruct a coarse 3D face through a 3DMM. In the refinement stage, we design a wavelet transform perception model to extract facial details in different frequency domains from an input image. Furthermore, we propose a depth displacement module based on the wavelet transform perception model to generate a refined displacement map from the unwrapped UV textures of the input image and the rendered coarse face, which can be used to synthesize detailed 3D face geometry. Moreover, we propose a novel albedo map module based on the wavelet transform perception model to capture high-frequency texture information and generate a detailed albedo map consistent with face illumination. The detailed face geometry and albedo map are used to reconstruct a fine-grained 3D face without any labeled data. Extensive experiments on four public datasets, including CelebA, LS3D, LFW, and the NoW benchmark, demonstrate the superiority of our method over existing state-of-the-art approaches to 3D face reconstruction. The experimental results indicate that our method achieves higher accuracy and robustness, particularly under challenging conditions such as occlusion, large poses, and varying illumination.
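
The frequency-domain detail extraction can be illustrated with an ordinary 2D discrete wavelet transform (via PyWavelets): the high-frequency sub-bands carry the fine facial detail that a displacement or albedo branch would consume. The wavelet choice and the energy summary below are assumptions, not the paper's perception model.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_bands(gray: np.ndarray, wavelet: str = "haar"):
    """Split an image into a low-frequency approximation and a summary of the
    three high-frequency detail bands ('haar' is an assumed choice)."""
    cA, (cH, cV, cD) = pywt.dwt2(gray, wavelet)       # one DWT level
    detail_energy = np.sqrt(cH**2 + cV**2 + cD**2)    # where fine detail lives
    return cA, detail_energy

# toy usage on a synthetic grayscale "face" crop
rng = np.random.default_rng(0)
approx, detail = wavelet_bands(rng.random((256, 256)))
print(approx.shape, detail.shape)  # (128, 128) (128, 128)
```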

Citations: 0
Modeling the non-uniform retinal perception for viewport-dependent streaming of immersive video
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-05 · DOI: 10.1007/s00530-024-01434-5
Peiyao Guo, Wenjing Su, Xu Zhang, Hao Chen, Zhan Ma

Viewport-dependent streaming (VDS) of immersive video typically delivers the attended viewport (or FoV, field of view) with high-quality compression while compressing the content outside it at low quality to reduce bandwidth. It, however, assumes uniform compression within the viewport, completely neglecting the visual redundancy caused by non-uniform perception across the central and peripheral vision areas when the content is consumed on a head-mounted display (HMD). Our work models this unequal retinal perception within the instantaneous viewport and exploits it in the VDS system for non-uniform viewport compression to further reduce the data volume. To this end, we assess the just-noticeable-distortion moment of the rendered viewport frame by carefully adapting image quality-related compression factors, such as the quantization stepsize q and/or the spatial resolution s, zone by zone, to explicitly derive the imperceptible quality-perception threshold as a function of the eccentric angle. Independent validations show that the visual perception of immersive images with non-uniform FoV quality guided by our model is indistinguishable from that of images with the default uniform FoV quality. Our model can be flexibly integrated with the tiling strategy in popular video codecs to facilitate non-uniform viewport compression in practical VDS systems, yielding significant bandwidth reduction (about 40% in our experiments) at similar visual quality.
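
A small numeric sketch of the zone-wise adaptation: map each pixel to its eccentric angle under a planar viewport projection, then let the allowed quantization stepsize grow with eccentricity. The linear growth rule is purely illustrative; the paper derives its threshold empirically from just-noticeable-distortion measurements.

```python
import numpy as np

def eccentricity_map(width, height, fov_deg=90.0):
    """Eccentric angle (degrees) of each pixel from the viewport centre,
    assuming a planar viewport spanning fov_deg horizontally."""
    focal_px = (width / 2.0) / np.tan(np.radians(fov_deg / 2.0))
    x = np.arange(width) - width / 2.0
    y = np.arange(height) - height / 2.0
    xx, yy = np.meshgrid(x, y)
    return np.degrees(np.arctan(np.hypot(xx, yy) / focal_px))

def q_step_per_zone(ecc_deg, base_q=8.0, slope=0.5):
    """Illustrative rule only: keep base_q at the fovea and let the allowed
    quantization stepsize grow linearly with eccentricity (slope assumed)."""
    return base_q + slope * ecc_deg

ecc = eccentricity_map(1920, 1080)
print(q_step_per_zone(ecc)[540, ::480].round(1))  # coarser q toward the periphery
```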

Citations: 0
Bitrate-adaptive and quality-aware HTTP video streaming with the multi-access edge computing server handoff control
IF 3.9 · CAS Tier 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-02 · DOI: 10.1007/s00530-024-01371-3
Chung-Ming Huang, Zi-Yuan Hu

To address the coming Multi-access Edge Computing (MEC)-based 5G and future 6G wireless mobile network environments, this work proposes an MEC-based video streaming method for Dynamic Adaptive Streaming over HTTP (MPEG-DASH) that combines quality-aware video bitrate adaptation with MEC server handoff control. Since the user is moving, the attached Base Station (BS) of the cellular network can change, i.e., a BS handoff can happen, which results in a corresponding MEC server handoff. This work therefore proposes an MEC server handoff control mechanism that keeps playback quality smooth when the MEC server handoff occurs. To let the MEC server derive the video bitrate for each video segment of the MPEG-DASH stream, and to keep the streaming smooth across MEC server handoffs, the proposed method (i) derives the estimated bandwidth using an adaptive filter mechanism, (ii) generates candidate video bitrates by considering the estimated bandwidth and the buffer occupancy on the client side, and then (iii) selects a video bitrate from the candidates with video-quality stability in mind. For quality stability, the proposed method considers not only bandwidth and buffer conditions but also both long-term and short-term quality variation in its adaptive streaming. A performance evaluation, executed on a lab-wide experimental LTE eNB network, shows that the proposed method yields more stable video quality for MPEG-DASH video streaming over the wireless mobile network.
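
Steps (i)-(iii) can be condensed into a toy rate-selection routine: an exponentially weighted filter estimates bandwidth, buffer occupancy scales the usable budget, and the final pick favors the bitrate closest to the previous one for stability. The filter gain, safety scaling, and bitrate ladder are assumptions, not the paper's tuned mechanism.

```python
def estimate_bandwidth(samples, alpha=0.8):
    """Adaptive (EWMA-style) filter over recent throughput samples in kbps.
    The filter gain alpha is an assumption, not the paper's tuned value."""
    est = samples[0]
    for s in samples[1:]:
        est = alpha * est + (1 - alpha) * s
    return est

def select_bitrate(ladder, throughput_kbps, buffer_s, last_kbps,
                   target_buffer_s=10.0):
    """Pick a bitrate: candidates must fit the bandwidth budget (scaled by
    buffer health), then prefer the rate closest to the last choice so the
    quality stays stable; ties break toward the higher rate."""
    safety = min(1.0, buffer_s / target_buffer_s)  # low buffer -> be cautious
    budget = throughput_kbps * (0.5 + 0.5 * safety)
    candidates = [r for r in ladder if r <= budget] or [min(ladder)]
    return min(candidates, key=lambda r: (abs(r - last_kbps), -r))

ladder = [500, 1200, 2500, 5000, 8000]             # kbps, assumed ladder rungs
bw = estimate_bandwidth([4200, 3900, 5100, 4600])
print(select_bitrate(ladder, bw, buffer_s=6.0, last_kbps=2500))  # -> 2500
```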

Citations: 0