TDGar-Ani: temporal motion fusion model and deformation correction network for enhancing garment animation details
Pub Date: 2024-07-30 | DOI: 10.1007/s00371-024-03575-0
Jiazhe Miao, Tao Peng, Fei Fang, Xinrong Hu, Li Li
Garment simulation technology has widespread applications in fields such as virtual try-on and game animation. Traditional methods often require extensive manual annotation, which reduces efficiency. Recent methods that simulate garments from real videos often suffer from frame jitter because they neglect temporal details. These approaches usually reconstruct human bodies and garments together without considering physical constraints, leading to unnatural stretching of garments during motion. To address these challenges, we propose TDGar-Ani. We first propose a motion fusion module to optimize human motion sequences and resolve frame jitter. Initial garment deformations are then generated under physical constraints and combined with correction parameters produced by a deformation correction network, ensuring that garment and body deformations remain coordinated during motion and thereby enhancing the realism of the simulation. Our experimental results demonstrate the applicability of the motion fusion module for capturing human motion from real videos. The overall simulation results also exhibit higher naturalness and realism, effectively improving the alignment and deformation between garments and human body motion.
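As an illustration of how a physics-based initial deformation can be combined with learned per-vertex corrections, here is a minimal PyTorch sketch; the module name `CorrectionNet`, the feature inputs, and the additive combination are our assumptions for illustration and are not taken from the TDGar-Ani paper.

```python
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """Hypothetical per-vertex correction network: maps garment/body features
    to small 3D offsets that refine a physics-based initial deformation."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-vertex (dx, dy, dz)
        )

    def forward(self, vertex_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vertex_feats)

# Usage sketch: physics_deform would come from a cloth simulator (not shown).
num_verts, feat_dim = 5000, 64
physics_deform = torch.randn(num_verts, 3)        # initial physics-based vertex positions
vertex_feats = torch.randn(num_verts, feat_dim)   # assumed pose/garment features
corrected = physics_deform + CorrectionNet(feat_dim)(vertex_feats)
```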
Visual–language foundation models in medicine
Pub Date: 2024-07-29 | DOI: 10.1007/s00371-024-03579-w
Chunyu Liu, Yixiao Jin, Zhouyu Guan, Tingyao Li, Yiming Qin, Bo Qian, Zehua Jiang, Yilan Wu, Xiangning Wang, Ying Feng Zheng, Dian Zeng
By integrating visual and linguistic understanding, visual–language foundation models (VLFMs) have great potential to advance the interpretation of medical data, thereby enhancing diagnostic precision, treatment planning, and patient management. We review the developmental strategies of VLFMs, detailing their pretraining strategies and subsequent applications across various healthcare facets. We describe the challenges inherent to VLFMs, including safeguarding data privacy amidst sensitive medical data usage, ensuring algorithmic transparency, and fostering explainability for trust in clinical decision-making. We underscore the significance of VLFMs in addressing the complexity of multimodal medical data, from visual to textual, and their potential in tasks such as image-based disease diagnosis, medical report synthesis, and longitudinal patient monitoring. We also examine progress in VLFMs such as Med-Flamingo and LLaVA-Med, their zero-shot learning capabilities, and the exploration of parameter-efficient fine-tuning methods for efficient adaptation. This review concludes by encouraging the community to pursue these emergent and promising directions to strengthen the impact of artificial intelligence and deep learning on healthcare delivery and research.
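Since the review highlights parameter-efficient fine-tuning for adapting large VLFMs, the following is a minimal PyTorch sketch of one widely used technique of that kind (LoRA-style low-rank adapters); it illustrates the general idea only and is not code from any of the surveyed models.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter contributes zero at initialization
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage sketch: wrap a projection layer of a pretrained model with a trainable adapter.
proj = nn.Linear(768, 768)
adapted = LoRALinear(proj, r=8)
out = adapted(torch.randn(4, 768))
```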
Correction to: Jacobi set simplification for tracking topological features in time-varying scalar fields
Pub Date: 2024-07-27 | DOI: 10.1007/s00371-024-03577-y
Dhruv Meduri, Mohit Sharma, Vijay Natarajan
EGDNet: an efficient glomerular detection network for multiple anomalous pathological feature in glomerulonephritis
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03570-5
Saba Ghazanfar Ali, Xiaoxia Wang, Ping Li, Huating Li, Po Yang, Younhyun Jung, Jing Qin, Jinman Kim, Bin Sheng
Glomerulonephritis (GN) is a severe kidney disorder in which the kidney tissues become inflamed and have problems filtering waste from the blood. Typical approaches to GN diagnosis require a specialist's examination of pathological glomerular features (PGF) in a patient's pathology images. These PGF are primarily analyzed via manual quantitative evaluation, which is a time-consuming, labor-intensive, and error-prone task for doctors. Thus, automatic and accurate detection of PGF is crucial for the efficient diagnosis of GN and other kidney-related diseases. Recent advances in convolutional neural network-based deep learning methods have shown the capability to learn complex structural variants, with promising detection results in medical image applications. However, these methods are not directly applicable to glomerular detection due to large spatial and structural variability and inter-class imbalance. Thus, we propose an efficient glomerular detection network (EGDNet), the first designed for detecting seven types of PGF. Our EGDNet consists of four modules: (i) a hybrid data augmentation strategy to resolve dataset problems, (ii) an efficient intersection-over-union balancing module for uniform sampling of hard and easy samples, (iii) a feature pyramid balancing module to obtain balanced multi-scale features for robust detection, and (iv) a balanced L1 regression loss that alleviates the impact of anomalous data for multi-PGF detection. We also built a private dataset of seven PGF from an affiliated hospital in Shanghai, China. Experiments on this dataset show that our EGDNet outperforms state-of-the-art methods, achieving superior accuracy of 91.2%, 94.9%, and 94.2% on small, medium, and large pathological features, respectively.
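The balanced L1 regression loss in (iv) is, in its standard form, the loss introduced by Libra R-CNN; the sketch below implements that generic formulation (EGDNet's exact variant may differ).

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 loss in the style of Libra R-CNN (Pang et al., 2019), which
    increases the contribution of inlier samples (|x| < beta) relative to smooth L1.
    Generic sketch; EGDNet's exact variant may differ."""
    x = torch.abs(pred - target)
    b = math.e ** (gamma / alpha) - 1                        # from alpha * ln(b + 1) = gamma
    inlier = (alpha / b) * (b * x + 1) * torch.log(b * x / beta + 1) - alpha * x
    outlier = gamma * x + gamma / b - alpha * beta           # chosen so both branches meet at |x| = beta
    return torch.where(x < beta, inlier, outlier).mean()

# Usage sketch on dummy bounding-box regression offsets.
pred = torch.randn(8, 4, requires_grad=True)
target = torch.randn(8, 4)
loss = balanced_l1_loss(pred, target)
loss.backward()
```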
Weakly supervised semantic segmentation via saliency perception with uncertainty-guided noise suppression
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03574-1
Xinyi Liu, Guoheng Huang, Xiaochen Yuan, Zewen Zheng, Guo Zhong, Xuhang Chen, Chi-Man Pun
Weakly Supervised Semantic Segmentation (WSSS) has become increasingly popular for achieving remarkable segmentation with only image-level labels. Current WSSS approaches extract Class Activation Mapping (CAM) from classification models to produce pseudo-masks for segmentation supervision. However, due to the gap between the image-level supervised classification loss and the pixel-level CAM generation task, the model tends to activate discriminative regions at the image level rather than pursuing pixel-level classification results. Moreover, insufficient supervision leads to unrestricted attention diffusion in the model, further introducing inter-class recognition noise. In this paper, we introduce a framework that employs Saliency Perception and Uncertainty, which includes a Saliency Perception Module (SPM) with a Pixel-wise Transfer Loss (SP-PT) and an Uncertainty-guided Noise Suppression method. Specifically, within the SPM, we employ a hybrid attention mechanism to expand the receptive field of the module and enhance its ability to perceive salient object features. Meanwhile, the Pixel-wise Transfer Loss is designed to guide the attention diffusion of the classification model to non-discriminative regions at the pixel level, thereby mitigating the bias of the model. To further enhance the robustness of CAM for obtaining more accurate pseudo-masks, we propose a noise suppression method based on uncertainty estimation, which applies a confidence matrix to the loss function to suppress the propagation of erroneous information and correct it, making the model more robust to noise. We conducted experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets, and the results demonstrate the effectiveness of our proposed framework. Code is available at https://github.com/pur-suit/SPU.
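As a rough illustration of uncertainty-guided noise suppression — down-weighting pixels whose pseudo-labels look unreliable — here is a minimal PyTorch sketch; using softmax entropy as the confidence estimator is our assumption, not necessarily the paper's choice.

```python
import math
import torch
import torch.nn.functional as F

def uncertainty_weighted_ce(logits, pseudo_labels):
    """Cross-entropy on pseudo-labels, weighted by a per-pixel confidence matrix.
    Confidence is taken as 1 - normalized softmax entropy (an assumed estimator)."""
    probs = F.softmax(logits, dim=1)                              # (N, C, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)       # (N, H, W)
    confidence = 1.0 - entropy / math.log(logits.shape[1])        # in [0, 1]
    per_pixel_ce = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (confidence.detach() * per_pixel_ce).mean()

# Usage sketch with dummy data: 2 images, 21 classes, 64x64 pseudo-masks.
logits = torch.randn(2, 21, 64, 64, requires_grad=True)
pseudo = torch.randint(0, 21, (2, 64, 64))
loss = uncertainty_weighted_ce(logits, pseudo)
loss.backward()
```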
FuseNet: a multi-modal feature fusion network for 3D shape classification
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03581-2
Xin Zhao, Yinhuang Chen, Chengzhuan Yang, Lincong Fang
Recently, the primary focus of research in 3D shape classification has been on point cloud and multi-view methods. However, multi-view approaches inevitably lose the structural information of 3D shapes due to camera angle limitations, while point cloud methods obtain a global feature by max pooling over all points, losing local detail. These disadvantages limit the performance of 3D shape classification. This paper proposes a novel FuseNet model, which integrates multi-view and point cloud information and significantly improves the accuracy of 3D model classification. First, we propose multi-view and point cloud branches to obtain the raw features of different convolution layers from the multiple views and the point cloud. Second, we adopt a multi-view pooling method to fuse features across views and integrate features of different convolution layers more effectively, and we propose an attention-based multi-view and point cloud fusion block for integrating the point cloud and multi-view features. Finally, we extensively tested our method on three benchmark datasets: ModelNet10, ModelNet40, and ShapeNet Core55. The experimental results demonstrate classification performance superior or comparable to previously established state-of-the-art techniques for 3D shape classification.
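Attention-based fusion of a point cloud descriptor and a multi-view descriptor can be pictured as learned weighting over the two modalities; the sketch below is an illustrative PyTorch module under that assumption, not FuseNet's actual fusion block.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of a point-cloud descriptor and a multi-view descriptor
    via learned attention weights over the two modalities."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, pc_feat: torch.Tensor, mv_feat: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([pc_feat, mv_feat], dim=1)       # (B, 2, D)
        weights = torch.softmax(self.score(stacked), dim=1)    # (B, 2, 1) per-modality attention
        return (weights * stacked).sum(dim=1)                  # (B, D) fused descriptor

# Usage sketch with dummy global features from the two branches.
fuse = AttentionFusion(1024)
fused = fuse(torch.randn(8, 1024), torch.randn(8, 1024))
```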
LKSMN: Large Kernel Spatial Modulation Network for Lightweight Image Super-Resolution
Pub Date: 2024-07-25 | DOI: 10.1007/s00371-024-03562-5
Yubo Zhang, Lei Xu, Haibin Xiang, Haihua Kong, Junhao Bi, Chao Han
Although ViT-based networks have achieved stunning results in image super-resolution, their self-attention (SA) modeling along a single dimension greatly limits reconstruction performance. In addition, the high resource consumption of SA limits its application scenarios. In this study, we explore the working mechanism of SA and redesign its key structures to retain powerful modeling capabilities while reducing resource consumption. Further, we propose the large kernel spatial modulation network (LKSMN); it can leverage the complementary strengths of attention along the spatial and channel dimensions to mine a fuller range of potential correlations. Specifically, LKSMN includes three effective designs. First, we propose multi-scale spatial modulation attention (MSMA) based on convolutional modulation (CM) and large-kernel convolution decomposition (LKCD). Instead of generating feature-relevance scores via queries and keys as in SA, MSMA uses LKCD to act directly on the input features and produce convolutional features that imitate the relevance score matrix. This reduces the computational and storage overhead of SA while retaining its ability to robustly model long-range dependencies. Second, we introduce multi-dconv head transposed attention (MDTA) as an attention modeling scheme in the channel dimension, which complements our MSMA to model pixel interactions in both dimensions simultaneously. Finally, we propose a multi-level feature aggregation module (MLFA) that aggregates feature information extracted by modules at different depths of the network, avoiding the loss of shallow feature information. Extensive experiments demonstrate that our proposed method achieves competitive results at a small network scale (e.g., 26.33 dB on Urban100 ×4 with only 253K parameters). The code is available at https://figshare.com/articles/software/LKSMN_Large_Kernel_Spatial_Modulation_Network_for_Lightweight_Image_Super-Resolution/25603893
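Convolutional modulation of this kind (as popularized by VAN/Conv2Former) replaces the query–key score matrix with a decomposed large-kernel depthwise convolution whose output gates a value branch; the sketch below shows that generic pattern and is not LKSMN's MSMA module.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Generic convolutional-modulation block: a decomposed large-kernel depthwise
    convolution produces attention-like weights that modulate a value projection."""
    def __init__(self, dim: int = 48):
        super().__init__()
        # Large kernel decomposed into depthwise 5x5 + dilated depthwise 7x7 + pointwise 1x1.
        self.lk = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.value = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.lk(x) * self.value(x))  # element-wise modulation replaces SA

# Usage sketch on a dummy feature map.
block = ConvModulation(48)
y = block(torch.randn(1, 48, 64, 64))   # shape preserved: (1, 48, 64, 64)
```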
EDM: a enhanced diffusion models for image restoration in complex scenes
Pub Date: 2024-07-24 | DOI: 10.1007/s00371-024-03549-2
JiaYan Wen, YuanSheng Zhuang, JunYi Deng
At present, diffusion models have achieved state-of-the-art performance by modeling the image synthesis process through a series of denoising network applications. Unlike image synthesis, image restoration (IR) aims to improve the subjective quality of images corrupted by various kinds of degradation. However, IR for complex scenes such as worksite images is highly challenging in the low-level vision field due to complicated environmental factors. To solve this problem, we propose an enhanced diffusion model for image restoration in complex scenes (EDM). It improves the authenticity and representation ability of the generation process while effectively handling complex backgrounds and diverse object types. EDM has three main contributions: (1) Its framework adopts a Mish-based residual module, which enhances the ability to learn complex image patterns and allows for the presence of negative gradients, reducing overfitting risks during model training. (2) It employs a mixed-head self-attention mechanism, which augments the correlation among input elements at each time step and maintains a better balance between capturing the global structural information and the local detailed textures of the image. (3) This study evaluates EDM on a self-built dataset specifically tailored for worksite image restoration, named "Workplace," and compares the results with those on two public datasets, Places2 and Rain100H. Furthermore, the experiments on these datasets demonstrate not only EDM's application value in a specific domain but also its potential and versatility in broader image restoration tasks. Code, dataset and models are available at: https://github.com/Zhuangvictor0/EDM-A-Enhanced-Diffusion-Models-for-Image-Restoration-in-Complex-Scenes
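To make the "Mish-based residual module" concrete, here is a minimal PyTorch residual block using the Mish activation, x · tanh(softplus(x)); the layer arrangement is an assumption for illustration, not EDM's exact module.

```python
import torch
import torch.nn as nn

class MishResidualBlock(nn.Module):
    """Illustrative residual block with Mish activations, which pass small
    negative values instead of zeroing them as ReLU does."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Mish(),                                   # x * tanh(softplus(x))
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)   # skip connection keeps the identity path

# Usage sketch on a dummy denoising feature map.
block = MishResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))
```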
Video anomaly detection with both normal and anomaly memory modules
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03584-z
Liang Zhang, Shifeng Li, Xi Luo, Xiaoru Liu, Ruixuan Zhang
In this paper, we propose a novel framework for video anomaly detection that employs dual memory modules, one for normal patterns and one for anomaly patterns. By maintaining separate memory modules, our approach captures a broader range of video data behaviors. We begin by generating pseudo-anomalies with a temporal pseudo-anomaly synthesizer; these data train the anomaly memory module, while normal data train the normal memory module. To distinguish normal from anomalous data, we introduce a loss function that computes the memory loss between the two memory modules. We enhance the memory modules by incorporating an entropy loss and a hard shrinkage rectified linear unit (ReLU). Additionally, we integrate skip connections into our model so that the memory modules capture comprehensive patterns beyond prototypical representations. Extensive experiments and analysis on various challenging video anomaly datasets validate the effectiveness of our approach in detecting anomalies. The code for our method is available at https://github.com/SVIL2024/Pseudo-Anomaly-MemAE.
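The memory mechanics referenced here — similarity-based addressing, a hard shrinkage ReLU on the addressing weights, and an entropy loss — follow the general MemAE recipe; the sketch below shows that generic read operation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """MemAE-style memory read: cosine addressing, hard-shrunk sparse weights,
    and an entropy term that encourages using few memory slots per query."""
    def __init__(self, num_slots: int = 100, dim: int = 256, shrink: float = 0.0025):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))
        self.shrink = shrink

    def forward(self, query: torch.Tensor):
        w = torch.softmax(F.linear(F.normalize(query, dim=1),
                                   F.normalize(self.memory, dim=1)), dim=1)      # (B, num_slots)
        w = F.relu(w - self.shrink) * w / (torch.abs(w - self.shrink) + 1e-12)   # hard shrinkage ReLU
        w = F.normalize(w, p=1, dim=1)                                           # re-normalize weights
        read = w @ self.memory                                                   # (B, dim) prototype read
        entropy = (-w * torch.log(w + 1e-12)).sum(dim=1).mean()                  # sparsity (entropy) loss
        return read, entropy

# Usage sketch: read prototypes for a batch of encoder features.
mem = MemoryModule(num_slots=100, dim=256)
features, ent_loss = mem(torch.randn(8, 256))
```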
Learning geometric invariants through neural networks
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03398-z
Arpit Rai
Convolutional neural networks have become a fundamental model for solving various computer vision tasks. However, these operations are invariant only to translations of objects, and their performance suffers under rotation and other affine transformations. This work proposes a novel neural network that leverages geometric invariants, including curvature, higher-order differentials of curves extracted from object boundaries at multiple scales, and the relative orientations of edges. These features are invariant to affine transformations and can improve the robustness of shape recognition in neural networks. In our experiments on the smallNORB dataset, a 2-layer network operating over these geometric invariants outperforms a 3-layer convolutional network by 9.69% while being more robust to affine transformations, even when trained without any data augmentation. Notably, our network exhibits a mere 6% degradation in test accuracy when test images are rotated by 40°, in contrast to the significant drops of 51.7% and 69% observed in VGG networks and convolutional networks, respectively, under the same transformations. Additionally, our models show greater robustness than invariant feature descriptors such as the SIFT-based bag-of-words classifier and its rotation-invariant extension, the RIFT descriptor, which suffer drops of 35% and 14.1%, respectively, under similar image transformations. Our experimental results further show improved robustness against scale and shear transformations. Furthermore, the multi-scale extension of our geometric invariant network, which extracts higher-order curve differentials, shows enhanced robustness to scaling and shearing transformations.
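The curvature invariant mentioned above is, for a planar boundary curve, κ = (x′y″ − y′x″)/(x′² + y′²)^{3/2}; the NumPy sketch below computes it with finite differences purely as an illustration and is unrelated to the author's network code.

```python
import numpy as np

def boundary_curvature(points: np.ndarray) -> np.ndarray:
    """Curvature of a 2D boundary curve sampled as an (N, 2) array of (x, y) points,
    using finite differences: kappa = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2)."""
    x, y = points[:, 0], points[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)          # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)      # second derivatives
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-12) ** 1.5

# Usage sketch: a unit circle has constant curvature of about 1.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
print(boundary_curvature(circle).round(2))           # ~1.0, with small errors at the array ends
```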