TDGar-Ani: temporal motion fusion model and deformation correction network for enhancing garment animation details
Pub Date: 2024-07-30 | DOI: 10.1007/s00371-024-03575-0
Jiazhe Miao, Tao Peng, Fei Fang, Xinrong Hu, Li Li
Garment simulation technology has widespread applications in fields such as virtual try-on and game animation. Traditional methods often require extensive manual annotation, which reduces efficiency. Recent methods that simulate garments from real videos often suffer from frame jitter because they neglect temporal details. These approaches usually reconstruct human bodies and garments together without considering physical constraints, leading to unnatural stretching of garments during motion. To address these challenges, we propose TDGar-Ani. We first propose a motion fusion module to optimize human motion sequences and resolve frame jitter. Initial garment deformations are then generated under physical constraints and combined with correction parameters produced by a deformation correction network, ensuring that garment and body deformations remain coordinated during motion and thereby enhancing the realism of the simulation. Our experimental results demonstrate the applicability of the motion fusion module for capturing human motion from real videos. The overall simulation results also exhibit higher naturalness and realism, effectively improving the alignment and deformation between garments and human body motion.
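As an illustration of how a physics-based initial deformation can be combined with learned per-vertex corrections, here is a minimal PyTorch sketch; the module name `CorrectionNet`, the feature inputs, and the additive combination are our assumptions for illustration and are not taken from the TDGar-Ani paper.

```python
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """Hypothetical per-vertex correction network: maps garment/body features
    to small 3D offsets that refine a physics-based initial deformation."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # per-vertex (dx, dy, dz)
        )

    def forward(self, vertex_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vertex_feats)

# Usage sketch: physics_deform would come from a cloth simulator (not shown).
num_verts, feat_dim = 5000, 64
physics_deform = torch.randn(num_verts, 3)        # initial physics-based vertex positions
vertex_feats = torch.randn(num_verts, feat_dim)   # assumed pose/garment features
corrected = physics_deform + CorrectionNet(feat_dim)(vertex_feats)
```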
Visual–language foundation models in medicine
Pub Date: 2024-07-29 | DOI: 10.1007/s00371-024-03579-w
Chunyu Liu, Yixiao Jin, Zhouyu Guan, Tingyao Li, Yiming Qin, Bo Qian, Zehua Jiang, Yilan Wu, Xiangning Wang, Ying Feng Zheng, Dian Zeng
By integrating visual and linguistic understanding, visual–language foundation models (VLFMs) have great potential to advance the interpretation of medical data, thereby enhancing diagnostic precision, treatment planning, and patient management. We review the developmental strategies of VLFMs, detailing their pretraining strategies and subsequent applications across various healthcare facets. We describe the challenges inherent to VLFMs, including safeguarding data privacy amidst sensitive medical data usage, ensuring algorithmic transparency, and fostering explainability for trust in clinical decision-making. We underscore the significance of VLFMs in addressing the complexity of multimodal medical data, from visual to textual, and their potential in tasks such as image-based disease diagnosis, medical report synthesis, and longitudinal patient monitoring. We also examine progress in VLFMs such as Med-Flamingo and LLaVA-Med, their zero-shot learning capabilities, and the exploration of parameter-efficient fine-tuning methods for efficient adaptation. This review concludes by encouraging the community to pursue these emergent and promising directions to strengthen the impact of artificial intelligence and deep learning on healthcare delivery and research.
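Since the review highlights parameter-efficient fine-tuning for adapting large VLFMs, the following is a minimal PyTorch sketch of one widely used technique of that kind (LoRA-style low-rank adapters); it illustrates the general idea only and is not code from any of the surveyed models.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter contributes zero at initialization
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage sketch: wrap a projection layer of a pretrained model with a trainable adapter.
proj = nn.Linear(768, 768)
adapted = LoRALinear(proj, r=8)
out = adapted(torch.randn(4, 768))
```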
Correction to: Jacobi set simplification for tracking topological features in time-varying scalar fields
Pub Date: 2024-07-27 | DOI: 10.1007/s00371-024-03577-y
Dhruv Meduri, Mohit Sharma, Vijay Natarajan
EGDNet: an efficient glomerular detection network for multiple anomalous pathological feature in glomerulonephritis
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03570-5
Saba Ghazanfar Ali, Xiaoxia Wang, Ping Li, Huating Li, Po Yang, Younhyun Jung, Jing Qin, Jinman Kim, Bin Sheng
Glomerulonephritis (GN) is a severe kidney disorder in which the kidney tissues become inflamed and have problems filtering waste from the blood. Typical approaches to GN diagnosis require a specialist's examination of pathological glomerular features (PGF) in a patient's pathology images. These PGF are primarily analyzed via manual quantitative evaluation, which is a time-consuming, labor-intensive, and error-prone task for doctors. Thus, automatic and accurate detection of PGF is crucial for the efficient diagnosis of GN and other kidney-related diseases. Recent advances in convolutional neural network-based deep learning methods have shown the capability to learn complex structural variants, with promising detection results in medical image applications. However, these methods are not directly applicable to glomerular detection due to large spatial and structural variability and inter-class imbalance. Thus, we propose an efficient glomerular detection network (EGDNet), the first designed for detecting seven types of PGF. Our EGDNet consists of four modules: (i) a hybrid data augmentation strategy to resolve dataset problems, (ii) an efficient intersection-over-union balancing module for uniform sampling of hard and easy samples, (iii) a feature pyramid balancing module to obtain balanced multi-scale features for robust detection, and (iv) a balanced L1 regression loss that alleviates the impact of anomalous data for multi-PGF detection. We also built a private dataset of seven PGF from an affiliated hospital in Shanghai, China. Experiments on this dataset show that our EGDNet outperforms state-of-the-art methods, achieving superior accuracy of 91.2%, 94.9%, and 94.2% on small, medium, and large pathological features, respectively.
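The balanced L1 regression loss in (iv) is, in its standard form, the loss introduced by Libra R-CNN; the sketch below implements that generic formulation (EGDNet's exact variant may differ).

```python
import math
import torch

def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 loss in the style of Libra R-CNN (Pang et al., 2019), which
    increases the contribution of inlier samples (|x| < beta) relative to smooth L1.
    Generic sketch; EGDNet's exact variant may differ."""
    x = torch.abs(pred - target)
    b = math.e ** (gamma / alpha) - 1                        # from alpha * ln(b + 1) = gamma
    inlier = (alpha / b) * (b * x + 1) * torch.log(b * x / beta + 1) - alpha * x
    outlier = gamma * x + gamma / b - alpha * beta           # chosen so both branches meet at |x| = beta
    return torch.where(x < beta, inlier, outlier).mean()

# Usage sketch on dummy bounding-box regression offsets.
pred = torch.randn(8, 4, requires_grad=True)
target = torch.randn(8, 4)
loss = balanced_l1_loss(pred, target)
loss.backward()
```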
Weakly supervised semantic segmentation via saliency perception with uncertainty-guided noise suppression
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03574-1
Xinyi Liu, Guoheng Huang, Xiaochen Yuan, Zewen Zheng, Guo Zhong, Xuhang Chen, Chi-Man Pun
Weakly Supervised Semantic Segmentation (WSSS) has become increasingly popular for achieving remarkable segmentation with only image-level labels. Current WSSS approaches extract Class Activation Mapping (CAM) from classification models to produce pseudo-masks for segmentation supervision. However, due to the gap between the image-level supervised classification loss and the pixel-level CAM generation task, the model tends to activate discriminative regions at the image level rather than pursuing pixel-level classification results. Moreover, insufficient supervision leads to unrestricted attention diffusion in the model, further introducing inter-class recognition noise. In this paper, we introduce a framework that employs Saliency Perception and Uncertainty, which includes a Saliency Perception Module (SPM) with a Pixel-wise Transfer Loss (SP-PT) and an Uncertainty-guided Noise Suppression method. Specifically, within the SPM, we employ a hybrid attention mechanism to expand the receptive field of the module and enhance its ability to perceive salient object features. Meanwhile, the Pixel-wise Transfer Loss is designed to guide the attention diffusion of the classification model to non-discriminative regions at the pixel level, thereby mitigating the bias of the model. To further enhance the robustness of CAM for obtaining more accurate pseudo-masks, we propose a noise suppression method based on uncertainty estimation, which applies a confidence matrix to the loss function to suppress the propagation of erroneous information and correct it, making the model more robust to noise. We conducted experiments on the PASCAL VOC 2012 and MS COCO 2014 datasets, and the results demonstrate the effectiveness of our proposed framework. Code is available at https://github.com/pur-suit/SPU.
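As a rough illustration of uncertainty-guided noise suppression — down-weighting pixels whose pseudo-labels look unreliable — here is a minimal PyTorch sketch; using softmax entropy as the confidence estimator is our assumption, not necessarily the paper's choice.

```python
import math
import torch
import torch.nn.functional as F

def uncertainty_weighted_ce(logits, pseudo_labels):
    """Cross-entropy on pseudo-labels, weighted by a per-pixel confidence matrix.
    Confidence is taken as 1 - normalized softmax entropy (an assumed estimator)."""
    probs = F.softmax(logits, dim=1)                              # (N, C, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)       # (N, H, W)
    confidence = 1.0 - entropy / math.log(logits.shape[1])        # in [0, 1]
    per_pixel_ce = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (confidence.detach() * per_pixel_ce).mean()

# Usage sketch with dummy data: 2 images, 21 classes, 64x64 pseudo-masks.
logits = torch.randn(2, 21, 64, 64, requires_grad=True)
pseudo = torch.randint(0, 21, (2, 64, 64))
loss = uncertainty_weighted_ce(logits, pseudo)
loss.backward()
```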
FuseNet: a multi-modal feature fusion network for 3D shape classification
Pub Date: 2024-07-26 | DOI: 10.1007/s00371-024-03581-2
Xin Zhao, Yinhuang Chen, Chengzhuan Yang, Lincong Fang
Recently, the primary focus of research in 3D shape classification has been on point cloud and multi-view methods. However, multi-view approaches inevitably lose the structural information of 3D shapes due to camera angle limitations, while point cloud methods obtain a global feature by max pooling over all points, losing local detail. These disadvantages limit the performance of 3D shape classification. This paper proposes a novel FuseNet model, which integrates multi-view and point cloud information and significantly improves the accuracy of 3D model classification. First, we propose multi-view and point cloud branches to obtain the raw features of different convolution layers from the multiple views and the point cloud. Second, we adopt a multi-view pooling method to fuse features across views and integrate features of different convolution layers more effectively, and we propose an attention-based multi-view and point cloud fusion block for integrating the point cloud and multi-view features. Finally, we extensively tested our method on three benchmark datasets: ModelNet10, ModelNet40, and ShapeNet Core55. The experimental results demonstrate classification performance superior or comparable to previously established state-of-the-art techniques for 3D shape classification.
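Attention-based fusion of a point cloud descriptor and a multi-view descriptor can be pictured as learned weighting over the two modalities; the sketch below is an illustrative PyTorch module under that assumption, not FuseNet's actual fusion block.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative fusion of a point-cloud descriptor and a multi-view descriptor
    via learned attention weights over the two modalities."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, pc_feat: torch.Tensor, mv_feat: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([pc_feat, mv_feat], dim=1)       # (B, 2, D)
        weights = torch.softmax(self.score(stacked), dim=1)    # (B, 2, 1) per-modality attention
        return (weights * stacked).sum(dim=1)                  # (B, D) fused descriptor

# Usage sketch with dummy global features from the two branches.
fuse = AttentionFusion(1024)
fused = fuse(torch.randn(8, 1024), torch.randn(8, 1024))
```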
LKSMN: Large Kernel Spatial Modulation Network for Lightweight Image Super-Resolution
Pub Date: 2024-07-25 | DOI: 10.1007/s00371-024-03562-5
Yubo Zhang, Lei Xu, Haibin Xiang, Haihua Kong, Junhao Bi, Chao Han
Although ViT-based networks have achieved stunning results in image super-resolution, their self-attention (SA) modeling along a single dimension greatly limits reconstruction performance. In addition, the high resource consumption of SA limits its application scenarios. In this study, we explore the working mechanism of SA and redesign its key structures to retain powerful modeling capabilities while reducing resource consumption. Further, we propose the large kernel spatial modulation network (LKSMN); it can leverage the complementary strengths of attention along the spatial and channel dimensions to mine a fuller range of potential correlations. Specifically, LKSMN includes three effective designs. First, we propose multi-scale spatial modulation attention (MSMA) based on convolutional modulation (CM) and large-kernel convolution decomposition (LKCD). Instead of generating feature-relevance scores via queries and keys as in SA, MSMA uses LKCD to act directly on the input features and produce convolutional features that imitate the relevance score matrix. This reduces the computational and storage overhead of SA while retaining its ability to robustly model long-range dependencies. Second, we introduce multi-dconv head transposed attention (MDTA) as an attention modeling scheme in the channel dimension, which complements our MSMA to model pixel interactions in both dimensions simultaneously. Finally, we propose a multi-level feature aggregation module (MLFA) that aggregates feature information extracted by modules at different depths of the network, avoiding the loss of shallow feature information. Extensive experiments demonstrate that our proposed method achieves competitive results at a small network scale (e.g., 26.33 dB on Urban100 ×4 with only 253K parameters). The code is available at https://figshare.com/articles/software/LKSMN_Large_Kernel_Spatial_Modulation_Network_for_Lightweight_Image_Super-Resolution/25603893
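Convolutional modulation of this kind (as popularized by VAN/Conv2Former) replaces the query–key score matrix with a decomposed large-kernel depthwise convolution whose output gates a value branch; the sketch below shows that generic pattern and is not LKSMN's MSMA module.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Generic convolutional-modulation block: a decomposed large-kernel depthwise
    convolution produces attention-like weights that modulate a value projection."""
    def __init__(self, dim: int = 48):
        super().__init__()
        # Large kernel decomposed into depthwise 5x5 + dilated depthwise 7x7 + pointwise 1x1.
        self.lk = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),
            nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.value = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.lk(x) * self.value(x))  # element-wise modulation replaces SA

# Usage sketch on a dummy feature map.
block = ConvModulation(48)
y = block(torch.randn(1, 48, 64, 64))   # shape preserved: (1, 48, 64, 64)
```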
EDM: a enhanced diffusion models for image restoration in complex scenes
Pub Date: 2024-07-24 | DOI: 10.1007/s00371-024-03549-2
JiaYan Wen, YuanSheng Zhuang, JunYi Deng
At present, diffusion models have achieved state-of-the-art performance by modeling the image synthesis process through a series of denoising network applications. Unlike image synthesis, image restoration (IR) aims to improve the subjective quality of images corrupted by various kinds of degradation. However, IR for complex scenes such as worksite images is highly challenging in the low-level vision field due to complicated environmental factors. To solve this problem, we propose an enhanced diffusion model for image restoration in complex scenes (EDM). It improves the authenticity and representation ability of the generation process while effectively handling complex backgrounds and diverse object types. EDM has three main contributions: (1) Its framework adopts a Mish-based residual module, which enhances the ability to learn complex image patterns and allows for the presence of negative gradients, reducing overfitting risks during model training. (2) It employs a mixed-head self-attention mechanism, which augments the correlation among input elements at each time step and maintains a better balance between capturing the global structural information and the local detailed textures of the image. (3) This study evaluates EDM on a self-built dataset specifically tailored for worksite image restoration, named "Workplace," and compares the results with those on two public datasets, Places2 and Rain100H. Furthermore, the experiments on these datasets demonstrate not only EDM's application value in a specific domain but also its potential and versatility in broader image restoration tasks. Code, dataset and models are available at: https://github.com/Zhuangvictor0/EDM-A-Enhanced-Diffusion-Models-for-Image-Restoration-in-Complex-Scenes
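To make the "Mish-based residual module" concrete, here is a minimal PyTorch residual block using the Mish activation, x · tanh(softplus(x)); the layer arrangement is an assumption for illustration, not EDM's exact module.

```python
import torch
import torch.nn as nn

class MishResidualBlock(nn.Module):
    """Illustrative residual block with Mish activations, which pass small
    negative values instead of zeroing them as ReLU does."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Mish(),                                   # x * tanh(softplus(x))
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)   # skip connection keeps the identity path

# Usage sketch on a dummy denoising feature map.
block = MishResidualBlock(64)
out = block(torch.randn(1, 64, 32, 32))
```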
Video anomaly detection with both normal and anomaly memory modules
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03584-z
Liang Zhang, Shifeng Li, Xi Luo, Xiaoru Liu, Ruixuan Zhang
In this paper, we propose a novel framework for video anomaly detection that employs dual memory modules, one for normal patterns and one for anomaly patterns. By maintaining separate memory modules, our approach captures a broader range of video data behaviors. We begin by generating pseudo-anomalies with a temporal pseudo-anomaly synthesizer; these data train the anomaly memory module, while normal data train the normal memory module. To distinguish normal from anomalous data, we introduce a loss function that computes the memory loss between the two memory modules. We enhance the memory modules by incorporating an entropy loss and a hard shrinkage rectified linear unit (ReLU). Additionally, we integrate skip connections into our model so that the memory modules capture comprehensive patterns beyond prototypical representations. Extensive experiments and analysis on various challenging video anomaly datasets validate the effectiveness of our approach in detecting anomalies. The code for our method is available at https://github.com/SVIL2024/Pseudo-Anomaly-MemAE.
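The memory mechanics referenced here — similarity-based addressing, a hard shrinkage ReLU on the addressing weights, and an entropy loss — follow the general MemAE recipe; the sketch below shows that generic read operation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """MemAE-style memory read: cosine addressing, hard-shrunk sparse weights,
    and an entropy term that encourages using few memory slots per query."""
    def __init__(self, num_slots: int = 100, dim: int = 256, shrink: float = 0.0025):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_slots, dim))
        self.shrink = shrink

    def forward(self, query: torch.Tensor):
        w = torch.softmax(F.linear(F.normalize(query, dim=1),
                                   F.normalize(self.memory, dim=1)), dim=1)      # (B, num_slots)
        w = F.relu(w - self.shrink) * w / (torch.abs(w - self.shrink) + 1e-12)   # hard shrinkage ReLU
        w = F.normalize(w, p=1, dim=1)                                           # re-normalize weights
        read = w @ self.memory                                                   # (B, dim) prototype read
        entropy = (-w * torch.log(w + 1e-12)).sum(dim=1).mean()                  # sparsity (entropy) loss
        return read, entropy

# Usage sketch: read prototypes for a batch of encoder features.
mem = MemoryModule(num_slots=100, dim=256)
features, ent_loss = mem(torch.randn(8, 256))
```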
Learning geometric invariants through neural networks
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03398-z
Arpit Rai
Convolutional neural networks have become a fundamental model for solving various computer vision tasks. However, these operations are invariant only to translations of objects, and their performance suffers under rotation and other affine transformations. This work proposes a novel neural network that leverages geometric invariants, including curvature, higher-order differentials of curves extracted from object boundaries at multiple scales, and the relative orientations of edges. These features are invariant to affine transformations and can improve the robustness of shape recognition in neural networks. In our experiments on the smallNORB dataset, a 2-layer network operating over these geometric invariants outperforms a 3-layer convolutional network by 9.69% while being more robust to affine transformations, even when trained without any data augmentation. Notably, our network exhibits a mere 6% degradation in test accuracy when test images are rotated by 40°, in contrast to the significant drops of 51.7% and 69% observed in VGG networks and convolutional networks, respectively, under the same transformations. Additionally, our models show greater robustness than invariant feature descriptors such as the SIFT-based bag-of-words classifier and its rotation-invariant extension, the RIFT descriptor, which suffer drops of 35% and 14.1%, respectively, under similar image transformations. Our experimental results further show improved robustness against scale and shear transformations. Furthermore, the multi-scale extension of our geometric invariant network, which extracts higher-order curve differentials, shows enhanced robustness to scaling and shearing transformations.
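The curvature invariant mentioned above is, for a planar boundary curve, κ = (x′y″ − y′x″)/(x′² + y′²)^{3/2}; the NumPy sketch below computes it with finite differences purely as an illustration and is unrelated to the author's network code.

```python
import numpy as np

def boundary_curvature(points: np.ndarray) -> np.ndarray:
    """Curvature of a 2D boundary curve sampled as an (N, 2) array of (x, y) points,
    using finite differences: kappa = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2)."""
    x, y = points[:, 0], points[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)          # first derivatives
    ddx, ddy = np.gradient(dx), np.gradient(dy)      # second derivatives
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-12) ** 1.5

# Usage sketch: a unit circle has constant curvature of about 1.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
print(boundary_curvature(circle).round(2))           # ~1.0, with small errors at the array ends
```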