The Visual Computer最新文献_第8页

PackMolds: computational design of packaging molds for thermoforming PackMolds：热成型包装模具的计算设计

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03462-8

Naoki Kita

We present a novel technique for designing molds suitable for desktop thermoforming, specifically for creating packaging such as blister packs. Our molds, PackMolds, feature neither undercuts nor negative draft angles, facilitating their easy release from thermoformed plastic sheets. In this study, we optimize the geometry of PackMolds to comply with user-specified draft angle constraints. Instead of simulating the traditional thermoforming process, which necessitates time discretization and specifying detailed parameters for both material properties and machine configuration to achieve an accurate simulation result, we formulate our problem as a constrained geometric optimization problem and solve it using a gradient-based solver. Additionally, in contrast to industrial thermoforming, which benefits from advanced tools, desktop thermoforming lacks such sophisticated resources. Therefore, we introduce a suite of assistive tools to enhance the success of desktop thermoforming. Furthermore, we demonstrate its wide applicability by showcasing its use in not only designing blister packs but also in creating double-sided blister packs and model stands.

我们提出了一种新技术，用于设计适合台式热成型的模具，特别是用于制造吸塑包装等包装。我们的模具（PackMolds）既没有底切，也没有负拔模角，因此易于从热成型塑料板上脱模。在这项研究中，我们优化了 PackMolds 的几何形状，使其符合用户指定的拔模角度限制。传统的热成型过程需要对时间进行离散化处理，并指定材料属性和机器配置的详细参数，才能获得精确的模拟结果，而我们并没有模拟传统的热成型过程，而是将问题表述为受限几何优化问题，并使用基于梯度的求解器进行求解。此外，与受益于先进工具的工业热成型相比，桌面热成型缺乏此类复杂资源。因此，我们引入了一套辅助工具，以提高桌面热成型的成功率。此外，我们还展示了其广泛的适用性，不仅用于设计吸塑包装，还用于创建双面吸塑包装和模型支架。

引用次数: 0

Learning kernel parameter lookup tables to implement adaptive bilateral filtering 学习内核参数查找表以实现自适应双边滤波

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03553-6

Runtao Xi, Jiahao Lyu, Kang Sun, Tian Ma

Bilateral filtering is a widely used image smoothing filter that preserves image edges while also smoothing texture. In previous research, the focus of improving the bilateral filter has primarily been on constructing an adaptive range kernel. However, recent research has shown that even slight noise perturbations can prevent the bilateral filter from effectively preserving image edges. To address this issue, we employ a neural network to learn the kernel parameters that can effectively counteract noise perturbations. Additionally, to enhance the adaptability of the learned kernel parameters to the local edge features of the image, we utilize the edge-sensitive indexing method to construct kernel parameter lookup tables (LUTs). During testing, we determine the appropriate spatial kernel and range kernel parameters for each pixel using a lookup table and interpolation. This allows us to effectively smooth the image in the presence of noise perturbation. In this paper, we conducted comparative experiments on several datasets to verify that the proposed method outperforms existing bilateral filtering methods in preserving image structure, removing image texture, and resisting slight noise perturbations. The code is available at https://github.com/FightingSrain/AdaBFLUT.

双边滤波是一种广泛使用的图像平滑滤波器，它在保留图像边缘的同时还能平滑纹理。在以往的研究中，改进双边滤波器的重点主要是构建自适应范围内核。然而，最近的研究表明，即使是轻微的噪声扰动也会使双边滤波器无法有效保留图像边缘。为了解决这个问题，我们采用神经网络来学习能有效抵消噪声扰动的内核参数。此外，为了提高学习到的内核参数对图像局部边缘特征的适应性，我们利用边缘敏感索引方法来构建内核参数查找表（LUT）。在测试过程中，我们使用查找表和插值法为每个像素确定合适的空间内核和范围内核参数。这样，我们就能在噪声扰动的情况下有效地平滑图像。在本文中，我们在多个数据集上进行了对比实验，验证了所提出的方法在保留图像结构、去除图像纹理和抵抗轻微噪声扰动方面优于现有的双边滤波方法。代码见 https://github.com/FightingSrain/AdaBFLUT。

{"title":"Learning kernel parameter lookup tables to implement adaptive bilateral filtering","authors":"Runtao Xi, Jiahao Lyu, Kang Sun, Tian Ma","doi":"10.1007/s00371-024-03553-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03553-6","url":null,"abstract":"Bilateral filtering is a widely used image smoothing filter that preserves image edges while also smoothing texture. In previous research, the focus of improving the bilateral filter has primarily been on constructing an adaptive range kernel. However, recent research has shown that even slight noise perturbations can prevent the bilateral filter from effectively preserving image edges. To address this issue, we employ a neural network to learn the kernel parameters that can effectively counteract noise perturbations. Additionally, to enhance the adaptability of the learned kernel parameters to the local edge features of the image, we utilize the edge-sensitive indexing method to construct kernel parameter lookup tables (LUTs). During testing, we determine the appropriate spatial kernel and range kernel parameters for each pixel using a lookup table and interpolation. This allows us to effectively smooth the image in the presence of noise perturbation. In this paper, we conducted comparative experiments on several datasets to verify that the proposed method outperforms existing bilateral filtering methods in preserving image structure, removing image texture, and resisting slight noise perturbations. The code is available at https://github.com/FightingSrain/AdaBFLUT.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141529902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

NeRF-FF: a plug-in method to mitigate defocus blur for runtime optimized neural radiance fields NeRF-FF：为运行优化神经辐射场减轻散焦模糊的插件方法

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03507-y

Tristan Wirth, Arne Rak, Max von Buelow, Volker Knauthe, Arjan Kuijper, Dieter W. Fellner

Neural radiance fields (NeRFs) have revolutionized novel view synthesis, leading to an unprecedented level of realism in rendered images. However, the reconstruction quality of NeRFs suffers significantly from out-of-focus regions in the input images. We propose NeRF-FF, a plug-in strategy that estimates image masks based on Focus Frustums (FFs), i.e., the visible volume in the scene space that is in-focus. NeRF-FF enables a subsequently trained NeRF model to omit out-of-focus image regions during the training process. Existing methods to mitigate the effects of defocus blurred input images often leverage dynamic ray generation. This makes them incompatible with the static ray assumptions employed by runtime-performance-optimized NeRF variants, such as Instant-NGP, leading to high training times. Our experiments show that NeRF-FF outperforms state-of-the-art approaches regarding training time by two orders of magnitude—reducing it to under 1 min on end-consumer hardware—while maintaining comparable visual quality.

神经辐射场（NeRFs）给新颖的视图合成带来了革命性的变化，使渲染图像的逼真度达到了前所未有的水平。然而，NeRF 的重建质量受到输入图像中焦外区域的严重影响。我们提出了 NeRF-FF，这是一种基于 FF（Focus Frustums）（即场景空间中处于焦点内的可见体积）估算图像遮罩的插件策略。NeRF-FF 使随后训练的 NeRF 模型能够在训练过程中省略失焦图像区域。现有的减轻离焦模糊输入图像影响的方法通常利用动态光线生成。这使得它们与运行时性能优化的 NeRF 变体（如 Instant-NGP）所采用的静态射线假设不兼容，从而导致训练时间过长。我们的实验表明，在训练时间方面，NeRF-FF 优于最先进的方法两个数量级--在终端消费者硬件上，训练时间缩短到 1 分钟以下--同时保持了相当的视觉质量。

引用次数: 0

DCSG: data complement pseudo-label refinement and self-guided pre-training for unsupervised person re-identification DCSG：数据补充伪标签完善和无监督人员再识别的自我指导预训练

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03542-9

Qing Han, Jiongjin Chen, Weidong Min, Jiahao Li, Lixin Zhan, Longfei Li

Existing unsupervised person re-identification (Re-ID) methods use clustering to generate pseudo-labels that are generally noisy, and initializing the model with ImageNet pre-training weights introduces a large domain gap that severely impacts the model’s performance. To address the aforementioned issues, we propose the data complement pseudo-label refinement and self-guided pre-training framework, referred to as DCSG. Firstly, our method utilizes image information from multiple augmentation views to complement the source image data, resulting in aggregated information. We employ this aggregated information to design a correlation score that serves as a reliability evaluation for the source features and cluster centroids. By optimizing the pseudo-labels for each sample, we enhance their robustness. Secondly, we propose a pre-training strategy that leverages the potential information within the training process. This strategy involves mining classes with high similarity in the training set to guide model training and facilitate smooth pre-training. Consequently, the model acquires preliminary capabilities to distinguish pedestrian-related features at an early stage of training, thereby reducing the impact of domain gaps arising from ImageNet pre-training weights. Our method demonstrates superior performance on multiple person Re-ID datasets, validating the effectiveness of our proposed approach. Notably, it achieves an mAP metric of 84.3% on the Market1501 dataset, representing a 2.8% improvement compared to the state-of-the-art method. The code is available at https://github.com/duolaJohn/DCSG.git.

现有的无监督人员再识别（Re-ID）方法使用聚类来生成伪标签，而这些伪标签通常是有噪声的，而且使用 ImageNet 预训练权重初始化模型会带来很大的领域差距，严重影响模型的性能。为了解决上述问题，我们提出了数据补充伪标签完善和自引导预训练框架，简称 DCSG。首先，我们的方法利用来自多个增强视图的图像信息对源图像数据进行补充，从而产生聚合信息。我们利用这些聚合信息来设计一个相关性得分，作为源特征和聚类中心点的可靠性评估。通过优化每个样本的伪标签，我们增强了它们的鲁棒性。其次，我们提出了一种在训练过程中利用潜在信息的预训练策略。该策略包括挖掘训练集中相似度高的类别，以指导模型训练并促进预训练的顺利进行。因此，模型在训练的早期阶段就获得了分辨行人相关特征的初步能力，从而减少了 ImageNet 预训练权重带来的领域差距的影响。我们的方法在多个人物再识别数据集上表现出了卓越的性能，验证了我们提出的方法的有效性。值得注意的是，它在 Market1501 数据集上实现了 84.3% 的 mAP 指标，与最先进的方法相比提高了 2.8%。代码见 https://github.com/duolaJohn/DCSG.git。

{"title":"DCSG: data complement pseudo-label refinement and self-guided pre-training for unsupervised person re-identification","authors":"Qing Han, Jiongjin Chen, Weidong Min, Jiahao Li, Lixin Zhan, Longfei Li","doi":"10.1007/s00371-024-03542-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03542-9","url":null,"abstract":"Existing unsupervised person re-identification (Re-ID) methods use clustering to generate pseudo-labels that are generally noisy, and initializing the model with ImageNet pre-training weights introduces a large domain gap that severely impacts the model’s performance. To address the aforementioned issues, we propose the data complement pseudo-label refinement and self-guided pre-training framework, referred to as DCSG. Firstly, our method utilizes image information from multiple augmentation views to complement the source image data, resulting in aggregated information. We employ this aggregated information to design a correlation score that serves as a reliability evaluation for the source features and cluster centroids. By optimizing the pseudo-labels for each sample, we enhance their robustness. Secondly, we propose a pre-training strategy that leverages the potential information within the training process. This strategy involves mining classes with high similarity in the training set to guide model training and facilitate smooth pre-training. Consequently, the model acquires preliminary capabilities to distinguish pedestrian-related features at an early stage of training, thereby reducing the impact of domain gaps arising from ImageNet pre-training weights. Our method demonstrates superior performance on multiple person Re-ID datasets, validating the effectiveness of our proposed approach. Notably, it achieves an mAP metric of 84.3% on the Market1501 dataset, representing a 2.8% improvement compared to the state-of-the-art method. The code is available at https://github.com/duolaJohn/DCSG.git.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"22-23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141529900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Distribution-decouple learning network: an innovative approach for single image dehazing with spatial and frequency decoupling 分布-解耦学习网络：空间和频率解耦的单幅图像去噪创新方法

The Visual Computer

Pub Date : 2024-06-28 DOI: 10.1007/s00371-024-03556-3

Yabo Wu, Wenting Li, Ziyang Chen, Hui Wen, Zhongwei Cui, Yongjun Zhang

Image dehazing methods face challenges in addressing the high coupling between haze and object feature distributions in the spatial and frequency domains. This coupling often results in oversharpening, color distortion, and blurring of details during the dehazing process. To address these issues, we introduce the distribution-decouple module (DDM) and dual-frequency attention mechanism (DFAM). The DDM works effectively in the spatial domain, decoupling haze and object features through a feature decoupler and then uses a two-stream modulator to further reduce the negative impact of haze on the distribution of object features. Simultaneously, the DFAM focuses on decoupling information in the frequency domain, separating high- and low-frequency information and applying attention to different frequency components for frequency calibration. Finally, we introduce a novel dehazing network, the distribution-decouple learning network for single image dehazing with spatial and frequency decoupling (DDLNet). This network integrates DDM and DFAM, effectively addressing the issue of coupled feature distributions in both spatial and frequency domains, thereby enhancing the clarity and fidelity of the dehazed images. Extensive experiments indicate the outperformance of our DDLNet when compared to the state-of-the-art (SOTA) methods, achieving a 1.50 dB increase in PSNR on the SOTS-indoor dataset. Concomitantly, it indicates a 1.26 dB boost on the SOTS-outdoor dataset. Additionally, our method performs significantly well on the nighttime dehazing dataset NHR, achieving a 0.91 dB improvement. Code and trained models are available at https://github.com/aoe-wyb/DDLNet.

图像去毛刺方法在解决灰度和物体特征分布在空间和频率域的高度耦合方面面临挑战。这种耦合往往会在去毛刺过程中导致过度锐化、色彩失真和细节模糊。为了解决这些问题，我们引入了分布解耦模块（DDM）和双频关注机制（DFAM）。DDM 在空间域有效工作，通过特征解耦器将雾霾和物体特征解耦，然后使用双流调制器进一步降低雾霾对物体特征分布的负面影响。同时，DFAM 专注于频域信息的解耦，分离高频和低频信息，并关注不同的频率成分，以进行频率校准。最后，我们介绍了一种新型去毛刺网络，即空间和频率去耦的单图像去毛刺分布-去耦学习网络（DDLNet）。该网络整合了 DDM 和 DFAM，有效解决了空间域和频率域的耦合特征分布问题，从而提高了去毛刺图像的清晰度和保真度。大量实验表明，与最先进的（SOTA）方法相比，我们的 DDLNet 性能更优，在 SOTS 室内数据集上的 PSNR 提高了 1.50 dB。同时，在 SOTS-outdoor 数据集上也提高了 1.26 分贝。此外，我们的方法在夜间去噪数据集 NHR 上的表现也非常出色，提高了 0.91 分贝。代码和训练好的模型可在 https://github.com/aoe-wyb/DDLNet 上获取。

{"title":"Distribution-decouple learning network: an innovative approach for single image dehazing with spatial and frequency decoupling","authors":"Yabo Wu, Wenting Li, Ziyang Chen, Hui Wen, Zhongwei Cui, Yongjun Zhang","doi":"10.1007/s00371-024-03556-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03556-3","url":null,"abstract":"Image dehazing methods face challenges in addressing the high coupling between haze and object feature distributions in the spatial and frequency domains. This coupling often results in oversharpening, color distortion, and blurring of details during the dehazing process. To address these issues, we introduce the distribution-decouple module (DDM) and dual-frequency attention mechanism (DFAM). The DDM works effectively in the spatial domain, decoupling haze and object features through a feature decoupler and then uses a two-stream modulator to further reduce the negative impact of haze on the distribution of object features. Simultaneously, the DFAM focuses on decoupling information in the frequency domain, separating high- and low-frequency information and applying attention to different frequency components for frequency calibration. Finally, we introduce a novel dehazing network, the distribution-decouple learning network for single image dehazing with spatial and frequency decoupling (DDLNet). This network integrates DDM and DFAM, effectively addressing the issue of coupled feature distributions in both spatial and frequency domains, thereby enhancing the clarity and fidelity of the dehazed images. Extensive experiments indicate the outperformance of our DDLNet when compared to the state-of-the-art (SOTA) methods, achieving a 1.50 dB increase in PSNR on the SOTS-indoor dataset. Concomitantly, it indicates a 1.26 dB boost on the SOTS-outdoor dataset. Additionally, our method performs significantly well on the nighttime dehazing dataset NHR, achieving a 0.91 dB improvement. Code and trained models are available at https://github.com/aoe-wyb/DDLNet.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141500571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Multi-granularity hypergraph-guided transformer learning framework for visual classification 用于视觉分类的多粒度超图引导变换器学习框架

The Visual Computer

Pub Date : 2024-06-28 DOI: 10.1007/s00371-024-03541-w

Jianjian Jiang, Ziwei Chen, Fangyuan Lei, Long Xu, Jiahao Huang, Xiaochen Yuan

Fine-grained single-label classification tasks aim to distinguish highly similar categories but often overlook inter-category relationships. Hierarchical multi-granularity visual classification strives to categorize image labels at various hierarchy levels, offering optimize label selection for people. This paper addresses the hierarchical multi-granularity classification problem from two perspectives: (1) effective utilization of labels at different levels and (2) efficient learning to distinguish multi-granularity visual features. To tackle these issues, we propose a novel multi-granularity hypergraph-guided transformer learning framework (MHTL), seamlessly integrating swin transformers and hypergraph neural networks for handling visual classification tasks. Firstly, we employ swin transformer as an image hierarchical feature learning (IHFL) module to capture hierarchical features. Secondly, a feature reassemble (FR) module is applied to rearrange features at different hierarchy levels, creating a spectrum of features from coarse to fine-grained. Thirdly, we propose a feature relationship mining (FRM) module, to unveil the correlation between features at different granularity. Within this module, we introduce a learnable hypergraph modeling method to construct coarse to fine-grained hypergraph structures. Simultaneously, multi-granularity hypergraph neural networks are employed to explore grouping relationships across different granularities, thereby enhancing the learning of semantic feature representations. Finally, we adopt a multi-granularity classifier (MC) to predict hierarchical label probabilities. Experimental results demonstrate that MHTL outperforms other state-of-the-art classification methods across three multi-granularity datasets. The source code and models are released at https://github.com/JJJTF/MHTL.

精细的单标签分类任务旨在区分高度相似的类别，但往往忽略了类别间的关系。分层多粒度视觉分类致力于将图像标签按不同层次进行分类，为人们提供最优化的标签选择。本文从两个方面探讨了分层多粒度分类问题：(1) 有效利用不同层次的标签；(2) 高效学习区分多粒度视觉特征。为了解决这些问题，我们提出了一种新颖的多粒度超图引导变换器学习框架（MHTL），无缝集成了swin变换器和超图神经网络来处理视觉分类任务。首先，我们采用swin变换器作为图像分层特征学习（IHFL）模块来捕捉分层特征。其次，应用特征重组合（FR）模块重新排列不同层次的特征，创建从粗粒度到细粒度的特征谱。第三，我们提出了一个特征关系挖掘（FRM）模块，以揭示不同粒度特征之间的相关性。在该模块中，我们引入了一种可学习的超图建模方法，以构建从粗粒度到细粒度的超图结构。同时，我们采用多粒度超图神经网络来探索不同粒度的分组关系，从而加强语义特征表征的学习。最后，我们采用多粒度分类器（MC）来预测分层标签概率。实验结果表明，MHTL 在三个多粒度数据集上的表现优于其他最先进的分类方法。源代码和模型发布于 https://github.com/JJJTF/MHTL。

{"title":"Multi-granularity hypergraph-guided transformer learning framework for visual classification","authors":"Jianjian Jiang, Ziwei Chen, Fangyuan Lei, Long Xu, Jiahao Huang, Xiaochen Yuan","doi":"10.1007/s00371-024-03541-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03541-w","url":null,"abstract":"Fine-grained single-label classification tasks aim to distinguish highly similar categories but often overlook inter-category relationships. Hierarchical multi-granularity visual classification strives to categorize image labels at various hierarchy levels, offering optimize label selection for people. This paper addresses the hierarchical multi-granularity classification problem from two perspectives: (1) effective utilization of labels at different levels and (2) efficient learning to distinguish multi-granularity visual features. To tackle these issues, we propose a novel multi-granularity hypergraph-guided transformer learning framework (MHTL), seamlessly integrating swin transformers and hypergraph neural networks for handling visual classification tasks. Firstly, we employ swin transformer as an image hierarchical feature learning (IHFL) module to capture hierarchical features. Secondly, a feature reassemble (FR) module is applied to rearrange features at different hierarchy levels, creating a spectrum of features from coarse to fine-grained. Thirdly, we propose a feature relationship mining (FRM) module, to unveil the correlation between features at different granularity. Within this module, we introduce a learnable hypergraph modeling method to construct coarse to fine-grained hypergraph structures. Simultaneously, multi-granularity hypergraph neural networks are employed to explore grouping relationships across different granularities, thereby enhancing the learning of semantic feature representations. Finally, we adopt a multi-granularity classifier (MC) to predict hierarchical label probabilities. Experimental results demonstrate that MHTL outperforms other state-of-the-art classification methods across three multi-granularity datasets. The source code and models are released at https://github.com/JJJTF/MHTL.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141529903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TransFGVC: transformer-based fine-grained visual classification TransFGVC：基于变换器的细粒度视觉分类

The Visual Computer

Pub Date : 2024-06-28 DOI: 10.1007/s00371-024-03545-6

Longfeng Shen, Bin Hou, Yulei Jian, Xisong Tu, Yingjie Zhang, Lingying Shuai, Fangzhen Ge, Debao Chen

Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. This task is challenging owing to high intra-class variance and low inter-class variance. The most recent methods focus on locating discriminative areas and then training the classification network to further capture the subtle differences among them. On the one hand, the detection network often obtains an entire part of the object, and positioning errors occur. On the other hand, these methods ignore the correlations between the extracted regions. We propose a novel highly scalable approach, called TransFGVC, that cleverly combines Swin Transformers with long short-term memory (LSTM) networks to address the above problems. The Swin Transformer is used to obtain remarkable visual tokens through self-attention layer stacking, and LSTM is used to model them globally, which not only accurately locates the discriminative region but also further introduces global information that is important for FGVC. The proposed method achieves competitive performance with accuracy rates of 92.7%, 91.4% and 91.5% using the public CUB-200-2011 and NABirds datasets and our Birds-267-2022 dataset, and the Params and FLOPs of our method are 25% and 27% lower, respectively, than the current SotA method HERBS. To effectively promote the development of FGVC, we developed the Birds-267-2022 dataset, which has 267 categories and 12,233 images.

细粒度视觉分类（FGVC）旨在识别同一超类中的物体子类。由于类内差异大而类间差异小，这项任务具有挑战性。最新的方法主要集中在定位分辨区域，然后训练分类网络，以进一步捕捉它们之间的细微差别。一方面，检测网络往往只能获得物体的整个部分，因此会出现定位误差。另一方面，这些方法忽略了提取区域之间的相关性。我们提出了一种高度可扩展的新方法，称为 TransFGVC，它巧妙地将 Swin 变换器与长短期记忆（LSTM）网络相结合，以解决上述问题。Swin Transformer 用于通过自注意层堆叠获得显著的视觉标记，而 LSTM 则用于对其进行全局建模，这不仅能准确定位分辨区域，还能进一步引入对 FGVC 非常重要的全局信息。所提出的方法在使用公开的 CUB-200-2011 和 NABirds 数据集以及我们的 Birds-267-2022 数据集时，准确率分别达到了 92.7%、91.4% 和 91.5%，而且我们方法的 Params 和 FLOPs 分别比目前的 SotA 方法 HERBS 低 25% 和 27%，性能极具竞争力。为了有效促进 FGVC 的发展，我们开发了 Birds-267-2022 数据集，该数据集有 267 个类别和 12,233 幅图像。

{"title":"TransFGVC: transformer-based fine-grained visual classification","authors":"Longfeng Shen, Bin Hou, Yulei Jian, Xisong Tu, Yingjie Zhang, Lingying Shuai, Fangzhen Ge, Debao Chen","doi":"10.1007/s00371-024-03545-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03545-6","url":null,"abstract":"Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. This task is challenging owing to high intra-class variance and low inter-class variance. The most recent methods focus on locating discriminative areas and then training the classification network to further capture the subtle differences among them. On the one hand, the detection network often obtains an entire part of the object, and positioning errors occur. On the other hand, these methods ignore the correlations between the extracted regions. We propose a novel highly scalable approach, called TransFGVC, that cleverly combines Swin Transformers with long short-term memory (LSTM) networks to address the above problems. The Swin Transformer is used to obtain remarkable visual tokens through self-attention layer stacking, and LSTM is used to model them globally, which not only accurately locates the discriminative region but also further introduces global information that is important for FGVC. The proposed method achieves competitive performance with accuracy rates of 92.7%, 91.4% and 91.5% using the public CUB-200-2011 and NABirds datasets and our Birds-267-2022 dataset, and the Params and FLOPs of our method are 25% and 27% lower, respectively, than the current SotA method HERBS. To effectively promote the development of FGVC, we developed the Birds-267-2022 dataset, which has 267 categories and 12,233 images.\u0000","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Training a shadow removal network using only 3D primitive occluders 仅使用 3D 原始遮挡器训练阴影消除网络

The Visual Computer

Pub Date : 2024-06-27 DOI: 10.1007/s00371-024-03536-7

Neil Patrick Del Gallego, Joel Ilao, Macario II Cordel, Conrado Ruiz

Removing shadows in images is often a necessary pre-processing task for improving the performance of computer vision applications. Deep learning shadow removal approaches require a large-scale dataset that is challenging to gather. To address the issue of limited shadow data, we present a new and cost-effective method of synthetically generating shadows using 3D virtual primitives as occluders. We simulate the shadow generation process in a virtual environment where foreground objects are composed of mapped textures from the Places-365 dataset. We argue that complex shadow regions can be approximated by mixing primitives, analogous to how 3D models in computer graphics can be represented as triangle meshes. We use the proposed synthetic shadow removal dataset, DLSUSynthPlaces-100K, to train a feature-attention-based shadow removal network without explicit domain adaptation or style transfer strategy. The results of this study show that the trained network achieves competitive results with state-of-the-art shadow removal networks that were trained purely on typical SR datasets such as ISTD or SRD. Using a synthetic shadow dataset of only triangular prisms and spheres as occluders produces the best results. Therefore, the synthetic shadow removal dataset can be a viable alternative for future deep-learning shadow removal methods. The source code and dataset can be accessed at this link: https://neildg.github.io/SynthShadowRemoval/.

去除图像中的阴影通常是提高计算机视觉应用性能的必要预处理任务。深度学习阴影去除方法需要大规模的数据集，而数据集的收集具有挑战性。为了解决阴影数据有限的问题，我们提出了一种成本效益高的新方法，即使用三维虚拟基元作为遮挡物来合成生成阴影。我们在一个虚拟环境中模拟了阴影生成过程，该环境中的前景物体由 Places-365 数据集中的映射纹理组成。我们认为，复杂的阴影区域可以通过混合基元来近似，就像计算机图形学中的三维模型可以用三角形网格来表示一样。我们使用所提出的合成阴影去除数据集 DLSUSynthPlaces-100K 来训练基于特征关注的阴影去除网络，而无需明确的领域适应或风格转移策略。研究结果表明，训练出的网络与纯粹在典型 SR 数据集（如 ISTD 或 SRD）上训练出的最先进阴影消除网络相比，取得了具有竞争力的结果。仅使用三角棱镜和球体作为遮挡物的合成阴影数据集能产生最佳结果。因此，合成阴影去除数据集可以作为未来深度学习阴影去除方法的可行替代方案。源代码和数据集可从以下链接获取：https://neildg.github.io/SynthShadowRemoval/。

{"title":"Training a shadow removal network using only 3D primitive occluders","authors":"Neil Patrick Del Gallego, Joel Ilao, Macario II Cordel, Conrado Ruiz","doi":"10.1007/s00371-024-03536-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03536-7","url":null,"abstract":"Removing shadows in images is often a necessary pre-processing task for improving the performance of computer vision applications. Deep learning shadow removal approaches require a large-scale dataset that is challenging to gather. To address the issue of limited shadow data, we present a new and cost-effective method of synthetically generating shadows using 3D virtual primitives as occluders. We simulate the shadow generation process in a virtual environment where foreground objects are composed of mapped textures from the Places-365 dataset. We argue that complex shadow regions can be approximated by mixing primitives, analogous to how 3D models in computer graphics can be represented as triangle meshes. We use the proposed synthetic shadow removal dataset, DLSUSynthPlaces-100K, to train a feature-attention-based shadow removal network without explicit domain adaptation or style transfer strategy. The results of this study show that the trained network achieves competitive results with state-of-the-art shadow removal networks that were trained purely on typical SR datasets such as ISTD or SRD. Using a synthetic shadow dataset of only triangular prisms and spheres as occluders produces the best results. Therefore, the synthetic shadow removal dataset can be a viable alternative for future deep-learning shadow removal methods. The source code and dataset can be accessed at this link: https://neildg.github.io/SynthShadowRemoval/.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"101 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Vision transformers (ViT) and deep convolutional neural network (D-CNN)-based models for MRI brain primary tumors images multi-classification supported by explainable artificial intelligence (XAI) 基于视觉变换器（ViT）和深度卷积神经网络（D-CNN）的模型，在可解释人工智能（XAI）的支持下用于核磁共振成像脑原发性肿瘤图像的多重分类

The Visual Computer

Pub Date : 2024-06-26 DOI: 10.1007/s00371-024-03524-x

Hiba Mzoughi, Ines Njeh, Mohamed BenSlima, Nouha Farhat, Chokri Mhiri

The manual classification of primary brain tumors through Magnetic Resonance Imaging (MRI) is considered as a critical task during the clinical routines that requires highly qualified neuroradiologists. Deep Learning (DL)-based computer-aided diagnosis tools are established to support the neurosurgeons’ opinion during the diagnosis. However, the black-box nature and the lack of transparency and interpretability of such DL-based models make their implementation, especially in critical and sensitive medical applications, very difficult. The explainable artificial intelligence techniques help to gain clinicians’ confidence and to provide explanations about the models' predictions. Typical and existing Convolutional Neural Network (CNN)-based architectures could not capture long-range global information and feature from pathology MRI scans. Recently, Vision Transformer (ViT) networks have been introduced to solve the issue of long-range dependency in CNN-based architecture by introducing a self-attention mechanism to analyze images, allowing the network to capture deep long-range reliance between pixels. The purpose of the proposed study is to provide efficient CAD tool for MRI brain tumor classification. At the same, we aim to enhance the neuroradiologists' confidence when using DL in clinical and medical standards. In this paper, we investigated a deep ViT architecture trained from scratch for the multi-classification task of common primary tumors (gliomas, meningiomas, and pituitary brain tumors), using T1-weighted contrast-enhanced MRI sequences. Several XAI techniques have been adopted: Gradient-weighted Class Activation Mapping (Grad-CAM), Local Interpretable Model-agnostic Explanations (LIME), and SHapley Additive exPlanations (SHAP), to visualize the most significant and distinguishing features related to the model prediction results. A publicly available benchmark dataset has been used for the evaluation task. The comparative study confirms the efficiency of ViT architecture compared to the CNN model using the testing dataset. The test accuracy of 83.37% for the Convolutional Neural Network (CNN) and 91.61% for the Vision Transformer (ViT) indicates that the ViT model outperformed the CNN model in the classification task. Based on the experimental results, we could confirm that the proposed ViT model presents a competitive performance outperforming the multi-classification state-of-the-art models using MRI sequences. Further, the proposed models present an exact and correct interpretation. Thus, we could confirm that the proposed CAD could be established during the clinical diagnosis routines.

通过磁共振成像（MRI）对原发性脑肿瘤进行人工分类被认为是临床常规工作中的一项关键任务，需要高素质的神经放射科医生。基于深度学习（DL）的计算机辅助诊断工具的建立是为了在诊断过程中支持神经外科医生的意见。然而，这种基于深度学习的模型具有黑箱性质，缺乏透明度和可解释性，因此很难实施，尤其是在关键和敏感的医疗应用中。可解释的人工智能技术有助于获得临床医生的信任，并为模型的预测提供解释。典型的和现有的基于卷积神经网络（CNN）的架构无法从病理核磁共振扫描中捕捉到远距离全局信息和特征。最近，Vision Transformer（ViT）网络的出现解决了基于 CNN 架构的长程依赖性问题，它引入了一种自我注意机制来分析图像，使网络能够捕捉像素之间的深度长程依赖性。本研究旨在为核磁共振成像脑肿瘤分类提供高效的 CAD 工具。同时，我们还希望增强神经放射科医生在临床和医疗标准中使用 DL 时的信心。在本文中，我们利用 T1 加权对比增强 MRI 序列，针对常见原发性肿瘤（胶质瘤、脑膜瘤和垂体脑瘤）的多重分类任务，研究了一种从头开始训练的深度 ViT 架构。目前已采用了几种 XAI 技术：梯度加权类激活图谱（Grad-CAM）、局部可解释模型-诊断性解释（LIME）和 SHapley Additive exPlanations（SHAP），以可视化与模型预测结果相关的最重要和最显著的特征。评估任务使用了一个公开的基准数据集。对比研究证实了 ViT 架构与使用测试数据集的 CNN 模型相比的效率。卷积神经网络（CNN）的测试准确率为 83.37%，而视觉转换器（ViT）的测试准确率为 91.61%，这表明 ViT 模型在分类任务中的表现优于 CNN 模型。根据实验结果，我们可以确认所提出的 ViT 模型在使用核磁共振成像序列进行多重分类方面的性能优于最先进的模型。此外，所提出的模型还提供了准确和正确的解释。因此，我们可以确认所提出的计算机辅助诊断系统可以在临床诊断过程中使用。

{"title":"Vision transformers (ViT) and deep convolutional neural network (D-CNN)-based models for MRI brain primary tumors images multi-classification supported by explainable artificial intelligence (XAI)","authors":"Hiba Mzoughi, Ines Njeh, Mohamed BenSlima, Nouha Farhat, Chokri Mhiri","doi":"10.1007/s00371-024-03524-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03524-x","url":null,"abstract":"The manual classification of primary brain tumors through Magnetic Resonance Imaging (MRI) is considered as a critical task during the clinical routines that requires highly qualified neuroradiologists. Deep Learning (DL)-based computer-aided diagnosis tools are established to support the neurosurgeons’ opinion during the diagnosis. However, the black-box nature and the lack of transparency and interpretability of such DL-based models make their implementation, especially in critical and sensitive medical applications, very difficult. The explainable artificial intelligence techniques help to gain clinicians’ confidence and to provide explanations about the models' predictions. Typical and existing Convolutional Neural Network (CNN)-based architectures could not capture long-range global information and feature from pathology MRI scans. Recently, Vision Transformer (ViT) networks have been introduced to solve the issue of long-range dependency in CNN-based architecture by introducing a self-attention mechanism to analyze images, allowing the network to capture deep long-range reliance between pixels. The purpose of the proposed study is to provide efficient CAD tool for MRI brain tumor classification. At the same, we aim to enhance the neuroradiologists' confidence when using DL in clinical and medical standards. In this paper, we investigated a deep ViT architecture trained from scratch for the multi-classification task of common primary tumors (gliomas, meningiomas, and pituitary brain tumors), using T1-weighted contrast-enhanced MRI sequences. Several XAI techniques have been adopted: Gradient-weighted Class Activation Mapping (Grad-CAM), Local Interpretable Model-agnostic Explanations (LIME), and SHapley Additive exPlanations (SHAP), to visualize the most significant and distinguishing features related to the model prediction results. A publicly available benchmark dataset has been used for the evaluation task. The comparative study confirms the efficiency of ViT architecture compared to the CNN model using the testing dataset. The test accuracy of 83.37% for the Convolutional Neural Network (CNN) and 91.61% for the Vision Transformer (ViT) indicates that the ViT model outperformed the CNN model in the classification task. Based on the experimental results, we could confirm that the proposed ViT model presents a competitive performance outperforming the multi-classification state-of-the-art models using MRI sequences. Further, the proposed models present an exact and correct interpretation. Thus, we could confirm that the proposed CAD could be established during the clinical diagnosis routines.","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Advancing autism prediction through visual-based AI approaches: integrating advanced eye movement analysis and shape recognition with Kalman filtering 通过基于视觉的人工智能方法推进自闭症预测：将高级眼动分析和形状识别与卡尔曼滤波相结合

The Visual Computer

Pub Date : 2024-06-26 DOI: 10.1007/s00371-024-03529-6

Suresh Cheekaty, G. Muneeswari

In the recent past, the global prevalence of autism spectrum disorder (ASD) has witnessed a remarkable surge, underscoring its significance as a widespread neurodevelopmental disorder affecting children, with an incidence rate of 0.62%. Individuals diagnosed with ASD often grapple with challenges in language acquisition and comprehending verbal communication, compounded by difficulties in nonverbal communication aspects such as gestures and eye contact. Eye movement analysis, a multifaceted field spanning industrial engineering to psychology, offers invaluable insights into human attention and behavior patterns. The present study proposes an economical eye movement analysis system that adroitly integrates Neuro Spectrum Net (NSN) techniques with Kalman filtering, enabling precise eye position estimation. The overarching objective is to enhance deep learning models for early autism detection by leveraging eye-tracking data, a critical consideration given the pivotal role of early intervention in mitigating the disorder’s impact. Through the synergistic incorporation of NSN and contrast-limited adaptive histogram equalization for feature extraction, the proposed model exhibits superior scalability and accuracy when compared to existing methodologies, thereby holding promising potential for clinical applications. A comprehensive series of experiments and rigorous evaluations underscore the system’s efficacy in eye movement classification and pupil position identification, outperforming traditional Recurrent Neural Network approaches. The dataset utilized in the aforementioned scholarly article is accessible through the Zenodo repository and can be retrieved via the following link: [https://zenodo.org/records/10935303?preview=1].

近年来，自闭症谱系障碍（ASD）在全球的发病率显著上升，其发病率为 0.62%，凸显了自闭症谱系障碍作为一种广泛影响儿童的神经发育障碍的重要性。被诊断为 ASD 的患者通常在语言习得和理解言语交流方面面临挑战，同时在手势和眼神交流等非言语交流方面也存在困难。眼动分析是一个横跨工业工程学和心理学的多层面领域，它为人类的注意力和行为模式提供了宝贵的见解。本研究提出了一种经济的眼动分析系统，该系统巧妙地将神经频谱网（NSN）技术与卡尔曼滤波技术相结合，实现了精确的眼球位置估计。鉴于早期干预在减轻自闭症影响方面的关键作用，这是一个至关重要的考虑因素。通过将 NSN 和对比度受限的自适应直方图均衡化协同整合到特征提取中，与现有方法相比，所提出的模型表现出更优越的可扩展性和准确性，因此在临床应用中大有可为。一系列全面的实验和严格的评估证明了该系统在眼球运动分类和瞳孔位置识别方面的功效，优于传统的循环神经网络方法。上述学术文章中使用的数据集可通过 Zenodo 存储库访问，并可通过以下链接检索：[https://zenodo.org/records/10935303?preview=1].

{"title":"Advancing autism prediction through visual-based AI approaches: integrating advanced eye movement analysis and shape recognition with Kalman filtering","authors":"Suresh Cheekaty, G. Muneeswari","doi":"10.1007/s00371-024-03529-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03529-6","url":null,"abstract":"In the recent past, the global prevalence of autism spectrum disorder (ASD) has witnessed a remarkable surge, underscoring its significance as a widespread neurodevelopmental disorder affecting children, with an incidence rate of 0.62%. Individuals diagnosed with ASD often grapple with challenges in language acquisition and comprehending verbal communication, compounded by difficulties in nonverbal communication aspects such as gestures and eye contact. Eye movement analysis, a multifaceted field spanning industrial engineering to psychology, offers invaluable insights into human attention and behavior patterns. The present study proposes an economical eye movement analysis system that adroitly integrates Neuro Spectrum Net (NSN) techniques with Kalman filtering, enabling precise eye position estimation. The overarching objective is to enhance deep learning models for early autism detection by leveraging eye-tracking data, a critical consideration given the pivotal role of early intervention in mitigating the disorder’s impact. Through the synergistic incorporation of NSN and contrast-limited adaptive histogram equalization for feature extraction, the proposed model exhibits superior scalability and accuracy when compared to existing methodologies, thereby holding promising potential for clinical applications. A comprehensive series of experiments and rigorous evaluations underscore the system’s efficacy in eye movement classification and pupil position identification, outperforming traditional Recurrent Neural Network approaches. The dataset utilized in the aforementioned scholarly article is accessible through the Zenodo repository and can be retrieved via the following link: [https://zenodo.org/records/10935303?preview=1].","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0