Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105834
Qianhua Hu, Liantao Wang
Real-time detection in UAV-captured imagery remains a formidable challenge, primarily owing to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone's feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attention mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces an aspect-ratio consistency penalty factor and auxiliary boxes, combining the advantages of Shape-IoU and NWD to make the loss better suited to tiny object detection. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with only a marginal increase in FLOPs. On the VisDrone dataset, AP improves by 3.2% and AP50 by 4.3% over D-FINE-S, confirming its efficacy and superiority for tiny object detection in UAV imagery.
{"title":"HF-D-FINE: High-resolution features enhanced D-FINE for tiny object detection in UAV image","authors":"Qianhua Hu, Liantao Wang","doi":"10.1016/j.imavis.2025.105834","DOIUrl":"10.1016/j.imavis.2025.105834","url":null,"abstract":"<div><div>Real-time detection in UAV-captured imagery remains a formidable challenge, primarily attributed to the inherent tension between high detection performance and strict computational economy. To address this dilemma, we introduce HF-D-FINE, a novel object-detection paradigm that builds upon the D-FINE architecture and comprises three effective innovations. The HF Hybrid Encoder alleviates the loss of fine-grained detail by selectively injecting high-resolution cues from the backbone’s feature pyramid into the encoder, thereby enriching the representation of minute instances. Complementarily, the CAF module performs cross-scale feature fusion by integrating channel-attentive mechanisms and dynamic upsampling, enabling more expressive interactions between multi-level semantics and spatial cues. Finally, Outer-SNWD introduces aspect ratio consistency penalty factor and auxiliary boxes based on the advantages of Shape-IoU and NWD, making it more suitable for tiny object detection tasks. Collectively, these components substantially elevate tiny object detection accuracy while preserving low computational overhead. Extensive experiments on the widely-adopted aerial benchmarks VisDrone, AI-TOD, and UAVDT demonstrate that HF-D-FINE achieves superior accuracy with a tiny increase in FLOPs. In the VisDrone dataset, the AP value is increased by 3.2% compared with D-FINE-S, and the AP<sub>50</sub> value is increased by 4.3%, confirming its efficacy and superiority for tiny object detection in UAV image.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105834"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105827
Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi
Objective
To develop a robust model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.
Methods
The proposed system combines Internet of Things (IoT) and Artificial Intelligence (AI) technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, using resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. Proximity sensors, ambient light sensors, and motion detectors gathered data under different lighting, weather, and dynamic conditions, and the sensor inputs were processed with the filtering, fusion, and graphical-model techniques above to provide real-time environmental context and motion detection. Training and validation on the collected dataset followed a rigorous protocol to ensure reliability and scalability across diverse scenarios. Ethical guidelines were adhered to throughout the project, which involved no direct interaction with human subjects.
Results
The proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, with a precision of 88% and an F1-score of 85%. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.
Conclusion
The proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. The system represents a significant advance in assistive technology and holds promise for broader applications with further enhancements.
{"title":"Advanced fusion of IoT and AI technologies for smart environments: Enhancing environmental perception and mobility solutions for visually impaired individuals","authors":"Nouf Nawar Alotaibi , Mrim M. Alnfiai , Mona Mohammed Alnahari , Salma Mohsen M. Alnefaie , Faiz Abdullah Alotaibi","doi":"10.1016/j.imavis.2025.105827","DOIUrl":"10.1016/j.imavis.2025.105827","url":null,"abstract":"<div><h3>Objective</h3><div>To develop a robust proposed model that integrates multiple sensor modalities to enhance environmental perception and mobility for visually impaired individuals, improving their autonomy and safety in both indoor and outdoor settings.</div></div><div><h3>Methods</h3><div>The proposed system utilizes advanced IoT and AI technologies, integrating data from proximity, ambient light, and motion sensors through recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models. A comprehensive dataset was collected across diverse environments to train and evaluate the model's accuracy in real-time environmental context estimation and motion activity detection. This study employed a multidisciplinary approach, integrating the Internet of Things (IoT) and Artificial Intelligence (AI), to develop a proposed model for assisting visually impaired individuals. The study was conducted over six months (April 2024 to September 2024) in Saudi Arabia, utilizing resources from Najran University. Data collection involved deploying IoT devices across various indoor and outdoor environments, including residential areas, commercial spaces, and urban streets, to ensure diversity and real-world applicability. The system utilized proximity sensors, ambient light sensors, and motion detectors to gather data under different lighting, weather, and dynamic conditions. Recursive Bayesian filtering, kernel-based fusion algorithms, and probabilistic graphical models were employed to process the sensor inputs and provide real-time environmental context and motion detection. The study followed a rigorous training and validation process using the collected dataset, ensuring reliability and scalability across diverse scenarios. Ethical considerations were adhered to throughout the project, with no direct interaction with human subjects.</div></div><div><h3>Results</h3><div>The Proposed model demonstrated an accuracy of 85% in predicting environmental context and 82% in motion detection, achieving precision and F1-scores of 88% and 85%, respectively. Real-time implementation provided reliable, dynamic feedback on environmental changes and motion activities, significantly enhancing situational awareness.</div></div><div><h3>Conclusion</h3><div>The Proposed model effectively combines sensor data to deliver real-time, context-aware assistance for visually impaired individuals, improving their ability to navigate complex environments. 
The system offers a significant advancement in assistive technology and holds promise for broader applications with further enhancements.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105827"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
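As an illustration of the recursive Bayesian filtering step described in the Methods above, the following sketch maintains a discrete belief over environmental contexts and updates it from noisy sensor readings. The context set, transition matrix, and likelihood model are hypothetical placeholders, not the authors' calibrated values.

```python
# Minimal recursive Bayesian filter over discrete environmental contexts.
import numpy as np

CONTEXTS = ["indoor", "outdoor", "street_crossing"]          # hypothetical states
TRANSITION = np.array([[0.90, 0.08, 0.02],                    # P(next | current)
                       [0.08, 0.82, 0.10],
                       [0.05, 0.15, 0.80]])

def likelihood(sensor):
    """Toy likelihood P(sensor | context) from ambient light (lux) and proximity (m)."""
    lux, prox = sensor["lux"], sensor["proximity_m"]
    indoor = np.exp(-lux / 500.0) * (0.9 if prox < 3.0 else 0.4)
    outdoor = (1.0 - np.exp(-lux / 500.0)) * 0.7
    crossing = (1.0 - np.exp(-lux / 500.0)) * (0.8 if prox < 5.0 else 0.2)
    return np.array([indoor, outdoor, crossing])

def bayes_update(belief, sensor):
    """One predict-update cycle: propagate the belief, then reweight by evidence."""
    predicted = TRANSITION.T @ belief
    posterior = likelihood(sensor) * predicted
    return posterior / posterior.sum()

belief = np.ones(3) / 3                                        # uniform prior
for reading in [{"lux": 120, "proximity_m": 1.5}, {"lux": 9000, "proximity_m": 8.0}]:
    belief = bayes_update(belief, reading)
    print(dict(zip(CONTEXTS, belief.round(3))))
```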
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105831
Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong
Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest because it eliminates the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates the expected human layout as if the person were wearing the selected target garment, while the latter extracts the actual human layout parsed from the try-on result. The supervisory signals, which require no ground-truth image pairs, are constructed by assessing the consistency between the expected and actual human layouts. We design a dual-phase training strategy that first warms up the HLG and HLP and then trains the try-on network with the supervisory signals derived from human layout consistency. On this basis, the proposed framework allows arbitrary selection of target garments during training, thereby endowing the try-on network with cross-attribute adaptation. Moreover, the framework operates with a single try-on network rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal generates high-quality try-on results, outperforming baselines by margins of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.
{"title":"A human layout consistency framework for image-based virtual try-on","authors":"Rong Huang , Zhicheng Wang , Hao Liu , Aihua Dong","doi":"10.1016/j.imavis.2025.105831","DOIUrl":"10.1016/j.imavis.2025.105831","url":null,"abstract":"<div><div>Image-based virtual try-on, commonly framed as a generative image-to-image translation task, has garnered significant research interest due to its elimination of the need for costly 3D scanning devices. In this field, image inpainting and cycle-consistency have been the dominant frameworks, but they still face challenges in cross-attribute adaptation and parameter sharing between try-on networks. This paper proposes a new framework, termed human layout consistency, based on the intuitive insight that a high-quality try-on result should align with a coherent human layout. Under the proposed framework, a try-on network is equipped with an upstream Human Layout Generator (HLG) and a downstream Human Layout Parser (HLP). The former generates an expected human layout as if the person were wearing the selected target garment, while the latter extracts an actual human layout parsed from the try-on result. The supervisory signals, free from the ground-truth image pairs, are constructed by assessing the consistencies between the expected and actual human layouts. We design a dual-phase training strategy, first warming up HLG and HLP, then training try-on network by incorporating the supervisory signals based on human layout consistency. On this basis, the proposed framework enables arbitrary selection of target garments during training, thereby endowing the try-on network with the cross-attribute adaptation. Moreover, the proposed framework operates with a single try-on network, rather than two physically separate ones, thereby avoiding the parameter-sharing issue. We conducted both qualitative and quantitative experiments on the benchmark VITON dataset. Experimental results demonstrate that our proposal can generate high-quality try-on results, outperforming baselines by a margin of 0.75% to 10.58%. Ablation and visualization results further reveal that the proposed method exhibits superior adaptability to cross-attribute translations, showcasing its potential for practical application.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105831"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-19, DOI: 10.1016/j.imavis.2025.105835
Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu
Most existing RGB-D salient object detection methods rely on convolution operations to design complex fusion modules for cross-modal information fusion. Correctly integrating RGB and depth features into multi-modal features is important for salient object detection (SOD). Discrepancies between the features of different modalities, however, severely hinder SOD models from achieving better performance. To address these issues, we design a multi-modal cooperative fusion network (MCFNet) for RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to correct erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves model performance. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve efficient fusion of cross-modal features. Experimental results on six datasets show that MCFNet outperforms other state-of-the-art (SOTA) methods, providing new ideas for salient object detection.
{"title":"Multi-modal cooperative fusion network for dual-stream RGB-D salient object detection","authors":"Jingyu Wu , Fuming Sun , Haojie Li , Mingyu Lu","doi":"10.1016/j.imavis.2025.105835","DOIUrl":"10.1016/j.imavis.2025.105835","url":null,"abstract":"<div><div>Most existing RGB-D salient object detection tasks use convolution operations to design complex fusion modules for cross-modal information fusion. How to correctly integrate RGB and depth features into multi-modal features is important to salient object detection (SOD). Due to the differences between different modal features, the salient object detection model is seriously hindered in achieving better performance. To address the issues mentioned above, we design a multi-modal cooperative fusion network (MCFNet) to achieve RGB-D SOD. Firstly, we propose an edge feature refinement module to remove interference information in shallow features and improve the edge accuracy of SOD. Secondly, a depth optimization module is designed to optimize erroneous estimates in the depth maps, which effectively reduces the impact of noise and improves the performance of the model. Finally, we construct a progressive fusion module that gradually integrates RGB and depth features in a layered manner to achieve an efficient fusion of cross-modal features. Experimental results on six datasets show that our MCFNet performs better than other state-of-the-art (SOTA) methods, which provide new ideas for salient object detection tasks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"166 ","pages":"Article 105835"},"PeriodicalIF":4.2,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17, DOI: 10.1016/j.imavis.2025.105824
Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang
Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.
{"title":"CFE-PVTSeg:Cross-domain frequency-enhanced pyramid vision transformer segmentation network","authors":"Niu Guo, Yi Liu, Pengcheng Zhang, Jiaqi Kang, Zhiguo Gui, Lei Wang","doi":"10.1016/j.imavis.2025.105824","DOIUrl":"10.1016/j.imavis.2025.105824","url":null,"abstract":"<div><div>Current polyp segmentation methods predominantly rely on either standalone Convolutional Neural Networks (CNNs) or Transformer architectures, which exhibit inherent limitations in balancing global–local contextual relationships and preserving high-frequency structural details. To address these challenges, this study proposes a Cross-domain Frequency-enhanced Pyramid Vision Transformer Segmentation Network (CFE-PVTSeg). In the encoder, the network achieves hierarchical feature enhancement by integrating Transformer encoders with wavelet transforms: it separately extracts multi-scale spatial features (based on Pyramid Vision Transformer) and frequency-domain features (based on Discrete Wavelet Transform), reinforcing high-frequency components through a cross-domain fusion mechanism. Simultaneously, deformable convolutions with enhanced adaptability are combined with regular convolutions for stability to aggregate boundary-sensitive features that accommodate the irregular morphological variations of polyps. In the decoder, an innovative Multi-Scale Feature Uncertainty Enhancement (MS-FUE) module is designed, which leverages an uncertainty map derived from the encoder to adaptively weight and refine upsampled features, thereby effectively suppressing uncertain components while enhancing the propagation of reliable information. Finally, through a multi-level fusion strategy, the model outputs refined features that deeply integrate high-level semantics with low-level spatial details. Extensive experiments on five public benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, ETIS, and CVC-300) demonstrate that CFE-PVTSeg achieves superior robustness and segmentation accuracy compared to existing methods when handling challenging scenarios such as scale variations and blurred boundaries. Ablation studies further validate the effectiveness of both the proposed cross-domain enhanced encoder and the uncertainty-driven decoder, particularly in suppressing feature noise and improving morphological adaptability to polyps with heterogeneous appearance characteristics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105824"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-17, DOI: 10.1016/j.imavis.2025.105829
Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu
Transformers are revolutionizing the landscape of artificial intelligence, unifying architectures across natural language processing, computer vision, and beyond. In this paper, we explore how far a Transformer-based architecture can go for object detection, a fundamental computer vision task with applications across a range of engineering domains. We found that introducing an early detector can improve the performance of detection transformers by letting them know where to focus. To this end, we propose a novel attention-map-to-feature-map auxiliary loss and a novel local bipartite matching strategy to obtain, at no extra cost, a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone and achieves the best AP-latency trade-off among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates, for the first time, accuracy, speed, and parameter counts on par with the previous state-of-the-art CNN-based GFLV2 framework.
{"title":"Better early detector for high-performance detection transformer","authors":"Bin Hu , Bencheng Liao , Jiyang Qi , Shusheng Yang , Wenyu Liu","doi":"10.1016/j.imavis.2025.105829","DOIUrl":"10.1016/j.imavis.2025.105829","url":null,"abstract":"<div><div>Transformers are revolutionizing the landscape of artificial intelligence, unifying the architecture for natural language processing, computer vision, and more. In this paper, we explore how far a Transformer-based architecture can go for object detection - a fundamental task in computer vision and applicable across a range of engineering applications. We found that introducing an early detector can improve the performance of detection transformers, allowing them to know where to focus. To this end, we propose a novel attention map to feature map auxiliary loss and a novel local bipartite matching strategy to cost-freely get a BEtter early detector for high-performance detection TRansformer (BETR). On the COCO dataset, BETR adds no more than 6 million parameters to the Swin Transformer backbone, achieving the highest AP and latency among existing fully Transformer-based detectors across different model scales. As a Transformer detector, BETR also demonstrates accuracy, speed, and parameters on par with previous state-of-the-art CNN-based GFLV2 framework for the first time.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105829"},"PeriodicalIF":4.2,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-15, DOI: 10.1016/j.imavis.2025.105825
Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh
Few-shot learning (FSL) has emerged as a promising solution to the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer, which limits their ability to capture the detailed spatial, semantic, and contextual information that is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture with several key innovations. It uses a multi-head attention mechanism with query, key, and value projections at multiple scales, allowing more detailed feature interactions across different levels, and includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to dynamically control the flow of attention. Additionally, a Convolutional Block Attention Module (CBAM)-based attention block is used to improve focus on the most relevant regions of the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted on three medical imaging datasets: HAM10000, BreakHis at 400× magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for few-shot medical image classification.
{"title":"MedSetFeat++: An attention-enriched set feature framework for few-shot medical image classification","authors":"Ankit Kumar Titoriya, Maheshwari Prasad Singh, Amit Kumar Singh","doi":"10.1016/j.imavis.2025.105825","DOIUrl":"10.1016/j.imavis.2025.105825","url":null,"abstract":"<div><div>Few-shot learning (FSL) has emerged as a promising solution to address the challenge of limited annotated data in medical image classification. However, traditional FSL methods often extract features from only one convolutional layer. This limits their ability to capture detailed spatial, semantic, and contextual information, which is important for accurate classification in complex medical scenarios. To overcome these limitations, this study introduces MedSetFeat++, an improved set feature learning framework with enhanced attention mechanisms, tailored for few-shot medical image classification. It extends the SetFeat architecture by incorporating several key innovations. It uses a multi-head attention mechanism with projections at multiple scales for the query, key, and value, allowing for more detailed feature interactions across different levels. It also includes learnable positional embeddings to preserve spatial information. An adaptive head gating method is added to control the flow of attention in a dynamic way. Additionally, a Convolutional Block Attention Module (CBAM) based attention module is used to improve focus on the most relevant regions in the data. To evaluate the performance and generalization of MedSetFeat++, extensive experiments were conducted using three different medical imaging datasets: HAM10000, BreakHis at 400<span><math><mo>×</mo></math></span> magnification, and Kvasir. Under a 2-way 10-shot 15-query setting, the model achieves 92.17% accuracy on HAM10000, 70.89% on BreakHis, and 73.46% on Kvasir. The proposed model outperforms state-of-the-art methods in multiple 2-way classification tasks under 1-shot, 5-shot, and 10-shot settings. These results establish MedSetFeat++ as a strong and adaptable framework for improving performance in few-shot medical image classification.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105825"},"PeriodicalIF":4.2,"publicationDate":"2025-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-12, DOI: 10.1016/j.imavis.2025.105826
Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi
Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small apparent size of objects at high drone altitudes, sudden movements of the drone's gimbal, and the limited appearance diversity of objects. Frequent occlusion under these conditions makes Re-ID difficult in long-term tracking.
In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adopts multi-resolution spatial feature extraction using a Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via a Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.
We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages the drone's metadata, such as Global Positioning System (GPS) coordinates, altitude, and camera orientation, to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated as a Re-ID module into any state-of-the-art (SotA) multi-object tracking framework.
Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves state-of-the-art performance, attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, demonstrating the effectiveness and robustness of the proposed MOT-STM framework.
{"title":"MOT-STM: Maritime Object Tracking: A Spatial-Temporal and Metadata-based approach","authors":"Vinayak S. Nageli , Arshad Jamal , Puneet Goyal , Rama Krishna Sai S Gorthi","doi":"10.1016/j.imavis.2025.105826","DOIUrl":"10.1016/j.imavis.2025.105826","url":null,"abstract":"<div><div>Object Tracking and Re-Identification (Re-ID) in maritime environments using drone video streams presents significant challenges, especially in search and rescue operations. These challenges mainly arise from the small size of objects from high drone altitudes, sudden movements of the drone’s gimbal and limited appearance diversity of objects. The frequent occlusion in these challenging conditions makes Re-ID difficult in long-term tracking.</div><div>In this work, we present a novel framework, Maritime Object Tracking with Spatial–Temporal and Metadata-based modeling (MOT-STM), designed for robust tracking and re-identification of maritime objects in challenging environments. The proposed framework adapts multi-resolution spatial feature extraction using Cross-Stage Partial with Full-Stage (C2FDark) backbone combined with temporal modeling via Video Swin Transformer (VST), enabling effective spatio-temporal representation. This design enhances detection and significantly improves tracking performance in the maritime domain.</div><div>We also propose a metadata-driven Re-ID module named Metadata-Assisted Re-ID (MARe-ID), which leverages drone’s metadata such as Global Positioning System (GPS) coordinates, altitude and camera orientation to enhance long-term tracking. Unlike traditional appearance-based Re-ID, MARe-ID remains effective even in scenarios with limited visual diversity among the tracked objects and is generic enough to be integrated into any State-of-the-Art (SotA) multi-object tracking framework as a Re-ID module.</div><div>Through extensive experiments on the challenging SeaDronesSee dataset, we demonstrate that MOT-STM significantly outperforms existing methods in maritime object tracking. Our approach achieves a state-of-the-art performance attaining a HOTA score of 70.14% and an IDF1 score of 88.70%, showing the effectiveness and robustness of the proposed MOT-STM framework.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105826"},"PeriodicalIF":4.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-12, DOI: 10.1016/j.imavis.2025.105807
Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras
Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying off-line mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings. In contrast, previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
{"title":"Unsupervised Object Localization driven by self-supervised foundation models: A comprehensive review","authors":"Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras","doi":"10.1016/j.imavis.2025.105807","DOIUrl":"10.1016/j.imavis.2025.105807","url":null,"abstract":"<div><div>Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying off-line mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings. In contrast, previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105807"},"PeriodicalIF":4.2,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145528590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-10, DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a hybrid CNN-Transformer network, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate Structure Self-Attention (StructSA) to better exploit structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level, multi-scale features while integrating both global and local information. DAFF fuses the features extracted by TransCNN, further improving their discriminability through spatial and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses these issues and achieves superior performance compared with existing state-of-the-art CNN- and Transformer-based FAR approaches.
{"title":"CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition","authors":"Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen","doi":"10.1016/j.imavis.2025.105823","DOIUrl":"10.1016/j.imavis.2025.105823","url":null,"abstract":"<div><div>Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105823"},"PeriodicalIF":4.2,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}