Pub Date: 2026-03-01. Epub Date: 2026-01-05. DOI: 10.1016/j.image.2026.117478
Feng Chen, Jielong He, Yang Liu, Xiwen Qu
Existing text-based person search methods struggle with complex cross-modal interactions and often fail to capture subtle semantic nuances. To address this, we propose a novel Fine-grained Cross-modal Semantic Alignment (FCSA) framework that enhances accuracy and robustness in text-based person search. FCSA introduces two key components: the Cross-Modal Reconstruction Strategy (CMRS) and the Saliency-Guided Masking Mechanism (SGMM). CMRS facilitates feature alignment by leveraging incomplete visual and textual features, promoting bidirectional reasoning across modalities, and enhancing fine-grained semantic understanding. SGMM further refines performance by dynamically focusing on salient visual patches and critical text tokens, thereby improving discriminative region perception and image–text matching precision. Our approach outperforms existing state-of-the-art methods, achieving mean Average Precision (mAP) scores of 69.72%, 43.78%, and 48.78% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Source code is available at https://github.com/flychen321/FCSA.
Title: Text-based person search via fine-grained cross-modal semantic alignment. Signal Processing: Image Communication, Volume 142, Article 117478.
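As a rough illustration of the saliency-guided masking idea described in the abstract, the sketch below scores image patches by their similarity to a pooled text embedding and masks the most salient ones, so a model would have to reconstruct them from cross-modal context. The scoring rule, mask ratio, and tensor shapes are assumptions for illustration, not the FCSA implementation.

import torch
import torch.nn.functional as F

def saliency_guided_mask(patch_tokens, text_feature, mask_ratio=0.3):
    """Mask the image patches most similar to the sentence embedding so a
    model must reconstruct them from the remaining cross-modal context.
    patch_tokens: (B, N, D) visual patch embeddings
    text_feature: (B, D) pooled textual embedding
    """
    B, N, D = patch_tokens.shape
    sim = torch.einsum("bnd,bd->bn",
                       F.normalize(patch_tokens, dim=-1),
                       F.normalize(text_feature, dim=-1))      # patch-text saliency
    k = max(1, int(mask_ratio * N))
    top = sim.topk(k, dim=1).indices                           # most salient patches
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, top, torch.ones_like(top, dtype=torch.bool))
    masked = patch_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask

# toy usage
tokens, text = torch.randn(2, 196, 512), torch.randn(2, 512)
masked_tokens, mask = saliency_guided_mask(tokens, text)
print(mask.sum(dim=1))   # number of masked patches per image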
Pub Date: 2026-03-01. Epub Date: 2025-12-24. DOI: 10.1016/j.image.2025.117468
Mohammad Roueinfar, Mohammad Hossein Kahaei
Noise is a major challenge for high-resolution image reconstruction in Inverse Synthetic Aperture Radar (ISAR) with sparse apertures at low Signal-to-Noise Ratios (SNRs). It is well known that image resolution in the range and azimuth dimensions is governed by the bandwidth of the transmitted signal and the Coherent Processing Interval (CPI), respectively. To reduce the noise effect and thus increase the two-dimensional resolution of Unmanned Aerial Vehicle (UAV) images, we propose the Fast Reweighted Atomic Norm Denoising (FRAND) algorithm, which incorporates weighted atomic norm minimization. To solve the resulting problem efficiently, the Two-Dimensional Alternating Direction Method of Multipliers (2D-ADMM) algorithm is developed to speed up the implementation. Assuming sparse apertures for ISAR images of UAVs, we compare the proposed method with the MUltiple SIgnal Classification (MUSIC), Cadzow, and SL0 methods at different SNRs. Simulation results show the superiority of FRAND at low SNRs in terms of the Mean-Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM) criteria.
Title: Enhanced ISAR imaging of UAVs: Noise reduction via weighted atomic norm minimization and 2D-ADMM. Signal Processing: Image Communication, Volume 142, Article 117468.
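The FRAND solver itself (2D-ADMM over a reweighted atomic norm) is beyond a short snippet, but the reweighting principle it relies on can be seen in a much simpler setting. The sketch below applies iterative reweighting to plain l1 denoising by soft-thresholding: weights derived from the current estimate suppress small, noise-like coefficients while preserving large ones. This is an analogy for intuition under assumed parameters (lam, eps, iteration count), not the paper's algorithm.

import numpy as np

def reweighted_l1_denoise(y, lam=0.5, eps=1e-2, outer_iters=5):
    """Denoise a sparse signal y = x + noise by repeatedly solving a
    weighted soft-thresholding problem; small coefficients get large
    weights and are suppressed, large ones are preserved."""
    x = y.copy()
    for _ in range(outer_iters):
        w = 1.0 / (np.abs(x) + eps)                            # reweighting step
        thresh = lam * w
        x = np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)   # weighted prox
    return x

rng = np.random.default_rng(0)
x_true = np.zeros(64)
x_true[[5, 20, 40]] = [3.0, -2.5, 4.0]
y = x_true + 0.3 * rng.standard_normal(64)
x_hat = reweighted_l1_denoise(y)
print(np.round(x_hat[[5, 20, 40]], 2))   # spikes survive, noise is zeroed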
Memory-network-based video object segmentation algorithms store information about the target object in an external memory bank. As segmentation progresses, the size of this memory bank keeps growing, which leads to redundant feature information and degrades the efficiency of the algorithm. In addition, the key-value pairs stored in the memory bank undergo channel dimension reduction through standard convolution, which limits the representational ability of the target object features. To address these issues, this paper proposes a video object segmentation algorithm based on feature compression and attention correction. It constructs a reliable and effective memory bank that ensures efficient storage and updating of target object information, thereby reducing computational complexity and storage consumption. A dual attention mechanism over the spatial and channel dimensions is further proposed to correct feature information and enhance the representational ability of the features. Extensive experiments show that the proposed algorithm is competitive with recent mainstream algorithms.
Title: Video object segmentation based on feature compression and attention correction. Authors: Zhiqiang Hou, Jiale Dong, Chenxu Wang, Sugang Ma, Wangsheng Yu, Yuncheng Wang. DOI: 10.1016/j.image.2025.117456. Signal Processing: Image Communication, Volume 142, Article 117456.
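For readers unfamiliar with dual spatial/channel attention, the following is a minimal sketch of such a correction block in the spirit of what the abstract describes: channel attention reweights feature channels, then spatial attention reweights locations. The layer sizes, reduction ratio, and kernel size are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # channel attention: squeeze spatial dims, excite channels
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # spatial attention: squeeze channels, excite locations
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        x = x * self.channel_mlp(x)             # reweight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        x = x * self.spatial_conv(pooled)       # reweight spatial positions
        return x

feat = torch.randn(1, 64, 30, 54)
print(DualAttention(64)(feat).shape)            # torch.Size([1, 64, 30, 54])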
Panoptic driving perception requires robust and efficient context understanding, which entails simultaneous semantic and instance segmentation. This paper proposes U-MobileViT, a lightweight backbone network designed to address this challenge. Our architecture combines the advantages of MobileViT, a family of Transformer-based models with high accuracy and fast processing speed, with the image segmentation structure of the U-Net model, facilitating multiscale feature fusion and accurate localization. U-MobileViT efficiently combines local and global spatial information by utilizing MobileViT blocks with Separable-Attention layers, resulting in a computationally lightweight yet effective architecture, while the U-Net structure enables efficient integration of features from different levels of the hierarchy. This synergistic combination yields rich, context-aware feature maps that are critical for accurate panoptic segmentation. Through extensive experiments on the challenging BDD100K driving dataset, we demonstrate that U-MobileViT achieves state-of-the-art performance in panoptic driving perception, outperforming existing lightweight models in both accuracy and inference speed. Our results demonstrate the potential of U-MobileViT as a robust and efficient backbone for real-time panoptic scene understanding in autonomous driving applications. Code is available at https://github.com/quyongkeomut/UMobileViT.
Title: U-MobileViT: A Lightweight Vision Transformer-based Backbone for Panoptic Driving Segmentation. Authors: Phuoc-Thinh Nguyen, The-Bang Nguyen, Phu Pham, Quang-Thinh Bui. DOI: 10.1016/j.image.2025.117461. Signal Processing: Image Communication, Volume 142, Article 117461.
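The U-Net-style fusion that the abstract credits for multiscale integration boils down to upsampling a deep, semantically rich feature map and combining it with a higher-resolution encoder feature. A minimal sketch follows; the channel counts and the single convolutional projection are illustrative choices, not the U-MobileViT configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(deep_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, deep, skip):
        # upsample the deep feature to the skip connection's resolution, then fuse
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        return F.relu(self.proj(torch.cat([deep, skip], dim=1)))

deep = torch.randn(1, 128, 20, 20)   # low-resolution, semantically rich
skip = torch.randn(1, 64, 40, 40)    # high-resolution encoder feature
print(SkipFusion(128, 64, 96)(deep, skip).shape)   # torch.Size([1, 96, 40, 40])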
Pub Date: 2026-03-01. Epub Date: 2026-01-15. DOI: 10.1016/j.image.2026.117484
Haijun Wang, Haoyu Qu, Lihua Qi, Zihao Su
Most advancements in unmanned aerial vehicle (UAV) tracking have focused on daytime scenarios with optimal lighting conditions. However, the unpredictable and complex noise inherent in camera systems significantly impairs the effectiveness of UAV tracking algorithms, particularly in low-light environments. To address this challenge, we introduce a novel U-shaped plug-and-play denoising network that reduces cluttered and intricate real-world noise, thereby enhancing nighttime UAV tracking performance. Specifically, the U-shaped denoising network utilizes a CNN-Transformer block as the encoder, which incorporates hybrid attention to simultaneously capture both local details and global structures. Additionally, to further improve the denoising effect, we design a wavelet-based multi-scale feature fusion block that adaptively combines features from various stages of the encoding process. Finally, we develop a multi-feature collaboration decoder to fully integrate comprehensive features through multi-head transposed cross-attention. Extensive experiments demonstrate that the proposed UHW-former achieves remarkable denoising performance and significantly enhances nighttime UAV tracking.
Title: UHW-former: U-shape hybrid transformer with wavelet-based multi-scale feature fusion for nighttime UAV tracking. Signal Processing: Image Communication, Volume 142, Article 117484.
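As background for the wavelet-based fusion block mentioned above, the sketch below computes a single-level Haar wavelet decomposition of a feature map into low-frequency (LL) and high-frequency (LH, HL, HH) subbands. It shows only the transform; how UHW-former weights and fuses the subbands across encoder stages is not reproduced here.

import torch

def haar_dwt2d(x):
    """x: (B, C, H, W) with even H, W. Returns the LL, LH, HL, HH subbands,
    each of shape (B, C, H//2, W//2)."""
    a = x[..., 0::2, 0::2]   # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

x = torch.randn(1, 32, 64, 64)
ll, lh, hl, hh = haar_dwt2d(x)
print(ll.shape, hh.shape)   # torch.Size([1, 32, 32, 32]) twice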
Pub Date: 2026-03-01. Epub Date: 2026-01-09. DOI: 10.1016/j.image.2026.117482
Chen-Yi Lin, Su-Ho Chiu
The widespread adoption of the Internet has enhanced communication between individuals but has also increased the risk of secret messages being intercepted, drawing public attention to the security of message transmission. Image steganography has been a prominent area of research within the field of secure communication technologies. However, traditional image steganography techniques risk being compromised by steganalysis tools, leading researchers to propose the concept of coverless image steganography. In recent years, numerous coverless image steganography techniques have been developed that effectively resist steganalysis tools. However, these techniques commonly suffer from incomplete mapping of secret messages, rendering them incapable of successfully concealing the information. Furthermore, most existing coverless steganography techniques rely on cryptographic methods to protect auxiliary information, which may raise suspicion and result in interception, thereby preventing the receiver from correctly recovering the secret messages. To address these issues, this study proposes a novel coverless image steganography technique based on ring features and discrete wavelet transform (DWT) sequence mapping. This method generates feature sequences from both the spatial and frequency domains of images and employs an innovative stego image collage mechanism to transmit auxiliary information, thereby reducing the risk of interception. Experimental results demonstrate that the proposed technique significantly enhances the richness of feature sequences and the completeness of message mapping, achieving a 100% success rate on medium- and large-scale image datasets. Moreover, the proposed method exhibits superior robustness even under conditions where existing techniques suffer from low mapping success rates or prolonged mapping times.
Title: Robust coverless image steganography based on ring features and DWT sequence mapping. Signal Processing: Image Communication, Volume 142, Article 117482.
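The core of coverless steganography is mapping rather than embedding: the sender transmits an unmodified image whose intrinsic feature sequence already equals the secret bits. The toy sketch below indexes a set of images by a binary sequence derived from their own content; the block-mean feature used here is a deliberately simplified stand-in for the paper's ring/DWT features, and the database and sequence length are made up for illustration.

import numpy as np

def feature_sequence(img, n_bits=8):
    """Derive n_bits from an image by comparing consecutive block means."""
    h = img.shape[0] // (n_bits + 1)
    means = [img[i * h:(i + 1) * h].mean() for i in range(n_bits + 1)]
    return "".join("1" if means[i + 1] > means[i] else "0" for i in range(n_bits))

rng = np.random.default_rng(1)
database = {i: rng.integers(0, 256, size=(72, 72)).astype(float) for i in range(500)}

# build the mapping: bit pattern -> images that naturally carry it
index = {}
for img_id, img in database.items():
    index.setdefault(feature_sequence(img), []).append(img_id)

secret = "10110010"
carriers = index.get(secret, [])
print(f"{len(carriers)} candidate carrier images for secret {secret}")
# receiver side: recomputing feature_sequence(carrier) recovers the secret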
Pub Date: 2026-03-01. Epub Date: 2026-01-16. DOI: 10.1016/j.image.2026.117483
Francesco Barbato, Matteo Caligiuri, Pietro Zanuttigh
The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments relies heavily on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 makes several new key contributions: (1) multimodal data (RGB, depth, semantic labels) across diverse environmental conditions, including varying weather and times of day; (2) depth maps for real samples computed via state-of-the-art monocular depth estimation; (3) benchmarks for RGB and multimodal semantic segmentation on standard architectures; (4) studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding. Dataset link: https://medialab.dei.unipd.it/paper_data/FlyAwareV2
Title: FlyAwareV2: A multimodal cross-domain UAV dataset for urban scene understanding. Signal Processing: Image Communication, Volume 142, Article 117483.
Pub Date: 2026-03-01. Epub Date: 2025-12-23. DOI: 10.1016/j.image.2025.117466
Shaheen Raphiahmed Mujawar, Sridhar Iyer
The development of an image captioning system could make the world more accessible to persons who are blind. Recently, researchers have focused on automatically generating textual descriptions of observed images. However, autonomously creating captions for images remains difficult for computer vision and natural language processing. Hence, this article proposes an efficient automatic image captioning framework with an attentional language encoder-decoder enabled by Deep Learning (DL) models. The developed model integrates four main components: the Feature Extractor Encoder Module (FEEM), the Co-ordinated Relationship Learning Module (CRLM), the Attentional Feature Fusion Module (AFFM), and the Language Decoder Module. Region- and semantic-based feature extraction from the image is performed using a Res-Inception and Convolutional Neural Network (CNN) model. Moreover, CRLM is introduced to generate balanced relationship features, and AFFM fuses multiple levels of visual information while selectively focusing on the visual regions associated with each word prediction. An Attentional Model with Residual BiGRU (ARBiGRU) is implemented as the language decoder to generate the correct caption for the input image. The developed model is evaluated on the Flickr8k and Flickr30k datasets using the BLEU, METEOR, CIDEr, and ROUGE-L caption metrics. An ablation study over six cases assesses the effectiveness of the proposed model, and the performance analysis demonstrates that the proposed approach outperforms existing caption generation techniques.
Title: Deep learning model with co-ordinated relationship for image captioning enabled via attentional language encoder-decoder. Signal Processing: Image Communication, Volume 142, Article 117466.
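The attentional decoding described above, stripped to its essentials, computes attention weights over region features at each step, forms a visual context vector, and feeds it together with the previous word into a recurrent cell. The sketch below shows one such step with a plain GRUCell; the dimensions and single-layer cell are placeholder choices, not the ARBiGRU configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hid_dim=512, vocab_size=10000, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(feat_dim + hid_dim, 1)
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_word, hidden, regions):
        # regions: (B, R, feat_dim) region features from the visual encoder
        B, R, _ = regions.shape
        scores = self.attn(torch.cat([regions,
                                      hidden.unsqueeze(1).expand(B, R, -1)], dim=-1))
        alpha = F.softmax(scores, dim=1)              # (B, R, 1) attention weights
        context = (alpha * regions).sum(dim=1)        # attended visual context
        hidden = self.gru(torch.cat([self.embed(prev_word), context], dim=-1), hidden)
        return self.out(hidden), hidden, alpha

step = AttentiveDecoderStep()
logits, h, alpha = step(torch.tensor([2]), torch.zeros(1, 512), torch.randn(1, 36, 512))
print(logits.shape, alpha.shape)    # torch.Size([1, 10000]) torch.Size([1, 36, 1])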
In recent years, diffusion models have achieved remarkable performance in image generation and have been widely applied, and their potential for image enhancement tasks is gradually being explored. However, when applied to underwater scenes, diffusion models designed for general image restoration struggle to achieve their expected performance. This is due to the scattering and absorption of light in underwater environments, which leave underwater images suffering from color distortion, low contrast, and haziness. These issues often co-occur within a single underwater image, making underwater image enhancement more challenging than typical image enhancement tasks. To better adapt diffusion models to underwater image enhancement, this paper proposes an underwater image enhancement method based on a latent diffusion model. The proposed model's latent encoder progressively mitigates adverse degradation factors embedded within the hidden layers while preserving essential image feature information in the latent representation, thus enabling a smoother diffusion process. Additionally, we design a gated fusion network that integrates guiding features at multiple scales, steering the network towards diffusion with superior visual quality restoration. A series of qualitative and quantitative experiments conducted on various real-world underwater image datasets demonstrate that our proposed method outperforms recent state-of-the-art methods in terms of visual effects and generalization capability, proving the effectiveness of applying diffusion models to underwater enhancement tasks.
Title: UW-SDE: Multi-scale prompt feature guided diffusion model for underwater image enhancement. Authors: Jiaxi Li, Junjun Wu, Qinghua Lu, Ningwei Qin, Shuhong Zhou, Weijian Li. DOI: 10.1016/j.image.2026.117486. Signal Processing: Image Communication, Volume 142, Article 117486.
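A gated fusion block of the general kind mentioned in the abstract can be sketched as follows: a learned sigmoid gate decides, per channel and position, how much of the guiding (prompt) feature to inject into the diffusion feature. The convolutional gate and channel count are assumptions; where such a block sits inside UW-SDE is not specified here.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, diff_feat, guide_feat):
        g = self.gate(torch.cat([diff_feat, guide_feat], dim=1))  # (B, C, H, W) in [0, 1]
        return g * guide_feat + (1 - g) * diff_feat               # convex blend

fused = GatedFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)   # torch.Size([1, 64, 32, 32])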
Pub Date: 2026-03-01. Epub Date: 2025-12-23. DOI: 10.1016/j.image.2025.117465
Yan Chen, Zhongkang Jiang, Jixiang Du, Hongbo Zhang
High-quality fusion of template and search frames is essential for effective visual object tracking. However, mainstream Transformer-based trackers, whether dual-stream or single-stream, often fuse these frames indiscriminately, allowing background noise to disrupt target-specific feature extraction. To address this, we propose LTTrack (learnable token for visual tracking), an adaptive feature fusion method based on a Transformer architecture with an autoregressive encoder–decoder structure. The core innovation is a learnable token in the encoder, which processes three inputs: search tokens, template tokens, and the learnable token. This token is designed to interact with the template, enabling precise fusion and extraction of target-relevant features. Our approach adaptively fuses search and template tokens, and extensive experiments show that LTTrack achieves state-of-the-art performance across six challenging benchmarks.
Title: Learnable token for visual tracking. Signal Processing: Image Communication, Volume 142, Article 117465.
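The learnable-token idea resembles a CLS token: a trainable embedding is concatenated with the template and search tokens before the Transformer encoder so it can aggregate target-relevant information through self-attention. The sketch below illustrates that wiring; the encoder depth, width, and the way outputs are split back apart are placeholder choices, not LTTrack's actual design.

import torch
import torch.nn as nn

class LearnableTokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))   # the learnable token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, template_tokens, search_tokens):
        B = search_tokens.size(0)
        tok = self.token.expand(B, -1, -1)
        x = torch.cat([tok, template_tokens, search_tokens], dim=1)
        x = self.encoder(x)                                  # joint self-attention
        # split the jointly attended sequence back into its three parts
        n_t = template_tokens.size(1)
        return x[:, 0], x[:, 1:1 + n_t], x[:, 1 + n_t:]

enc = LearnableTokenEncoder()
tok, tmpl, srch = enc(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
print(tok.shape, tmpl.shape, srch.shape)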