In certain applications of face recognition, our goal is to verify whether an individual belongs to a particular group while keeping their identity undisclosed. Existing methods have suggested a process of quantizing pre-computed face descriptors into discrete embeddings and aggregating them into a single representation for the group. However, this mechanism is only optimized for a given closed set of individuals and requires relearning the group representations from scratch whenever the groups change. In this paper, we introduce a deep architecture that simultaneously learns face descriptors and the aggregation mechanism to enhance overall performance. Our system can be utilized for new groups comprising individuals who have never been encountered before, and it easily handles new memberships or the termination of existing memberships. Through experiments conducted on multiple extensive, real-world face datasets, we demonstrate that our proposed method achieves superior verification performance compared to other baseline approaches.
{"title":"AggNet: Learning to aggregate faces for group membership verification","authors":"Marzieh Gheisari , Javad Amirian , Teddy Furon , Laurent Amsaleg","doi":"10.1016/j.image.2024.117237","DOIUrl":"10.1016/j.image.2024.117237","url":null,"abstract":"<div><div>In certain applications of face recognition, our goal is to verify whether an individual belongs to a particular group while keeping their identity undisclosed. Existing methods have suggested a process of quantizing pre-computed face descriptors into discrete embeddings and aggregating them into a single representation for the group. However, this mechanism is only optimized for a given closed set of individuals and requires relearning the group representations from scratch whenever the groups change. In this paper, we introduce a deep architecture that simultaneously learns face descriptors and the aggregation mechanism to enhance overall performance. Our system can be utilized for new groups comprising individuals who have never been encountered before, and it easily handles new memberships or the termination of existing memberships. Through experiments conducted on multiple extensive, real-world face datasets, we demonstrate that our proposed method achieves superior verification performance compared to other baseline approaches.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"132 ","pages":"Article 117237"},"PeriodicalIF":3.4,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143148377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, research on video-based person re-identification (re-ID) has investigated in depth how to effectively exploit spatiotemporal cues, which have attracted attention for their potential to provide comprehensive view representations of pedestrians. However, although the discriminability and correlation of spatiotemporal features are often studied, the complex relationships between these features have been relatively neglected. In particular, when dealing with multi-granularity features, depicting the different spatial representations of the same person under different perspectives becomes a challenge. To address this challenge, this paper proposes a multi-granularity inter-frame relationship exploration and global residual embedding network. The method extracts more comprehensive and discriminative feature representations by deeply exploring the interactions and global differences between multi-granularity features. Specifically, by modeling the dynamic relationships of features at different granularities in long video sequences and using a structured perceptual adjacency matrix to synthesize spatiotemporal information, cross-granularity information is effectively integrated into individual features. In addition, a residual learning mechanism guides the diversified development of global features and reduces the negative impact of factors such as occlusion. Experimental results on three mainstream benchmark datasets verify the effectiveness of the method, which significantly surpasses state-of-the-art solutions and shows that the complex relationships between multi-granularity spatiotemporal features can be accurately identified and exploited in video-based person re-ID.
{"title":"Multi-granular inter-frame relation exploration and global residual embedding for video-based person re-identification","authors":"Zhiqin Zhu , Sixin Chen , Guanqiu Qi , Huafeng Li , Xinbo Gao","doi":"10.1016/j.image.2024.117240","DOIUrl":"10.1016/j.image.2024.117240","url":null,"abstract":"<div><div>In recent years, the field of video-based person re-identification (re-ID) has conducted in-depth research on how to effectively utilize spatiotemporal clues, which has attracted attention for its potential in providing comprehensive view representations of pedestrians. However, although the discriminability and correlation of spatiotemporal features are often studied, the exploration of the complex relationships between these features has been relatively neglected. Especially when dealing with multi-granularity features, how to depict the different spatial representations of the same person under different perspectives becomes a challenge. To address this challenge, this paper proposes a multi-granularity inter-frame relationship exploration and global residual embedding network specifically designed to solve the above problems. This method successfully extracts more comprehensive and discriminative feature representations by deeply exploring the interactions and global differences between multi-granularity features. Specifically, by simulating the dynamic relationship of different granularity features in long video sequences and using a structured perceptual adjacency matrix to synthesize spatiotemporal information, cross-granularity information is effectively integrated into individual features. In addition, by introducing a residual learning mechanism, this method can also guide the diversified development of global features and reduce the negative impacts caused by factors such as occlusion. Experimental results verify the effectiveness of this method on three mainstream benchmark datasets, significantly surpassing state-of-the-art solutions. This shows that this paper successfully solves the challenging problem of how to accurately identify and utilize the complex relationships between multi-granularity spatiotemporal features in video-based person re-ID.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"132 ","pages":"Article 117240"},"PeriodicalIF":3.4,"publicationDate":"2024-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143148374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-29 | DOI: 10.1016/j.image.2024.117242
Chengdong Lan, Hao Yan, Cheng Luo, Tiesong Zhao
The introduction of multiple viewpoints in video scenes inevitably increases the bitrates required for storage and transmission. To reduce bitrates, researchers have developed methods that skip intermediate viewpoints during compression and delivery and ultimately reconstruct them using Side Information (SInfo). Typically, depth maps are used to construct SInfo, but these methods suffer from reconstruction inaccuracies and inherently high bitrates. In this paper, we propose a novel multi-view video coding method that leverages the image generation capabilities of a Generative Adversarial Network (GAN) to improve the reconstruction accuracy of SInfo. Additionally, we incorporate information from adjacent temporal and spatial viewpoints to further reduce SInfo redundancy. At the encoder, we construct a spatio-temporal Epipolar Plane Image (EPI) and use a convolutional network to extract the latent code of a GAN as SInfo. At the decoder, we combine the SInfo and adjacent viewpoints to reconstruct intermediate views using the GAN generator. Specifically, we establish a joint encoder constraint on reconstruction cost and SInfo entropy to achieve an optimal trade-off between reconstruction quality and bitrate overhead. Experiments demonstrate significant improvements in Rate-Distortion (RD) performance compared to state-of-the-art methods.
{"title":"GAN-based multi-view video coding with spatio-temporal EPI reconstruction","authors":"Chengdong Lan, Hao Yan, Cheng Luo, Tiesong Zhao","doi":"10.1016/j.image.2024.117242","DOIUrl":"10.1016/j.image.2024.117242","url":null,"abstract":"<div><div>The introduction of multiple viewpoints in video scenes inevitably increases the bitrates required for storage and transmission. To reduce bitrates, researchers have developed methods to skip intermediate viewpoints during compression and delivery, and ultimately reconstruct them using Side Information (SInfo). Typically, depth maps are used to construct SInfo. However, these methods suffer from reconstruction inaccuracies and inherently high bitrates. In this paper, we propose a novel multi-view video coding method that leverages the image generation capabilities of Generative Adversarial Network (GAN) to improve the reconstruction accuracy of SInfo. Additionally, we consider incorporating information from adjacent temporal and spatial viewpoints to further reduce SInfo redundancy. At the encoder, we construct a spatio-temporal Epipolar Plane Image (EPI) and further utilize a convolutional network to extract the latent code of a GAN as SInfo. At the decoder, we combine the SInfo and adjacent viewpoints to reconstruct intermediate views using the GAN generator. Specifically, we establish a joint encoder constraint for reconstruction cost and SInfo entropy to achieve an optimal trade-off between reconstruction quality and bitrate overhead. Experiments demonstrate the significant improvement in Rate–Distortion (RD) performance compared to state-of-the-art methods.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"132 ","pages":"Article 117242"},"PeriodicalIF":3.4,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143148373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-28 | DOI: 10.1016/j.image.2024.117243
Yue Yu, Jingshuo Xing, Nengli Li
Style transfer refers to the conversion of images between two different domains. Compared with style transfer guided by a style image, style transfer guided by a text description is more flexible and applicable to more practical scenarios. However, existing text-conditioned style transfer methods must be trained and optimized anew for each pair of text and image inputs, which limits their efficiency. Therefore, this paper proposes a multi-layer feature fusion based style transfer method (MlFFST) with arbitrary text conditions. To address the problems of distortion and missing semantic content, we also introduce a multi-layer attention normalization module. Experimental results show that the proposed method generates high-quality, stable stylized results for images and videos, and meets real-time requirements while producing more artistic and aesthetically pleasing output.
{"title":"Multi-layer feature fusion based image style transfer with arbitrary text condition","authors":"Yue Yu, Jingshuo Xing, Nengli Li","doi":"10.1016/j.image.2024.117243","DOIUrl":"10.1016/j.image.2024.117243","url":null,"abstract":"<div><div>Style transfer refers to the conversion of images in two different domains. Compared with the style transfer based on the style image, the image style transfer through the text description is more free and applicable to more practical scenarios. However, the image style transfer method under the text condition needs to be trained and optimized for different text and image inputs each time, resulting in limited style transfer efficiency. Therefore, this paper proposes a multi-layer feature fusion based style transfer method (MlFFST) with arbitrary text condition. To address the problems of distortion and missing semantic content, we also introduce a multi-layer attention normalization module. The experimental results show that the method in this paper can generate stylized results with high quality, good effect and high stability for images and videos. And this method can meet real-time requirements to generate more artistic and aesthetic images and videos.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"132 ","pages":"Article 117243"},"PeriodicalIF":3.4,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143148376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-28 | DOI: 10.1016/j.image.2024.117246
Shiqi Liu, Qiding Lu, Shengkui Dai
The Histogram Equalization (HE) algorithm remains a research hotspot in the field of image enhancement due to its computational simplicity. Despite numerous improvements to HE algorithms, few comprehensively account for all of its major drawbacks. To address this issue, this paper proposes a novel histogram equalization framework that offers an adaptive and systematic solution. Firstly, a novel mathematical optimization model is proposed to find the optimal control parameters for modifying the histogram. Additionally, a new visual prior, termed the Narrow Dynamic Prior (NDP), is summarized; it describes the subjective perceptual characteristics of the Human Visual System (HVS) for certain special types of images. This new prior is then organically integrated with the model to expand the application scope of HE. Lastly, unlike common brightness preservation algorithms, a novel method for brightness estimation and precise brightness control is proposed. Experimental results demonstrate that the proposed equalization framework significantly mitigates the major drawbacks of HE and achieves a notably better balance between the contrast, brightness and detail of the output image. Both objective evaluation metrics and subjective visual perception indicate that the proposed algorithm outperforms the competing algorithms selected in this paper.
{"title":"Adaptive histogram equalization framework based on new visual prior and optimization model","authors":"Shiqi Liu, Qiding Lu, Shengkui Dai","doi":"10.1016/j.image.2024.117246","DOIUrl":"10.1016/j.image.2024.117246","url":null,"abstract":"<div><div>Histogram Equalization (HE) algorithm remains one of the research hotspots in the field of image enhancement due to its computational simplicity. Despite numerous improvements made to HE algorithms, few can comprehensively account for all major drawbacks of HE. To address this issue, this paper proposes a novel histogram equalization framework, which is an adaptive and systematic resolution. Firstly, a novel optimization mathematical model is proposed to seek the optimal controlling parameters for modifying the histogram. Additionally, a new visual prior knowledge, termed Narrow Dynamic Prior (NDP), is summarized, which describes and reveals the subjective perceptual characteristics of the Human Visual System (HVS) for some special types of images. Then, this new knowledge is organically integrated with the new model to expand the application scope of HE. Lastly, unlike common brightness preservation algorithms, a novel method for brightness estimation and precise control is proposed. Experimental results demonstrate that the proposed equalization framework significantly mitigates the major drawbacks of HE, achieving notable advancements in striking a balance between contrast, brightness and detail of the output image. Both objective evaluation metrics and subjective visual perception indicate that the proposed algorithm outperforms other excellent competition algorithms selected in this paper.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"132 ","pages":"Article 117246"},"PeriodicalIF":3.4,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143148375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-28 | DOI: 10.1016/j.image.2024.117245
Luheng Jia, Haoqiang Ren, Zuhai Zhang, Li Song, Kebin Jia
Rate control in video coding seeks a trade-off between bitrate and reconstruction quality, which is closely tied to image quality assessment. The widely used mean squared error (MSE) is inadequate for describing human visual characteristics; therefore, rate control algorithms based on MSE often fail to deliver optimal visual quality. To address this issue, we propose a frame-level rate control algorithm that uses a simplified version of visual information fidelity (VIF) as the quality assessment criterion to improve coding efficiency. Firstly, we simplify the VIF and establish its relationship with MSE, which reduces the computational complexity enough to make VIF usable within a video coding framework. Then we establish the relationship between the VIF-based λ and the MSE-based λ for λ-domain rate control, including bit allocation and parameter adjustment. Moreover, using the VIF-based λ directly integrates VIF-based distortion into the MSE-based rate-distortion optimized coding framework. Experimental results demonstrate that the coding efficiency of the proposed method outperforms the default frame-level rate control algorithm under the distortion metrics of PSNR, SSIM, and VMAF by 3.4%, 4.0% and 3.3% on average. Furthermore, the proposed method reduces the quality fluctuation of the reconstructed video at high bitrates and improves the bitrate accuracy under the hierarchical configuration.
{"title":"Visual information fidelity based frame level rate control for H.265/HEVC","authors":"Luheng Jia , Haoqiang Ren , Zuhai Zhang , Li Song , Kebin Jia","doi":"10.1016/j.image.2024.117245","DOIUrl":"10.1016/j.image.2024.117245","url":null,"abstract":"<div><div>Rate control in video coding seeks for various trade-off between bitrate and reconstruction quality, which is closely tied to image quality assessment. The widely used measurement of mean squared error (MSE) is inadequate in describing human visual characteristics, therefore, rate control algorithms based on MSE often fail to deliver optimal visual quality. To address this issue, we propose a frame level rate control algorithm based on a simplified version of visual information fidelity (VIF) as the quality assessment criterion to improve coding efficiency. Firstly, we simplify the VIF and establish its relationship with MSE, which reduce the computational complexity to make it possible for VIF to be used in video coding framework. Then we establish the relationship between VIF-based <span><math><mi>λ</mi></math></span> and MSE-based <span><math><mi>λ</mi></math></span> for <span><math><mi>λ</mi></math></span>-domain rate control including bit allocation and parameter adjustment. Moreover, using VIF-based <span><math><mi>λ</mi></math></span> directly integrates VIF-based distortion into the MSE-based rate–distortion optimized coding framework. Experimental results demonstrate that the coding efficiency of the proposed method outperforms the default frame-level rate control algorithms under distortion metrics of PSNR, SSIM, and VMAF by 3.4<span><math><mtext>%</mtext></math></span>, 4.0<span><math><mtext>%</mtext></math></span> and 3.3<span><math><mtext>%</mtext></math></span> in average. Furthermore, the proposed method reduces the quality fluctuation of the reconstructed video at high bitrate range and improves the bitrate accuracy under hierarchical configuration .</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117245"},"PeriodicalIF":3.4,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-23 | DOI: 10.1016/j.image.2024.117244
Hanbo Wu, Xin Ma, Yibin Li
Spatiotemporal feature modeling is the key to human action recognition. Multiview data helps acquire numerous cues that improve the robustness and accuracy of feature description; however, multiview action recognition has not been well explored, and most existing methods perform action recognition from a single view, which limits performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information about the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video onto three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes the multiview depth dynamic volumes as input, extracts spatiotemporal features of the three views with a 3D CNN, and then applies the self-attention mechanism of the transformer to model global context dependencies within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features into a multiview joint feature representation. Extensive experiments on two large-scale RGB-D datasets demonstrate that the proposed method remarkably improves recognition performance.
{"title":"Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos","authors":"Hanbo Wu, Xin Ma, Yibin Li","doi":"10.1016/j.image.2024.117244","DOIUrl":"10.1016/j.image.2024.117244","url":null,"abstract":"<div><div>Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117244"},"PeriodicalIF":3.4,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-02 | DOI: 10.1016/j.image.2024.117225
Yuqi Fan, Han Ye, Xiaohui Yuan
Laryngoscopy is a popular examination for diagnosing vocal cord diseases. The conventional screening of laryngoscopic images is labor-intensive and depends heavily on the experience of medical specialists, so automatic detection of vocal cord diseases from laryngoscopic images is highly sought after to assist routine image reading. In laryngoscopic images, the symptoms of vocal cord diseases are concentrated on the inner vocal cord contour, often appearing as vegetations and small protuberances. Existing classification methods pay little, if any, attention to the role of the vocal cord contour in the diagnosis of vocal cord diseases and fail to effectively capture fine-grained features. In this paper, we propose a novel Local Fine-grained Contour Feature extraction method for vocal cord anomaly detection. The proposed method consists of four stages: image segmentation to obtain the overall vocal cord contour; inner vocal cord contour isolation, which obtains the inner contour curve by comparing changes in adjacent pixel values; extraction of the latent feature of the inner vocal cord contour, taking the tangent inclination angle at each point on the contour as the latent feature; and the classification module. Experimental results demonstrate that the proposed method improves the detection of vocal cord anomalies and achieves an accuracy of 97.21%.
{"title":"Vocal cord anomaly detection based on Local Fine-Grained Contour Features","authors":"Yuqi Fan , Han Ye , Xiaohui Yuan","doi":"10.1016/j.image.2024.117225","DOIUrl":"10.1016/j.image.2024.117225","url":null,"abstract":"<div><div>Laryngoscopy is a popular examination for vocal cord disease diagnosis. The conventional screening of laryngoscopic images is labor-intensive and depends heavily on the experience of the medical specialists. Automatic detection of vocal cord diseases from laryngoscopic images is highly sought to assist regular image reading. In laryngoscopic images, the symptoms of vocal cord diseases are concentrated in the inner vocal cord contour, which is often characterized as vegetation and small protuberances. The existing classification methods pay little, if any, attention to the role of vocal cord contour in the diagnosis of vocal cord diseases and fail to effectively capture the fine-grained features. In this paper, we propose a novel Local Fine-grained Contour Feature extraction method for vocal cord anomaly detection. Our proposed method consists of four stages: image segmentation to obtain the overall vocal cord contour, inner vocal cord contour isolation to obtain the inner contour curve by comparing the changes of adjacent pixel values, extraction of the latent feature in the inner vocal cord contour by taking the tangent inclination angle of each point on the contour as the latent feature, and the classification module. Our experimental results demonstrate that the proposed method improves the detection performance of vocal cord anomaly and achieves an accuracy of 97.21%.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117225"},"PeriodicalIF":3.4,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, a dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze and Excitation (SE) principle. KD is achieved from a larger network trained on unhazed clear images, whereas attention is incorporated at low-level and high-level features of the network. Experimental results have shown remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capabilities makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.
{"title":"SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions","authors":"Yassine Bouafia , Mohand Saïd Allili , Loucif Hebbache , Larbi Guezouli","doi":"10.1016/j.image.2024.117223","DOIUrl":"10.1016/j.image.2024.117223","url":null,"abstract":"<div><div>Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, a dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze and Excitation (SE) principle. KD is achieved from a larger network trained on unhazed clear images, whereas attention is incorporated at low-level and high-level features of the network. Experimental results have shown remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capabilities makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117223"},"PeriodicalIF":3.4,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-29 | DOI: 10.1016/j.image.2024.117224
Dongzhou Gu, Kaihua Huang, Shiwei Ma, Jiang Liu
Effective detection of Human-Object Interaction (HOI) is important for machine understanding of real-world scenarios. Image-based HOI detection has been extensively investigated, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporally aware interaction actions from static images, since only limited temporal context is available. Meanwhile, due to the lack of early large-scale video HOI datasets and the high computational cost of training spatial-temporal HOI models, recent exploratory studies mostly follow a two-stage paradigm, in which independent object detection and interaction recognition still suffer from computational redundancy and separate optimization. Therefore, inspired by the one-stage interaction point detection framework, a one-stage spatial-temporal HOI detection baseline is proposed in this paper, in which short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and a DLA-TSM backbone. Complementary visual features between multiple clips are then extracted by multi-feature fusion and fed into parallel detection branches. Finally, a video dataset containing only actions, with a reduced data size (HOI-V), is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments verify the validity of the proposed baseline.
{"title":"HOI-V: One-stage human-object interaction detection based on multi-feature fusion in videos","authors":"Dongzhou Gu , Kaihua Huang , Shiwei Ma , Jiang Liu","doi":"10.1016/j.image.2024.117224","DOIUrl":"10.1016/j.image.2024.117224","url":null,"abstract":"<div><div>Effective detection of Human-Object Interaction (HOI) is important for machine understanding of real-world scenarios. Nowadays, image-based HOI detection has been abundantly investigated, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporal-aware interaction actions from static images since limited temporal context information is introduced. Meanwhile, due to the lack of early large-scale video HOI datasets and the high computational cost of spatial-temporal HOI model training, recent exploratory studies mostly follow a two-stage paradigm, but independent object detection and interaction recognition still suffer from computational redundancy and independent optimization. Therefore, inspired by the one-stage interaction point detection framework, a one-stage spatial-temporal HOI detection baseline is proposed in this paper, in which the short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and DLA-TSM backbone. Complementary visual features between multiple clips are then extracted by multi-feature fusion and fed into the parallel detection branches. Finally, a video dataset containing only actions with reduced data size (HOI-V) is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments are also conducted to verify the validity of our proposed baseline.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117224"},"PeriodicalIF":3.4,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}