Signal Processing-Image Communication最新文献

Graph-based image captioning with semantic and spatial features

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2025-01-15 DOI: 10.1016/j.image.2025.117273

Mohammad Javad Parseh, Saeed Ghadiri

Image captioning is a challenging task of image processing that aims to generate descriptive and accurate textual descriptions for images. In this paper, we propose a novel image captioning framework that leverages the power of spatial and semantic relationships between objects in an image, in addition to traditional visual features. Our approach integrates a pre-trained model, RelTR, as a backbone for extracting object bounding boxes and subject-predicate-object relationship pairs. We use these extracted relationships to construct spatial and semantic graphs, which are processed through separate Graph Convolutional Networks (GCNs) to obtain high-level contextualized features. At the same time, a CNN model is employed to extract visual features from the input image. To merge the feature vectors seamlessly, our approach involves using a multi-modal attention mechanism that is applied separately to the feature maps of the image, the nodes of the semantic graph, and the nodes of the spatial graph during each time step of the LSTM-based decoder. The model concatenates the attended features with the word embedding at the respective time step and fed into the LSTM cell. Our experiments demonstrate the effectiveness of our proposed approach, which competes closely with existing state-of-the-art image captioning techniques by capturing richer contextual information and generating accurate and semantically meaningful captions.

{"title":"Graph-based image captioning with semantic and spatial features","authors":"Mohammad Javad Parseh, Saeed Ghadiri","doi":"10.1016/j.image.2025.117273","DOIUrl":"10.1016/j.image.2025.117273","url":null,"abstract":"<div><div>Image captioning is a challenging task of image processing that aims to generate descriptive and accurate textual descriptions for images. In this paper, we propose a novel image captioning framework that leverages the power of spatial and semantic relationships between objects in an image, in addition to traditional visual features. Our approach integrates a pre-trained model, RelTR, as a backbone for extracting object bounding boxes and subject-predicate-object relationship pairs. We use these extracted relationships to construct spatial and semantic graphs, which are processed through separate Graph Convolutional Networks (GCNs) to obtain high-level contextualized features. At the same time, a CNN model is employed to extract visual features from the input image. To merge the feature vectors seamlessly, our approach involves using a multi-modal attention mechanism that is applied separately to the feature maps of the image, the nodes of the semantic graph, and the nodes of the spatial graph during each time step of the LSTM-based decoder. The model concatenates the attended features with the word embedding at the respective time step and fed into the LSTM cell. Our experiments demonstrate the effectiveness of our proposed approach, which competes closely with existing state-of-the-art image captioning techniques by capturing richer contextual information and generating accurate and semantically meaningful captions.</div><div>© 2025 Elsevier Inc. All rights reserved.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"133 ","pages":"Article 117273"},"PeriodicalIF":3.4,"publicationDate":"2025-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143093314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Visual information fidelity based frame level rate control for H.265/HEVC 基于视觉信息保真度的H.265/HEVC帧级速率控制

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-11-28 DOI: 10.1016/j.image.2024.117245

Luheng Jia , Haoqiang Ren , Zuhai Zhang , Li Song , Kebin Jia

Rate control in video coding seeks for various trade-off between bitrate and reconstruction quality, which is closely tied to image quality assessment. The widely used measurement of mean squared error (MSE) is inadequate in describing human visual characteristics, therefore, rate control algorithms based on MSE often fail to deliver optimal visual quality. To address this issue, we propose a frame level rate control algorithm based on a simplified version of visual information fidelity (VIF) as the quality assessment criterion to improve coding efficiency. Firstly, we simplify the VIF and establish its relationship with MSE, which reduce the computational complexity to make it possible for VIF to be used in video coding framework. Then we establish the relationship between VIF-based

λ

and MSE-based

λ

for

λ

-domain rate control including bit allocation and parameter adjustment. Moreover, using VIF-based

λ

directly integrates VIF-based distortion into the MSE-based rate–distortion optimized coding framework. Experimental results demonstrate that the coding efficiency of the proposed method outperforms the default frame-level rate control algorithms under distortion metrics of PSNR, SSIM, and VMAF by 3.4

%

, 4.0

%

and 3.3

%

in average. Furthermore, the proposed method reduces the quality fluctuation of the reconstructed video at high bitrate range and improves the bitrate accuracy under hierarchical configuration .

视频编码中的速率控制需要在比特率和重建质量之间进行各种权衡，这与图像质量评估密切相关。广泛使用的均方误差（MSE）测量方法不足以描述人类的视觉特征，因此，基于MSE的速率控制算法往往无法提供最佳的视觉质量。为了解决这个问题，我们提出了一种基于简化版视觉信息保真度（VIF）作为质量评估标准的帧级速率控制算法，以提高编码效率。首先，我们对VIF进行了简化，建立了VIF与MSE的关系，降低了计算复杂度，使VIF能够应用于视频编码框架；然后，我们建立了基于vif的λ和基于mse的λ之间的关系，用于λ域速率控制，包括位分配和参数调整。此外，使用基于vif的λ直接将基于vif的失真集成到基于mse的率失真优化编码框架中。实验结果表明，在PSNR、SSIM和VMAF等失真指标下，该方法的编码效率平均比默认帧级速率控制算法高3.4%、4.0%和3.3%。此外，该方法降低了重构视频在高码率范围内的质量波动，提高了分层配置下的码率精度。

{"title":"Visual information fidelity based frame level rate control for H.265/HEVC","authors":"Luheng Jia , Haoqiang Ren , Zuhai Zhang , Li Song , Kebin Jia","doi":"10.1016/j.image.2024.117245","DOIUrl":"10.1016/j.image.2024.117245","url":null,"abstract":"<div><div>Rate control in video coding seeks for various trade-off between bitrate and reconstruction quality, which is closely tied to image quality assessment. The widely used measurement of mean squared error (MSE) is inadequate in describing human visual characteristics, therefore, rate control algorithms based on MSE often fail to deliver optimal visual quality. To address this issue, we propose a frame level rate control algorithm based on a simplified version of visual information fidelity (VIF) as the quality assessment criterion to improve coding efficiency. Firstly, we simplify the VIF and establish its relationship with MSE, which reduce the computational complexity to make it possible for VIF to be used in video coding framework. Then we establish the relationship between VIF-based <span><math><mi>λ</mi></math></span> and MSE-based <span><math><mi>λ</mi></math></span> for <span><math><mi>λ</mi></math></span>-domain rate control including bit allocation and parameter adjustment. Moreover, using VIF-based <span><math><mi>λ</mi></math></span> directly integrates VIF-based distortion into the MSE-based rate–distortion optimized coding framework. Experimental results demonstrate that the coding efficiency of the proposed method outperforms the default frame-level rate control algorithms under distortion metrics of PSNR, SSIM, and VMAF by 3.4<span><math><mtext>%</mtext></math></span>, 4.0<span><math><mtext>%</mtext></math></span> and 3.3<span><math><mtext>%</mtext></math></span> in average. Furthermore, the proposed method reduces the quality fluctuation of the reconstructed video at high bitrate range and improves the bitrate accuracy under hierarchical configuration .</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117245"},"PeriodicalIF":3.4,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos 基于变换的多视点时空特征交互融合深度视频人体动作识别

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-11-23 DOI: 10.1016/j.image.2024.117244

Hanbo Wu, Xin Ma, Yibin Li

Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.

时空特征建模是人体动作识别任务的关键。多视图数据有助于获取大量线索，提高特征描述的鲁棒性和准确性。然而，多视点动作识别还没有得到很好的探索。现有的方法大多只从单个视图进行动作识别，这导致了性能的限制。深度数据对光照和颜色变化不敏感，提供可靠的人体三维几何信息，具有显著的优势。在本研究中，我们将重点放在深度视频的动作识别上，引入一种基于transform的多视点时空特征交互融合框架，通过深度融合多视点信息实现有效的动作识别。具体而言，该框架包括视图内时空特征建模（ISTFM）和跨视图特征交互融合（CFIF）。首先，我们将深度视频投影到三个正交视图中，构建描述人类行为三维时空演变的多视图深度动态体。ISTFM以多视图深度动态体为输入，利用3D CNN提取三个视图的时空特征，然后利用transformer中的自关注机制来模拟每个视图内的全局上下文依赖关系。CFIF随后将自注意扩展为交叉注意，在不同视图之间进行深度交互，并进一步将交叉视图特征整合在一起，生成多视图联合特征表示。在两个大规模的RGBD数据集上进行了大量的实验，证明了该方法在提高识别性能方面的显著改进。

{"title":"Transformer-based multiview spatiotemporal feature interactive fusion for human action recognition in depth videos","authors":"Hanbo Wu, Xin Ma, Yibin Li","doi":"10.1016/j.image.2024.117244","DOIUrl":"10.1016/j.image.2024.117244","url":null,"abstract":"<div><div>Spatiotemporal feature modeling is the key to human action recognition task. Multiview data is helpful in acquiring numerous clues to improve the robustness and accuracy of feature description. However, multiview action recognition has not been well explored yet. Most existing methods perform action recognition only from a single view, which leads to the limited performance. Depth data is insensitive to illumination and color variations and offers significant advantages by providing reliable 3D geometric information of the human body. In this study, we concentrate on action recognition from depth videos and introduce a transformer-based framework for the interactive fusion of multiview spatiotemporal features, facilitating effective action recognition through deep integration of multiview information. Specifically, the proposed framework consists of intra-view spatiotemporal feature modeling (ISTFM) and cross-view feature interactive fusion (CFIF). Firstly, we project a depth video into three orthogonal views to construct multiview depth dynamic volumes that describe the 3D spatiotemporal evolution of human actions. ISTFM takes multiview depth dynamic volumes as input to extract spatiotemporal features of three views with 3D CNN, then applies self-attention mechanism in transformer to model global context dependency within each view. CFIF subsequently extends self-attention into cross-attention to conduct deep interaction between different views, and further integrates cross-view features together to generate a multiview joint feature representation. Our proposed method is tested on two large-scale RGBD datasets by extensive experiments to demonstrate the remarkable improvement for enhancing the recognition performance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117244"},"PeriodicalIF":3.4,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Vocal cord anomaly detection based on Local Fine-Grained Contour Features 基于局部精细轮廓特征的声带异常检测

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-11-02 DOI: 10.1016/j.image.2024.117225

Yuqi Fan , Han Ye , Xiaohui Yuan

Laryngoscopy is a popular examination for vocal cord disease diagnosis. The conventional screening of laryngoscopic images is labor-intensive and depends heavily on the experience of the medical specialists. Automatic detection of vocal cord diseases from laryngoscopic images is highly sought to assist regular image reading. In laryngoscopic images, the symptoms of vocal cord diseases are concentrated in the inner vocal cord contour, which is often characterized as vegetation and small protuberances. The existing classification methods pay little, if any, attention to the role of vocal cord contour in the diagnosis of vocal cord diseases and fail to effectively capture the fine-grained features. In this paper, we propose a novel Local Fine-grained Contour Feature extraction method for vocal cord anomaly detection. Our proposed method consists of four stages: image segmentation to obtain the overall vocal cord contour, inner vocal cord contour isolation to obtain the inner contour curve by comparing the changes of adjacent pixel values, extraction of the latent feature in the inner vocal cord contour by taking the tangent inclination angle of each point on the contour as the latent feature, and the classification module. Our experimental results demonstrate that the proposed method improves the detection performance of vocal cord anomaly and achieves an accuracy of 97.21%.

喉镜检查是诊断声带疾病的常用检查方法。传统的喉镜图像筛查工作耗费大量人力物力，而且在很大程度上依赖于医学专家的经验。从喉镜图像中自动检测声带疾病以协助常规图像读取的做法备受青睐。在喉镜图像中，声带疾病的症状主要集中在声带内部轮廓，通常表现为植被和小突起。现有的分类方法很少关注声带轮廓在声带疾病诊断中的作用，也无法有效捕捉其细粒度特征。本文提出了一种用于声带异常检测的新型局部细粒度轮廓特征提取方法。我们提出的方法包括四个阶段：图像分割，得到声带整体轮廓；声带内轮廓隔离，通过比较相邻像素值的变化得到声带内轮廓曲线；提取声带内轮廓的潜特征，以轮廓上各点的正切倾角作为潜特征；以及分类模块。实验结果表明，所提出的方法提高了声带异常的检测性能，准确率达到 97.21%。

{"title":"Vocal cord anomaly detection based on Local Fine-Grained Contour Features","authors":"Yuqi Fan , Han Ye , Xiaohui Yuan","doi":"10.1016/j.image.2024.117225","DOIUrl":"10.1016/j.image.2024.117225","url":null,"abstract":"<div><div>Laryngoscopy is a popular examination for vocal cord disease diagnosis. The conventional screening of laryngoscopic images is labor-intensive and depends heavily on the experience of the medical specialists. Automatic detection of vocal cord diseases from laryngoscopic images is highly sought to assist regular image reading. In laryngoscopic images, the symptoms of vocal cord diseases are concentrated in the inner vocal cord contour, which is often characterized as vegetation and small protuberances. The existing classification methods pay little, if any, attention to the role of vocal cord contour in the diagnosis of vocal cord diseases and fail to effectively capture the fine-grained features. In this paper, we propose a novel Local Fine-grained Contour Feature extraction method for vocal cord anomaly detection. Our proposed method consists of four stages: image segmentation to obtain the overall vocal cord contour, inner vocal cord contour isolation to obtain the inner contour curve by comparing the changes of adjacent pixel values, extraction of the latent feature in the inner vocal cord contour by taking the tangent inclination angle of each point on the contour as the latent feature, and the classification module. Our experimental results demonstrate that the proposed method improves the detection performance of vocal cord anomaly and achieves an accuracy of 97.21%.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"131 ","pages":"Article 117225"},"PeriodicalIF":3.4,"publicationDate":"2024-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions SES-ReNet：用于雾霾天气条件下人体检测的轻量级深度学习模型

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-30 DOI: 10.1016/j.image.2024.117223

Yassine Bouafia , Mohand Saïd Allili , Loucif Hebbache , Larbi Guezouli

Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, a dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze and Excitation (SE) principle. KD is achieved from a larger network trained on unhazed clear images, whereas attention is incorporated at low-level and high-level features of the network. Experimental results have shown remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capabilities makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.

准确检测室外场景中的人员对改善人身安全和安保起着至关重要的作用。然而，当能见度降低、人的外观退化时，尤其是在雾霾天气条件下，现有的人体检测算法面临着巨大挑战。为解决这一问题，我们提出了一种基于 RetinaNet 检测架构的新型轻量级模型。该模型包含一个轻量级骨干特征提取器、一个基于知识提炼（KD）的去毛刺功能和一个基于挤压和激励（SE）原理的多尺度关注机制。知识蒸馏是通过在未去毛刺的清晰图像上训练的大型网络来实现的，而注意力则被纳入网络的低级和高级特征中。实验结果表明，该方法性能卓越，在以 22 FPS 的速度运行时，性能优于最先进的方法。高准确度和实时性的结合使我们的方法成为在具有挑战性的天气条件下进行有效人体检测的一种有前途的解决方案，并适用于实时应用。

{"title":"SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions","authors":"Yassine Bouafia , Mohand Saïd Allili , Loucif Hebbache , Larbi Guezouli","doi":"10.1016/j.image.2024.117223","DOIUrl":"10.1016/j.image.2024.117223","url":null,"abstract":"<div><div>Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, a dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze and Excitation (SE) principle. KD is achieved from a larger network trained on unhazed clear images, whereas attention is incorporated at low-level and high-level features of the network. Experimental results have shown remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capabilities makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117223"},"PeriodicalIF":3.4,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

HOI-V: One-stage human-object interaction detection based on multi-feature fusion in videos HOI-V：基于视频多特征融合的单阶段人-物互动检测

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-29 DOI: 10.1016/j.image.2024.117224

Dongzhou Gu , Kaihua Huang , Shiwei Ma , Jiang Liu

Effective detection of Human-Object Interaction (HOI) is important for machine understanding of real-world scenarios. Nowadays, image-based HOI detection has been abundantly investigated, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporal-aware interaction actions from static images since limited temporal context information is introduced. Meanwhile, due to the lack of early large-scale video HOI datasets and the high computational cost of spatial-temporal HOI model training, recent exploratory studies mostly follow a two-stage paradigm, but independent object detection and interaction recognition still suffer from computational redundancy and independent optimization. Therefore, inspired by the one-stage interaction point detection framework, a one-stage spatial-temporal HOI detection baseline is proposed in this paper, in which the short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and DLA-TSM backbone. Complementary visual features between multiple clips are then extracted by multi-feature fusion and fed into the parallel detection branches. Finally, a video dataset containing only actions with reduced data size (HOI-V) is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments are also conducted to verify the validity of our proposed baseline.

有效的人-物交互（HOI）检测对于机器理解真实世界场景非常重要。目前，基于图像的 HOI 检测已得到广泛研究，最近的单阶段方法在准确性和效率之间取得了平衡。然而，由于引入的时间上下文信息有限，因此很难从静态图像中预测时间感知的交互行为。同时，由于早期大规模视频 HOI 数据集的缺乏以及时空 HOI 模型训练的计算成本较高，近年来的探索性研究大多采用两阶段范式，但独立的对象检测和交互识别仍存在计算冗余和独立优化的问题。因此，受单级交互点检测框架的启发，本文提出了单级时空 HOI 检测基线，其中短期局部运动特征和长期时空上下文特征由提出的时差激励模块（TDEM）和 DLA-TSM 骨干模块获得。然后，通过多特征融合提取多个片段之间的互补视觉特征，并将其输入并行检测分支。最后，我们构建了一个视频数据集，其中只包含数据量较小的动作（HOI-V），以激励对端到端视频 HOI 检测的进一步研究。我们还进行了广泛的实验，以验证我们提出的基线的有效性。

{"title":"HOI-V: One-stage human-object interaction detection based on multi-feature fusion in videos","authors":"Dongzhou Gu , Kaihua Huang , Shiwei Ma , Jiang Liu","doi":"10.1016/j.image.2024.117224","DOIUrl":"10.1016/j.image.2024.117224","url":null,"abstract":"<div><div>Effective detection of Human-Object Interaction (HOI) is important for machine understanding of real-world scenarios. Nowadays, image-based HOI detection has been abundantly investigated, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporal-aware interaction actions from static images since limited temporal context information is introduced. Meanwhile, due to the lack of early large-scale video HOI datasets and the high computational cost of spatial-temporal HOI model training, recent exploratory studies mostly follow a two-stage paradigm, but independent object detection and interaction recognition still suffer from computational redundancy and independent optimization. Therefore, inspired by the one-stage interaction point detection framework, a one-stage spatial-temporal HOI detection baseline is proposed in this paper, in which the short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and DLA-TSM backbone. Complementary visual features between multiple clips are then extracted by multi-feature fusion and fed into the parallel detection branches. Finally, a video dataset containing only actions with reduced data size (HOI-V) is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments are also conducted to verify the validity of our proposed baseline.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117224"},"PeriodicalIF":3.4,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

High efficiency deep image compression via channel-wise scale adaptive latent representation learning 通过信道尺度自适应潜表征学习实现高效深度图像压缩

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-28 DOI: 10.1016/j.image.2024.117227

Chenhao Wu, Qingbo Wu, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu

Recent learning based neural image compression methods have achieved impressive rate–distortion (RD) performance via the sophisticated context entropy model, which performs well in capturing the spatial correlations of latent features. However, due to the dependency on the adjacent or distant decoded features, existing methods require an inefficient serial processing structure, which significantly limits its practicability. Instead of pursuing computationally expensive entropy estimation, we propose to reduce the spatial redundancy via the channel-wise scale adaptive latent representation learning, whose entropy coding is spatially context-free and parallelizable. Specifically, the proposed encoder adaptively determines the scale of the latent features via a learnable binary mask, which is optimized with the RD cost. In this way, lower-scale latent representation will be allocated to the channels with higher spatial redundancy, which consumes fewer bits and vice versa. The downscaled latent features could be well recovered with a lightweight inter-channel upconversion module in the decoder. To compensate for the entropy estimation performance degradation, we further develop an inter-scale hyperprior entropy model, which supports the high efficiency parallel encoding/decoding within each scale of the latent features. Extensive experiments are conducted to illustrate the efficacy of the proposed method. Our method achieves bitrate savings of 18.23%, 19.36%, and 27.04% over HEVC Intra, along with decoding speeds that are 46 times, 48 times, and 51 times faster than the baseline method on the Kodak, Tecnick, and CLIC datasets, respectively.

近期基于学习的神经图像压缩方法通过复杂的上下文熵模型实现了令人印象深刻的速率-失真（RD）性能，该模型在捕捉潜在特征的空间相关性方面表现出色。然而，由于依赖于相邻或相距较远的解码特征，现有方法需要低效的串行处理结构，这大大限制了其实用性。我们建议通过信道尺度自适应潜表征学习来减少空间冗余，而不是追求计算成本高昂的熵估计，其熵编码是无空间上下文且可并行处理的。具体来说，建议的编码器通过可学习的二进制掩码自适应地确定潜在特征的尺度，并根据 RD 成本对其进行优化。这样，较低尺度的潜在表示将分配给空间冗余度较高的信道，从而减少比特消耗，反之亦然。解码器中的轻量级信道间上变频模块可以很好地恢复降尺度潜特征。为了弥补熵估计性能的下降，我们进一步开发了一种跨尺度超优先熵模型，它支持在潜特征的每个尺度内进行高效的并行编码/解码。我们进行了大量实验，以说明所提方法的功效。在柯达、Tecnick 和 CLIC 数据集上，我们的方法比 HEVC Intra 分别节省了 18.23%、19.36% 和 27.04% 的比特率，解码速度比基准方法分别快 46 倍、48 倍和 51 倍。

{"title":"High efficiency deep image compression via channel-wise scale adaptive latent representation learning","authors":"Chenhao Wu, Qingbo Wu, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu","doi":"10.1016/j.image.2024.117227","DOIUrl":"10.1016/j.image.2024.117227","url":null,"abstract":"<div><div>Recent learning based neural image compression methods have achieved impressive rate–distortion (RD) performance via the sophisticated context entropy model, which performs well in capturing the spatial correlations of latent features. However, due to the dependency on the adjacent or distant decoded features, existing methods require an inefficient serial processing structure, which significantly limits its practicability. Instead of pursuing computationally expensive entropy estimation, we propose to reduce the spatial redundancy via the channel-wise scale adaptive latent representation learning, whose entropy coding is spatially context-free and parallelizable. Specifically, the proposed encoder adaptively determines the scale of the latent features via a learnable binary mask, which is optimized with the RD cost. In this way, lower-scale latent representation will be allocated to the channels with higher spatial redundancy, which consumes fewer bits and vice versa. The downscaled latent features could be well recovered with a lightweight inter-channel upconversion module in the decoder. To compensate for the entropy estimation performance degradation, we further develop an inter-scale hyperprior entropy model, which supports the high efficiency parallel encoding/decoding within each scale of the latent features. Extensive experiments are conducted to illustrate the efficacy of the proposed method. Our method achieves bitrate savings of 18.23%, 19.36%, and 27.04% over HEVC Intra, along with decoding speeds that are 46 times, 48 times, and 51 times faster than the baseline method on the Kodak, Tecnick, and CLIC datasets, respectively.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117227"},"PeriodicalIF":3.4,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142579042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Text in the dark: Extremely low-light text image enhancement 黑暗中的文字：极低照度下的文字图像增强

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-28 DOI: 10.1016/j.image.2024.117222

Che-Tsung Lin , Chun Chet Ng , Zhi Qin Tan , Wan Jun Nah , Xinyu Wang , Jie Long Kew , Pohao Hsu , Shang Hong Lai , Chee Seng Chan , Christopher Zach

Extremely low-light text images pose significant challenges for scene text detection. Existing methods enhance these images using low-light image enhancement techniques before text detection. However, they fail to address the importance of low-level features, which are essential for optimal performance in downstream scene text tasks. Further research is also limited by the scarcity of extremely low-light text datasets. To address these limitations, we propose a novel, text-aware extremely low-light image enhancement framework. Our approach first integrates a Text-Aware Copy-Paste (Text-CP) augmentation method as a preprocessing step, followed by a dual-encoder–decoder architecture enhanced with Edge-Aware attention modules. We also introduce text detection and edge reconstruction losses to train the model to generate images with higher text visibility. Additionally, we propose a Supervised Deep Curve Estimation (Supervised-DCE) model for synthesizing extremely low-light images, allowing training on publicly available scene text datasets such as IC15. To further advance this domain, we annotated texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets. The proposed framework is rigorously tested against various traditional and deep learning-based methods on the newly labeled SID-Sony-Text, SID-Fuji-Text, LOL-Text, and synthetic extremely low-light IC15 datasets. Our extensive experiments demonstrate notable improvements in both image enhancement and scene text tasks, showcasing the model’s efficacy in text detection under extremely low-light conditions. Code and datasets will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.

极低照度下的文本图像给场景文本检测带来了巨大挑战。现有方法在文本检测前使用低照度图像增强技术来增强这些图像。然而，这些方法未能解决低层次特征的重要性问题，而低层次特征对于优化下游场景文本任务的性能至关重要。此外，极低照度文本数据集的缺乏也限制了进一步的研究。为了解决这些局限性，我们提出了一种新颖的、文本感知的极低照度图像增强框架。我们的方法首先集成了文本感知复制-粘贴（Text-CP）增强方法作为预处理步骤，然后采用边缘感知注意模块增强的双编码器-解码器架构。我们还引入了文本检测和边缘重建损失，以训练模型生成具有更高文本可见性的图像。此外，我们还提出了一种用于合成极低光图像的监督深度曲线估计（Supervised-DCE）模型，允许在 IC15 等公开可用的场景文本数据集上进行训练。为了进一步推动这一领域的发展，我们在极低照度的 "See In the Dark"（SID）和普通的 "LOw-Light"（LOL）数据集中对文本进行了注释。在新标注的 SID-Sony-文本、SID-Fuji-文本、LOL-文本和合成的极低照度 IC15 数据集上，我们针对各种传统方法和基于深度学习的方法对所提出的框架进行了严格测试。我们的大量实验表明，该模型在图像增强和场景文本任务方面都有显著改进，展示了该模型在极弱光条件下进行文本检测的功效。代码和数据集将在 https://github.com/chunchet-ng/Text-in-the-Dark 上公开发布。

{"title":"Text in the dark: Extremely low-light text image enhancement","authors":"Che-Tsung Lin , Chun Chet Ng , Zhi Qin Tan , Wan Jun Nah , Xinyu Wang , Jie Long Kew , Pohao Hsu , Shang Hong Lai , Chee Seng Chan , Christopher Zach","doi":"10.1016/j.image.2024.117222","DOIUrl":"10.1016/j.image.2024.117222","url":null,"abstract":"<div><div>Extremely low-light text images pose significant challenges for scene text detection. Existing methods enhance these images using low-light image enhancement techniques before text detection. However, they fail to address the importance of low-level features, which are essential for optimal performance in downstream scene text tasks. Further research is also limited by the scarcity of extremely low-light text datasets. To address these limitations, we propose a novel, text-aware extremely low-light image enhancement framework. Our approach first integrates a Text-Aware Copy-Paste (Text-CP) augmentation method as a preprocessing step, followed by a dual-encoder–decoder architecture enhanced with Edge-Aware attention modules. We also introduce text detection and edge reconstruction losses to train the model to generate images with higher text visibility. Additionally, we propose a Supervised Deep Curve Estimation (Supervised-DCE) model for synthesizing extremely low-light images, allowing training on publicly available scene text datasets such as IC15. To further advance this domain, we annotated texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets. The proposed framework is rigorously tested against various traditional and deep learning-based methods on the newly labeled SID-Sony-Text, SID-Fuji-Text, LOL-Text, and synthetic extremely low-light IC15 datasets. Our extensive experiments demonstrate notable improvements in both image enhancement and scene text tasks, showcasing the model’s efficacy in text detection under extremely low-light conditions. Code and datasets will be released publicly at <span><span>https://github.com/chunchet-ng/Text-in-the-Dark</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117222"},"PeriodicalIF":3.4,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142579041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Double supervision for scene text detection and recognition based on BMINet 基于 BMINet 的场景文本检测和识别双重监督

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-26 DOI: 10.1016/j.image.2024.117226

Hanyang Wan, Ruoyun Liu, Li Yu

Scene text detection and recognition currently stand as prominent research areas in computer vision, boasting a broad spectrum of potential applications in fields such as intelligent driving and automated production. Existing mainstream methodologies, however, suffer from notable deficiencies including incomplete text region detection, excessive background noise, and a neglect of simultaneous global information and contextual dependencies. In this study, we introduce BMINet, an innovative scene text detection approach based on boundary fitting, paired with a double-supervised scene text recognition method that incorporates text region correction. The BMINet framework is primarily structured around a boundary fitting module and a multi-scale fusion module. The boundary fitting module samples a specific number of control points equidistantly along the predicted boundary and adjusts their positions to better align the detection box with the text shape. The multi-scale fusion module integrates information from multi-scale feature maps to expand the network’s receptive field. The double-supervised scene text recognition method, incorporating text region correction, integrates the image processing modules for rotating rectangle boxes and binary image segmentation. Additionally, it introduces a correction network to refine text region boundaries. This method integrates recognition techniques based on CTC loss and attention mechanisms, emphasizing texture details and contextual dependencies in text images to enhance network performance through dual supervision. Extensive ablation and comparison experiments confirm the efficacy of the two-stage model in achieving robust detection and recognition outcomes, achieving a recognition accuracy of 80.6% on the Total-Text dataset.

场景文本检测和识别目前是计算机视觉领域的重要研究领域，在智能驾驶和自动化生产等领域有着广泛的潜在应用。然而，现有的主流方法都存在明显的缺陷，包括文本区域检测不完整、背景噪声过大以及忽略全局信息和上下文相关性。在本研究中，我们介绍了基于边界拟合的创新性场景文本检测方法 BMINet，以及结合文本区域校正的双重监督场景文本识别方法。BMINet 框架主要由边界拟合模块和多尺度融合模块组成。边界拟合模块沿着预测的边界等距采样特定数量的控制点，并调整它们的位置，使检测框与文本形状更好地对齐。多尺度融合模块整合来自多尺度特征图的信息，以扩大网络的感受野。结合文本区域校正的双重监督场景文本识别方法集成了旋转矩形框和二值图像分割的图像处理模块。此外，它还引入了一个校正网络来细化文本区域边界。该方法整合了基于 CTC 损失和注意力机制的识别技术，强调文本图像中的纹理细节和上下文依赖关系，通过双重监督提高网络性能。广泛的消融和对比实验证实了两阶段模型在实现稳健检测和识别结果方面的功效，在 Total-Text 数据集上实现了 80.6% 的识别准确率。

{"title":"Double supervision for scene text detection and recognition based on BMINet","authors":"Hanyang Wan, Ruoyun Liu, Li Yu","doi":"10.1016/j.image.2024.117226","DOIUrl":"10.1016/j.image.2024.117226","url":null,"abstract":"<div><div>Scene text detection and recognition currently stand as prominent research areas in computer vision, boasting a broad spectrum of potential applications in fields such as intelligent driving and automated production. Existing mainstream methodologies, however, suffer from notable deficiencies including incomplete text region detection, excessive background noise, and a neglect of simultaneous global information and contextual dependencies. In this study, we introduce BMINet, an innovative scene text detection approach based on boundary fitting, paired with a double-supervised scene text recognition method that incorporates text region correction. The BMINet framework is primarily structured around a boundary fitting module and a multi-scale fusion module. The boundary fitting module samples a specific number of control points equidistantly along the predicted boundary and adjusts their positions to better align the detection box with the text shape. The multi-scale fusion module integrates information from multi-scale feature maps to expand the network’s receptive field. The double-supervised scene text recognition method, incorporating text region correction, integrates the image processing modules for rotating rectangle boxes and binary image segmentation. Additionally, it introduces a correction network to refine text region boundaries. This method integrates recognition techniques based on CTC loss and attention mechanisms, emphasizing texture details and contextual dependencies in text images to enhance network performance through dual supervision. Extensive ablation and comparison experiments confirm the efficacy of the two-stage model in achieving robust detection and recognition outcomes, achieving a recognition accuracy of 80.6% on the Total-Text dataset.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117226"},"PeriodicalIF":3.4,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A new two-stage low-light enhancement network with progressive attention fusion strategy 采用渐进式注意力融合策略的新型两级低照度增强网络

IF 3.4 3区工程技术 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

Signal Processing-Image Communication

Pub Date : 2024-10-26 DOI: 10.1016/j.image.2024.117229

Hegui Zhu , Luyang Wang , Zhan Gao , Yuelin Liu , Qian Zhao

Low-light image enhancement is a very challenging subject in the field of computer vision such as visual surveillance, driving behavior analysis, and medical imaging . It has a large number of degradation problems such as accumulated noise, artifacts, and color distortion. Therefore, how to solve the degradation problems and obtain clear images with high visual quality has become an important issue. It can effectively improve the performance of high-level computer vision tasks. In this study, we propose a new two-stage low-light enhancement network with a progressive attention fusion strategy, and the two hallmarks of this method are the use of global feature fusion (GFF) and local detail restoration (LDR), which can enrich the global content of the image and restore local details. Experimental results on the LOL dataset show that the proposed model can achieve good enhancement effects. Moreover, on the benchmark dataset without reference images, the proposed model also obtains a better NIQE score, which outperforms most existing state-of-the-art methods in both quantitative and qualitative evaluations. All these verify the effectiveness and superiority of the proposed method.

低照度图像增强是视觉监控、驾驶行为分析和医学成像等计算机视觉领域一个极具挑战性的课题。它存在大量的劣化问题，如累积噪声、伪像和色彩失真。因此，如何解决退化问题，获得视觉质量高的清晰图像已成为一个重要问题。它可以有效提高高级计算机视觉任务的性能。在本研究中，我们提出了一种新的两阶段低照度增强网络，采用渐进式注意力融合策略，该方法的两大特点是使用全局特征融合（GFF）和局部细节还原（LDR），既能丰富图像的全局内容，又能还原局部细节。在 LOL 数据集上的实验结果表明，所提出的模型可以达到良好的增强效果。此外，在没有参考图像的基准数据集上，所提出的模型也获得了较好的 NIQE 分数，在定量和定性评估中均优于大多数现有的先进方法。所有这些都验证了所提出方法的有效性和优越性。

{"title":"A new two-stage low-light enhancement network with progressive attention fusion strategy","authors":"Hegui Zhu , Luyang Wang , Zhan Gao , Yuelin Liu , Qian Zhao","doi":"10.1016/j.image.2024.117229","DOIUrl":"10.1016/j.image.2024.117229","url":null,"abstract":"<div><div>Low-light image enhancement is a very challenging subject in the field of computer vision such as visual surveillance, driving behavior analysis, and medical imaging . It has a large number of degradation problems such as accumulated noise, artifacts, and color distortion. Therefore, how to solve the degradation problems and obtain clear images with high visual quality has become an important issue. It can effectively improve the performance of high-level computer vision tasks. In this study, we propose a new two-stage low-light enhancement network with a progressive attention fusion strategy, and the two hallmarks of this method are the use of global feature fusion (GFF) and local detail restoration (LDR), which can enrich the global content of the image and restore local details. Experimental results on the LOL dataset show that the proposed model can achieve good enhancement effects. Moreover, on the benchmark dataset without reference images, the proposed model also obtains a better NIQE score, which outperforms most existing state-of-the-art methods in both quantitative and qualitative evaluations. All these verify the effectiveness and superiority of the proposed method.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117229"},"PeriodicalIF":3.4,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142579043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0